Note
Important
This feature is in Beta.
This article describes serverless GPU compute on Databricks and provides recommended use cases, guidance for how to set up GPU compute resources, and feature limitations.
What is serverless GPU compute?
Serverless GPU compute is part of the Serverless compute offering. Serverless GPU compute is specialized for custom single and multi-node deep learning workloads. You can use serverless GPU compute to train and fine-tune custom models using your favorite frameworks and get state-of-the-art efficiency, performance, and quality.
Serverless GPU compute includes:
- An integrated experience across Notebooks, Unity Catalog, and MLflow: You can develop your code interactively using Notebooks.
- A10 GPU accelerators: A10 GPUs are designed to accelerate small to medium machine learning and deep learning workloads, including classic ML models and fine-tuning smaller language models. A10s are well-suited for tasks with moderate computational requirements.
- Multi-GPU and multi-node support: You can run distributed training workloads across multiple GPUs and multiple nodes using the Serverless GPU Python API. See Distributed training.
The pre-installed packages on serverless GPU compute are not a replacement for Databricks Runtime ML. While there are common packages, not all Databricks Runtime ML dependencies and libraries are reflected in the serverless GPU compute environment.
Python environments on Serverless GPU compute
Databricks provides two managed environments to serve different use cases.
Note
Custom base environments are not supported for serverless GPU compute. Instead, use the default or AI environment, and specify additional dependencies directly in the Environment side panel or install them with pip.
Default base environment
This provides a minimal environment with a stable client API to ensure application compatibility. Only required Python packages are installed, which allows Databricks to upgrade the server independently and deliver performance improvements, security enhancements, and bug fixes without requiring code changes to your workloads. Choose this environment if you want to fully customize the environment for your training.
For more details, see the release notes.
AI environment
The Databricks AI environment is available in serverless GPU environment 4. The AI environment is built on top of the default environment and adds common runtime packages and packages specific to machine learning on GPUs. It contains popular machine learning libraries, including PyTorch, langchain, transformers, ray, and XGBoost, for model training and inference. Choose this environment if you plan to run training workloads.
For more details, see the release notes.
Recommended use cases
Databricks recommends serverless GPU compute for any model training use case that requires training customizations and GPUs.
For example:
- LLM Fine-tuning
- Computer vision
- Recommender systems
- Reinforcement learning
- Deep-learning-based time series forecasting
Requirements
- A workspace in one of the following Azure-supported regions:
  - eastus
  - eastus2
  - centralus
  - northcentralus
  - westcentralus
  - westus
Set up serverless GPU compute
To connect your notebook to serverless GPU compute and configure the environment:
- From a notebook, click the Connect drop-down menu at the top and select Serverless GPU.
- Click the gear icon to open the Environment side panel.
- Select A10 from the Accelerator field.
- Select None for the default environment or AI v4 for the AI environment from the Base environment field.
- If you chose None from the Base environment field, select the Environment version.
- Click Apply and then Confirm that you want to apply the serverless GPU compute to your notebook environment.
Note
The connection to your compute auto-terminates after 60 minutes of inactivity.
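After connecting, you can verify that the notebook is attached to GPU compute. The following is a minimal check using PyTorch, which is preinstalled in the AI environment; if you chose the default environment, install PyTorch first:

```python
# Confirm that the notebook is attached to GPU compute
import torch

print(torch.cuda.is_available())      # True when a GPU is attached
print(torch.cuda.device_count())      # number of visible GPUs
print(torch.cuda.get_device_name(0))  # device name, for example an A10
```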
Add libraries to the environment
You can install additional libraries to the serverless GPU compute environment. See Add dependencies to the notebook.
Note
Adding dependencies using the Environment side panel, as described in Add dependencies to the notebook, is not supported for serverless GPU compute scheduled jobs.
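For example, you can install additional packages inline with the %pip magic command; the package names below are placeholders:

```python
# Install additional libraries into the notebook environment.
# Package names here are placeholders; choose packages and versions
# compatible with your selected environment version.
%pip install peft bitsandbytes
```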
Create and schedule a job
The following steps show how to create and schedule jobs for your serverless GPU compute workloads. See Create and manage scheduled notebook jobs for more details.
After you open the notebook you want to use:
- Select the Schedule button on the top right.
- Select Add schedule.
- Populate the New schedule form with the Job name, Schedule, and Compute.
- Select Create.
You can also create and schedule jobs from the Jobs and pipelines UI. See Create a new job for step-by-step guidance.
Distributed training
You can launch distributed training across multiple GPUs, either within a single node or across multiple nodes, using the Serverless GPU Python API. The API provides a simple, unified interface that abstracts away the details of GPU provisioning, environment setup, and workload distribution. With minimal code changes, you can seamlessly move from single-GPU training to distributed execution across remote GPUs from the same notebook.
The @distributed decorator works much like launching multi-node training with torchrun, but in pure Python. For example, the following snippet distributes the hello_world function across 8 remote A10 GPUs:

```python
# Import the distributed decorator
from serverless_gpu import distributed

# Decorate the function with @distributed, specifying the number of GPUs,
# the GPU type, and whether the GPUs are remote
@distributed(gpus=8, gpu_type='A10', remote=True)
def hello_world(name: str) -> None:
    print('hello', name)

# Trigger the distributed execution of the hello_world function
hello_world.distributed('world')
```
When executed, logs and outputs from all workers are collected and surfaced in the Experiment section of your workspace.
The API supports popular parallel training libraries such as Distributed Data Parallel (DDP), Fully Sharded Data Parallel (FSDP), DeepSpeed, and Ray.
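For example, the following is a minimal DDP sketch built on the @distributed decorator shown above. The model and training loop are placeholders, and it assumes the launcher provides torchrun-style environment variables such as LOCAL_RANK, consistent with the torchrun comparison above:

```python
# A minimal DDP sketch on top of the @distributed decorator shown above.
# The model and data are placeholders; it assumes the launcher provides
# torchrun-style environment variables (RANK, WORLD_SIZE, LOCAL_RANK).
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from serverless_gpu import distributed

@distributed(gpus=8, gpu_type='A10', remote=True)
def train() -> None:
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(10, 1).to(f"cuda:{local_rank}")  # placeholder model
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    for _ in range(10):  # placeholder training loop
        inputs = torch.randn(32, 10, device=f"cuda:{local_rank}")
        loss = ddp_model(inputs).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()

# Trigger the distributed execution across the remote GPUs
train.distributed()
```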
Start by importing the starter notebook to get hands-on with the API, then explore the notebook examples to see how it’s used in real distributed training scenarios using the various libraries.
For full details, refer to the Serverless GPU Python API documentation.
Limitations
- Serverless GPU compute only supports A10 accelerators.
- Private Link is not supported, including storage and pip repositories behind Private Link.
- Serverless GPU compute is not supported for compliance security profile workspaces (like HIPAA or PCI). Processing regulated data is not supported at this time.
- For scheduled jobs on Serverless GPU compute, auto recovery behavior for incompatible package versions that are associated with your notebook is not supported.
- The maximum runtime for a workload is seven days. For model training jobs that exceed this limit, implement checkpointing and restart the job once the maximum runtime is reached, as shown in the sketch below.
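If you use the Hugging Face Trainer, resuming from the last saved checkpoint after a restart can look like the following minimal sketch; model, train_dataset, and training_args are placeholders for the objects from your original run:

```python
# Minimal sketch: resume a transformers Trainer run from its latest checkpoint.
# `model`, `train_dataset`, and `training_args` are placeholders for the
# objects used in the original run.
from transformers import Trainer

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)

# With resume_from_checkpoint=True, Trainer loads the newest checkpoint
# found in training_args.output_dir and continues from that step.
trainer.train(resume_from_checkpoint=True)
```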
Troubleshoot issues on Serverless GPU Compute
If you encounter problems running workloads on Serverless GPU compute, see the troubleshooting guide for common issues, workarounds, and support resources.
Best practices checklist
Before you run a notebook using serverless GPU compute, check the following:
[ ] Environment: Ensure your libraries and packages are compatible with your selected serverless environment version.
[ ] Checkpoint storage: Save checkpoints to DBFS, or leave the location unspecified to let MLflow default to DBFS.
- [ ] Avoid using /Workspace, which has a 500 MB per-file size limit.
- [ ] Verify checkpointing sooner, for example after 50 steps instead of 1 epoch.
[ ] MLflow logging: Set the logger step parameter to a sufficiently large number of batches to avoid logging every batch (the default) and exceeding the 1M metric step limit.
[ ] Multi-node launch: Add retries or a longer timeout to avoid barrier timeout issues; see the timeout sketch after the code example below.
The following code shows how to implement these best practices:
```python
# Settings for a quick trial run to verify logging and checkpointing
# If using transformers
from transformers import TrainingArguments

training_args = TrainingArguments(
    # Checkpoint to a Unity Catalog volume if no symlinks are created
    output_dir="/Volumes/your_catalog/your_schema/your_vol/your_model",
    logging_strategy="steps",
    logging_steps=10,  # avoid exceeding the MLflow 1M metric step limit
    # Save checkpoints earlier, after 100 steps, to verify checkpointing
    save_strategy="steps",
    save_steps=100,
    # End the trial run earlier, after 200 steps, to verify logging and checkpointing
    max_steps=200,
    # ... other TrainingArguments
)
```
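For the multi-node launch item in the checklist above, one option is to raise the collective-communication timeout when initializing the process group. This is a minimal sketch using the standard torch.distributed API; the 60-minute value is an illustrative assumption, not a recommended default:

```python
# Raise the process group timeout to reduce barrier timeout failures
# during slow multi-node startup. Call this inside your distributed
# training function before any collective operations. The 60-minute
# value is illustrative.
from datetime import timedelta

import torch.distributed as dist

dist.init_process_group(backend="nccl", timeout=timedelta(minutes=60))
```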
Notebook examples
The following notebook examples demonstrate how to use serverless GPU compute for different tasks.
| Task | Description |
|---|---|
| Large language models (LLMs) | Examples for fine-tuning large language models including parameter-efficient methods like Low-Rank Adaptation (LoRA) and supervised fine-tuning approaches. |
| Computer vision | Examples for computer vision tasks including object detection and image classification. |
| Deep learning based recommender systems | Examples for building recommendation systems using modern deep learning approaches like two-tower models. |
| Classic ML | Examples for traditional machine learning tasks including XGBoost model training and time series forecasting. |
| Multi-GPU and multi-node distributed training | Examples for scaling training across multiple GPUs and nodes using the Serverless GPU API, including distributed fine-tuning. |
The following notebook examples demonstrate how to use various distributed training libraries on serverless GPU compute for multi-GPU training.
| Library | Description |
|---|---|
| Distributed Data Parallel (DDP) | Examples for training models using distributed data parallelism. |
| Fully Sharded Data Parallel (FSDP) | Examples for training models using fully sharded data parallelism. |
| DeepSpeed | Examples for training models using optimizations from DeepSpeed library. |
| Ray | Examples for training models using Ray library. |