Multi-GPU and multi-node distributed training

Important

This feature is in Beta.

This page provides notebook examples for multi-node and multi-GPU distributed training on serverless GPU compute. The examples demonstrate how to scale training across multiple GPUs and nodes for improved performance.

Before running these notebooks, see the Best practices checklist.

Serverless GPU API: A10 starter

The following notebook provides a basic example of using the Serverless GPU Python API to launch distributed training across multiple A10 GPUs.
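For orientation, the following is a minimal sketch of that pattern. The `@distributed` decorator, its `gpus`, `gpu_type`, and `remote` parameters, and the `.distributed()` launch call are assumptions based on this page's description; the notebook is the authoritative reference for the exact API.

```python
import os
import torch
from serverless_gpu import distributed  # Serverless GPU Python API

# Assumed decorator parameters (gpus, gpu_type, remote); check the
# starter notebook for the exact signature.
@distributed(gpus=4, gpu_type='A10', remote=True)
def hello_gpu():
    # Each worker runs this body on its own A10 GPU; the launcher sets
    # the usual distributed environment variables such as RANK.
    rank = int(os.environ.get("RANK", "0"))
    print(f"worker {rank}: CUDA available = {torch.cuda.is_available()}")

hello_gpu.distributed()  # launch the workers and wait for completion
```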

Notebook

Get notebook

Distributed training using MLflow 3.0

This notebook introduces best practices for using MLflow 3.0 on Databricks for deep learning workloads on serverless GPU compute. It uses the Serverless GPU API to launch distributed training of a simple classification model on a remote A10 GPU, tracking the training as an MLflow run.
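The tracking pattern reduces to standard MLflow calls around a training loop, as in the sketch below. The model, synthetic data, and experiment path are illustrative placeholders, not the notebook's actual workload.

```python
import mlflow
import torch
from torch import nn

def train_and_track():
    # Toy classification model and synthetic data, stand-ins for the
    # notebook's real workload.
    model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
    X, y = torch.randn(256, 20), torch.randint(0, 2, (256,))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    mlflow.set_experiment("/Shared/serverless-gpu-demo")  # hypothetical path
    with mlflow.start_run():
        mlflow.log_params({"lr": 1e-3, "epochs": 5})
        for epoch in range(5):
            opt.zero_grad()
            loss = loss_fn(model(X), y)
            loss.backward()
            opt.step()
            mlflow.log_metric("train_loss", loss.item(), step=epoch)
        # Log the trained model as an artifact of the run.
        mlflow.pytorch.log_model(model, name="model")

train_and_track()
```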

Notebook

Get notebook

Distributed training using PyTorch's Distributed Data Parallel (DDP)

The following notebook demonstrates distributed training of a simple multilayer perceptron (MLP) neural network using PyTorch's Distributed Data Parallel (DDP) module on Azure Databricks with serverless GPU compute.
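Independent of how the worker processes are launched, the core DDP recipe looks like the following sketch: each rank joins the NCCL process group, pins one GPU, wraps the model in `DistributedDataParallel`, and shards the data with `DistributedSampler`. The MLP and synthetic data are illustrative stand-ins for the notebook's model.

```python
import os
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def train_ddp():
    # The launcher (for example, the Serverless GPU API or torchrun) sets
    # RANK, LOCAL_RANK, and WORLD_SIZE in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Simple MLP, a stand-in for the notebook's model.
    model = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 10)).cuda()
    model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 10, (1024,)))
    sampler = DistributedSampler(dataset)  # each rank sees a distinct shard
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(3):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for X, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(X.cuda()), y.cuda())
            loss.backward()  # gradients are all-reduced across ranks here
            opt.step()

    dist.destroy_process_group()
```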

Notebook

Get notebook

Distributed training using PyTorch's Fully Sharded Data Parallel (FSDP)

The following notebook demonstrates distributed training of a Transformer model with 10 million parameters using PyTorch's Fully Sharded Data Parallel (FSDP) module on Azure Databricks with serverless GPU compute.
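Conceptually, FSDP swaps the DDP wrapper for a sharded one, so parameters, gradients, and optimizer state are partitioned across ranks rather than replicated. The sketch below uses a toy Transformer encoder and a dummy objective as stand-ins for the notebook's 10-million-parameter model.

```python
import os
import torch
import torch.distributed as dist
from torch import nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def train_fsdp():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Small Transformer encoder, a stand-in for the notebook's model.
    layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
    model = nn.TransformerEncoder(layer, num_layers=6).cuda()

    # FSDP shards parameters, gradients, and optimizer state across ranks.
    model = FSDP(model)

    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for step in range(10):
        x = torch.randn(8, 128, 256, device="cuda")  # (batch, seq, d_model)
        loss = model(x).pow(2).mean()  # dummy objective for illustration
        opt.zero_grad()
        loss.backward()
        opt.step()

    dist.destroy_process_group()
```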

Notebook

Get notebook

Distributed training using Ray

This notebook demonstrates distributed training of a PyTorch ResNet model on the FashionMNIST dataset using Ray Train and Ray Data on Databricks Serverless GPU clusters. It covers setting up Unity Catalog storage, configuring Ray for multi-node GPU training, logging and registering models with MLflow, and evaluating model performance.
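At its core, Ray Train wraps an ordinary PyTorch training function: `prepare_model` and `prepare_data_loader` handle device placement and DDP wrapping, and `TorchTrainer` with a `ScalingConfig` fans the function out across GPU workers. The sketch below substitutes a toy model and synthetic tensors for the notebook's ResNet and FashionMNIST pipeline.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func(config):
    # Toy model and data, stand-ins for the notebook's ResNet/FashionMNIST.
    model = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))
    model = ray.train.torch.prepare_model(model)  # moves to GPU, wraps in DDP

    dataset = TensorDataset(torch.randn(512, 1, 28, 28), torch.randint(0, 10, (512,)))
    loader = DataLoader(dataset, batch_size=config["batch_size"], shuffle=True)
    loader = ray.train.torch.prepare_data_loader(loader)  # adds a DistributedSampler

    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(2):
        for X, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(X), y)
            loss.backward()
            opt.step()
        ray.train.report({"loss": loss.item()})  # surface metrics to the driver

trainer = TorchTrainer(
    train_func,
    train_loop_config={"batch_size": 64},
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
)
result = trainer.fit()
```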

Notebook

Get notebook

Distributed supervised fine-tuning using TRL

This notebook demonstrates how to use the Serverless GPU Python API to run supervised fine-tuning (SFT) with the TRL library and DeepSpeed ZeRO Stage 3 optimization on a single A10 GPU node. The approach can be extended to multi-node setups.
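Stripped down, the TRL side is an `SFTTrainer` whose `SFTConfig` points at a DeepSpeed ZeRO Stage 3 JSON file; the launcher then starts one process per GPU. The model id, dataset, and config path below are illustrative assumptions, not the notebook's exact choices.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Small instruction-tuning dataset used in TRL examples; swap in your own.
dataset = load_dataset("trl-lib/Capybara", split="train")

args = SFTConfig(
    output_dir="/tmp/sft-demo",              # illustrative path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    deepspeed="ds_zero3_config.json",        # path to a ZeRO Stage 3 JSON config (assumed)
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",               # small illustrative model id
    train_dataset=dataset,
    args=args,
)
trainer.train()
```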

Notebook

Get notebook

Distributed training of OpenAI gpt-oss 20B on 8 H100 GPUs using TRL and DDP

This notebook demonstrates how to use the Serverless GPU Python API to run supervised fine-tuning (SFT) on the gpt-oss 20B model from Hugging Face using the TRL library. DDP is used across all 8 H100 GPUs on the node to scale the global batch size.
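As a rough sketch of the recipe, the launch might look like the following. The `@distributed` decorator parameters are assumptions based on this page's description, and the dataset and paths are illustrative; the arithmetic comment shows how DDP scales the global batch size across the 8 ranks.

```python
from serverless_gpu import distributed  # Serverless GPU Python API

# Assumed decorator parameters; see the notebook for the exact launch API.
@distributed(gpus=8, gpu_type='H100', remote=True)
def sft_gpt_oss_20b():
    from datasets import load_dataset
    from trl import SFTConfig, SFTTrainer

    # Under DDP, global batch = per_device_batch * grad_accum * 8 ranks
    # = 2 * 4 * 8 = 64 sequences per optimizer step in this sketch.
    args = SFTConfig(
        output_dir="/tmp/gpt-oss-20b-sft",   # illustrative path
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        bf16=True,
    )
    trainer = SFTTrainer(
        model="openai/gpt-oss-20b",          # Hugging Face model id
        train_dataset=load_dataset("trl-lib/Capybara", split="train"),
        args=args,
    )
    trainer.train()

sft_gpt_oss_20b.distributed()
```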

Notebook

Get notebook

Distributed training of OpenAI gpt-oss 120B on 8 H100 GPUs using TRL and FSDP

This notebook demonstrates how to use the Serverless GPU Python API to run supervised fine-tuning (SFT) on the gpt-oss 120B model from Hugging Face using the TRL library. FSDP is used to reduce memory consumption, and data parallelism across the 8 GPUs scales the global batch size.
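Relative to the 20B recipe, the key change is enabling FSDP through the trainer arguments so that weights, gradients, and optimizer state are sharded across the 8 GPUs. The `fsdp` and `fsdp_config` fields are standard `transformers` `TrainingArguments` options; the model id, dataset, and paths below are illustrative.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

args = SFTConfig(
    output_dir="/tmp/gpt-oss-120b-sft",      # illustrative path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
    # Shard parameters, gradients, and optimizer state across all ranks.
    fsdp="full_shard auto_wrap",
    fsdp_config={"backward_prefetch": "backward_pre"},
)

trainer = SFTTrainer(
    model="openai/gpt-oss-120b",             # Hugging Face model id
    train_dataset=load_dataset("trl-lib/Capybara", split="train"),
    args=args,
)
trainer.train()
```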

Notebook

Get notebook