Modern deep learning models reach hundreds of millions to hundreds of billions of parameters — training them on a single GPU is no longer realistic. Databricks provides an integrated environment for running distributed training with PyTorch, TensorFlow, DeepSpeed, and more directly from a notebook, leveraging the Spark cluster infrastructure. This article walks through four frameworks — TorchDistributor, HorovodRunner, Ray, and DeepSpeed — covering distributed strategy selection, GPU configuration, and MLflow integration from both practical and certification perspectives.
Distributed training strategies fall into two broad categories. The choice depends on the relationship between model size and GPU memory.
Place a full copy of the model on each GPU, split the training data, and have each GPU process a different mini-batch. Each GPU runs the forward and backward passes independently, then gradients are aggregated via AllReduce to keep parameters in sync. If the model fits in a single GPU's memory, data parallelism is the first choice.
データ並列の概念図:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ GPU 0 │ │ GPU 1 │ │ GPU 2 │
│ Model Copy │ │ Model Copy │ │ Model Copy │
│ Batch 0-99 │ │ Batch 100-199│ │ Batch 200-299│
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │
└────────┬───────┴────────┬───────┘
│ AllReduce │
│ (勾配集約) │
▼ ▼
パラメータ同期 → 次のイテレーションWhen the model does not fit in a single GPU's memory, the model itself is split across multiple GPUs. Tensor parallelism splits the matrix operations within a single layer; pipeline parallelism assigns layers to GPUs stage by stage. For LLM fine-tuning, 3D parallelism — combining data + tensor + pipeline parallelism — is the mainstream approach.
モデル並列の概念図:
パイプライン並列:
┌──────────┐ ┌──────────┐ ┌──────────┐
│ GPU 0 │───▶│ GPU 1 │───▶│ GPU 2 │
│ Layer 0-7│ │ Layer 8-15│ │Layer 16-23│
└──────────┘ └──────────┘ └──────────┘
テンソル並列 (1レイヤー内):
┌────────────────────────────┐
│ Weight Matrix │
│ ┌──────┬──────┬──────┐ │
│ │GPU 0 │GPU 1 │GPU 2 │ │
│ │Col0-3│Col4-7│Col8-11│ │
│ └──────┴──────┴──────┘ │
└────────────────────────────┘TorchDistributor is an API introduced in Databricks Runtime 13.0+ for running PyTorch DDP on a Spark cluster. It uses Spark task scheduling to place PyTorch processes on worker nodes and uses the NCCL backend for GPU-to-GPU communication. Traditional PyTorch DDP required a torchrun command or launcher script, but TorchDistributor lets you launch distributed training directly from a notebook, dramatically reducing setup overhead.
from pyspark.ml.torch.distributor import TorchDistributor
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler
def train_fn():
dist.init_process_group("nccl")
rank = dist.get_rank()
device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")
model = nn.Sequential(
nn.Linear(784, 512), nn.ReLU(),
nn.Linear(512, 256), nn.ReLU(),
nn.Linear(256, 10)
).to(device)
model = DDP(model, device_ids=[device])
dataset = load_training_data() # 訓練データの読み込み
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=64, sampler=sampler)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for epoch in range(10):
sampler.set_epoch(epoch)
for batch_x, batch_y in loader:
batch_x, batch_y = batch_x.to(device), batch_y.to(device)
loss = loss_fn(model(batch_x), batch_y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
dist.destroy_process_group()
return model
# Sparkクラスタ上で4GPU分散訓練を実行
distributor = TorchDistributor(
num_processes=4,
local_mode=False, # True: ドライバノードのみ、False: ワーカーノードに分散
use_gpu=True
)
trained_model = distributor.run(train_fn)local_mode=True runs multi-GPU training on the driver node (useful for debugging), whilelocal_mode=False distributes execution across worker nodes.num_processes specifies the total number of GPUs to use. If your cluster has 2 workers with 2 GPUs each, set num_processes=4.
Horovod is a framework-agnostic distributed training library developed by Uber. It supports PyTorch, TensorFlow, and Keras, and aggregates gradients via MPI-based AllReduce communication. Databricks provides a wrapper API called HorovodRunner that launches Horovod processes in barrier mode on a Spark cluster.
import horovod.torch as hvd
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, DistributedSampler
def train_hvd():
hvd.init()
rank = hvd.rank()
torch.cuda.set_device(hvd.local_rank())
device = torch.device("cuda")
model = nn.Sequential(
nn.Linear(784, 512), nn.ReLU(),
nn.Linear(512, 256), nn.ReLU(),
nn.Linear(256, 10)
).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3 * hvd.size())
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
dataset = load_training_data()
sampler = DistributedSampler(dataset, num_replicas=hvd.size(), rank=hvd.rank())
loader = DataLoader(dataset, batch_size=64, sampler=sampler)
loss_fn = nn.CrossEntropyLoss()
for epoch in range(10):
sampler.set_epoch(epoch)
for batch_x, batch_y in loader:
batch_x, batch_y = batch_x.to(device), batch_y.to(device)
loss = loss_fn(model(batch_x), batch_y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
return model
from sparkdl import HorovodRunner
hr = HorovodRunner(np=4, driver_log_verbosity="all")
trained_model = hr.run(train_hvd)HorovodRunner's np parameter sets the number of processes. Setting np=-1 automatically detects and uses every GPU in the cluster. The main difference from TorchDistributor is that Horovod requires wrapping the optimizer with DistributedOptimizer and explicitly syncing initial parameters via broadcast_parameters. TorchDistributor uses PyTorch's standard DDP wrapper directly, so existing DDP code can be migrated with minimal changes.
Starting with Ray 2.x, Databricks clusters can also operate as Ray clusters.ray.util.spark.setup_ray_cluster() launches Ray workers on top of Spark workers, letting Ray Train and Ray Tune run distributed PyTorch/TensorFlow training together with hyperparameter tuning.
Ray's strength is the integration of training and tuning. Combining Ray Tune's search algorithms (ASHA, PBT, etc.) with Ray Train's distributed training lets you run multiple hyperparameter configurations as parallel distributed training jobs. For pure distributed training, TorchDistributor is simpler to set up, but Ray wins when you need large-scale hyperparameter search alongside it.
Microsoft's DeepSpeed is best known for memory efficiency via the ZeRO optimizer. ZeRO Stage 1 partitions optimizer state, Stage 2 also partitions gradients, and Stage 3 partitions model parameters as well — dramatically reducing GPU memory consumption. On Databricks, you initialize DeepSpeed inside a TorchDistributor training function.
For fine-tuning 70B-parameter LLMs, combining ZeRO Stage 3 with Offload (moving parameters to CPU memory) makes it feasible on clusters as modest as 8x A100 80GB. DeepSpeed controls optimizations via a JSON config file, so you can adjust memory optimization levels and batch sizes without changing your code.
| Item | TorchDistributor | HorovodRunner | Ray Train | DeepSpeed |
|---|---|---|---|---|
| Supported frameworks | PyTorch only | PyTorch / TensorFlow / Keras | PyTorch / TensorFlow / JAX | PyTorch (mainly Transformers) |
| Communication backend | NCCL (GPU) / Gloo (CPU) | MPI / NCCL / Gloo | NCCL / Gloo (uses PyTorch DDP internally) | NCCL (with built-in ZeRO communication optimization) |
| Key advantages | Native Spark integration, minimal setup | Framework-agnostic with a long production track record | Integrates training + tuning | Memory efficiency for very large models |
| Recommended scenarios | New PyTorch DDP projects | Distributed TensorFlow / existing Horovod code | Combined with hyperparameter search | LLM fine-tuning (10B+ parameters) |
Distributed training brings inter-node communication overhead. Going multi-node blindly can make training slower because communication cost becomes the bottleneck. Use the following criteria to decide.
| Decision axis | Single-node multi-GPU | Multi-node distributed |
|---|---|---|
| Model size | Fits in a single node's combined GPU memory | Does not fit in one node (e.g., 70B LLM) |
| Data size | Hundreds of GB or less | TB-scale datasets |
| Training time | Acceptable (hours to a day) | Would take days or more on a single node |
| Communication | NVLink / NVSwitch (fast intra-node communication) | InfiniBand recommended (inter-node communication is the bottleneck) |
| Debuggability | High (contained within one node) | Low (must handle node failures and communication timeouts) |
When configuring a GPU cluster on Databricks, pick a GPU instance type that matches the workload.
| GPU | Memory | FP16 performance | Use cases | Example AWS instance |
|---|---|---|---|---|
| A10G | 24 GB | 125 TFLOPS | Mid-sized model inference, lightweight fine-tuning | g5.xlarge - g5.48xlarge |
| A100 | 40 / 80 GB | 312 TFLOPS | Large-scale training, LLM fine-tuning | p4d.24xlarge (8x A100 40GB) |
| H100 | 80 GB | 990 TFLOPS | Very large LLM training, mixed precision via Transformer Engine | p5.48xlarge (8x H100) |
A10G offers strong cost-performance and is well suited to fine-tuning and inference for sub-7B models. A100 provides the memory bandwidth and TFLOPS needed for serious LLM training and is the standard for distributed training. H100 ships with an FP8-capable Transformer Engine and delivers roughly 3x the training throughput of A100, but at a correspondingly higher cost.
For cluster configuration, choose Databricks Runtime ML (MLR), which comes preinstalled with GPU drivers, CUDA, cuDNN, NCCL, PyTorch, and TensorFlow. For multi-node distributed training, network bandwidth between workers matters: use EFA networking on AWS or InfiniBand-capable instances on Azure to reduce communication bottlenecks.
In distributed training, having multiple processes write to MLflow simultaneously causes duplicated metrics and conflicting runs. The best practice is to only let the rank 0 process log to MLflow.
import mlflow
import torch.distributed as dist
def train_fn():
dist.init_process_group("nccl")
rank = dist.get_rank()
# ... モデル定義・訓練ループ ...
# rank 0 のみが MLflow にログを記録
if rank == 0:
with mlflow.start_run():
mlflow.log_param("num_gpus", dist.get_world_size())
mlflow.log_param("batch_size_per_gpu", 64)
mlflow.log_param("effective_batch_size", 64 * dist.get_world_size())
mlflow.log_metric("final_loss", avg_loss)
mlflow.pytorch.log_model(model.module, "model")
dist.destroy_process_group()model.module is the original model with the DDP wrapper stripped off. Saving the model while still wrapped in DDP requires DDP initialization at inference time, so always unwrap with .module before logging to MLflow. effective_batch_size = GPU count x per-GPU batch size — log it as the rationale for learning rate scaling.
Distributed training is a topic the ML Professional exam covers in depth. Locking down the following points will pay dividends on the exam.
local_mode=True runs locally on the driver node (for debugging); False distributes across worker nodes.np=-1 automatically uses every GPU in the clusterML Professional
問題 1
An ML engineer wants to fine-tune a 13B-parameter Transformer model on a Databricks cluster (4x A100 80GB). The combined memory for all model parameters and optimizer state requires about 200GB of GPU memory. Which approach is most appropriate?
正解: B
Model parameters plus optimizer state require about 200GB, which does not fit on a single 80GB A100. Standard data parallelism (A and C) places a full model copy on each GPU, so 80GB per GPU is not enough. DeepSpeed ZeRO Stage 3 partitions all three — optimizer state, gradients, and parameters — across GPUs: 200GB / 4 = 50GB per GPU, which fits within 80GB. D is hyperparameter search and does not address the model memory problem.
Should I use TorchDistributor or HorovodRunner?
TorchDistributor is recommended for new projects. It is natively supported in Databricks Runtime 13.0+ and runs PyTorch DDP seamlessly on Spark. HorovodRunner supports multiple frameworks (TensorFlow, MXNet, etc.), but Uber has been scaling back active development of Horovod, so for PyTorch-only workloads TorchDistributor is simpler to set up and better officially supported.
When should I use single-node multi-GPU versus distributed multi-node?
The standard pattern is to start with single-node multi-GPU (e.g., a 4xA100 instance) and only scale out to multi-node when GPU memory or throughput becomes a bottleneck. Single-node training has no inter-node communication overhead and is much easier to debug. You need multi-node when the model does not fit in a single node's combined GPU memory, or when data is large enough that you need more batch parallelism.
Which Databricks certifications cover distributed training?
ML Professional covers it in the most depth: when to choose TorchDistributor vs HorovodRunner, the difference between data and model parallelism, and how to track distributed runs in MLflow. ML Associate touches on the basics (why distributed training is needed and how data parallelism works). The GenAI Engineer exam may include DeepSpeed-based LLM fine-tuning as a related topic.
Practice with certification-focused question sets
無料で問題を解いてみるNicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
Databricks Certifications: All 7 Exams, Difficulty & Study Plan (2026)
Complete guide to all 7 Databricks certifications — Data Eng...
Databricks Exam Difficulty Ranking: All 7 Certs Compared (2026)
Every Databricks certification ranked by difficulty, with st...
Databricks Study Guide: Fastest Pass Route & Time Estimates (2026)
How to pass Databricks certifications efficiently. Official ...
Databricks Data Engineer Associate: Complete Guide (2026)
Domain-by-domain breakdown of the Databricks Certified Data ...
Databricks Data Engineer Professional: Complete Guide (2026)
Tactics for the Databricks Certified Data Engineer Professio...