Distributed Training on Databricks: TorchDistributor & Horovod (2026)

Modern deep learning models reach hundreds of millions to hundreds of billions of parameters — training them on a single GPU is no longer realistic. Databricks provides an integrated environment for running distributed training with PyTorch, TensorFlow, DeepSpeed, and more directly from a notebook, leveraging the Spark cluster infrastructure. This article walks through four frameworks — TorchDistributor, HorovodRunner, Ray, and DeepSpeed — covering distributed strategy selection, GPU configuration, and MLflow integration from both practical and certification perspectives.

Data Parallelism vs Model Parallelism

Distributed training strategies fall into two broad categories. The choice depends on the relationship between model size and GPU memory.

Data Parallelism

Place a full copy of the model on each GPU, split the training data, and have each GPU process a different mini-batch. Each GPU runs the forward and backward passes independently, then gradients are aggregated via AllReduce to keep parameters in sync. If the model fits in a single GPU's memory, data parallelism is the first choice.

データ並列の概念図:

┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│   GPU 0     │  │   GPU 1     │  │   GPU 2     │
│ Model Copy  │  │ Model Copy  │  │ Model Copy  │
│ Batch 0-99  │  │ Batch 100-199│ │ Batch 200-299│
└──────┬──────┘  └──────┬──────┘  └──────┬──────┘
       │                │                │
       └────────┬───────┴────────┬───────┘
                │  AllReduce     │
                │  (勾配集約)     │
                ▼                ▼
         パラメータ同期 → 次のイテレーション

Model Parallelism

When the model does not fit in a single GPU's memory, the model itself is split across multiple GPUs. Tensor parallelism splits the matrix operations within a single layer; pipeline parallelism assigns layers to GPUs stage by stage. For LLM fine-tuning, 3D parallelism — combining data + tensor + pipeline parallelism — is the mainstream approach.

モデル並列の概念図:

パイプライン並列:
┌──────────┐    ┌──────────┐    ┌──────────┐
│  GPU 0   │───▶│  GPU 1   │───▶│  GPU 2   │
│ Layer 0-7│    │ Layer 8-15│   │Layer 16-23│
└──────────┘    └──────────┘    └──────────┘

テンソル並列 (1レイヤー内):
┌────────────────────────────┐
│       Weight Matrix        │
│  ┌──────┬──────┬──────┐    │
│  │GPU 0 │GPU 1 │GPU 2 │    │
│  │Col0-3│Col4-7│Col8-11│   │
│  └──────┴──────┴──────┘    │
└────────────────────────────┘

TorchDistributor

TorchDistributor is an API introduced in Databricks Runtime 13.0+ for running PyTorch DDP on a Spark cluster. It uses Spark task scheduling to place PyTorch processes on worker nodes and uses the NCCL backend for GPU-to-GPU communication. Traditional PyTorch DDP required a torchrun command or launcher script, but TorchDistributor lets you launch distributed training directly from a notebook, dramatically reducing setup overhead.

from pyspark.ml.torch.distributor import TorchDistributor
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def train_fn():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")

    model = nn.Sequential(
        nn.Linear(784, 512), nn.ReLU(),
        nn.Linear(512, 256), nn.ReLU(),
        nn.Linear(256, 10)
    ).to(device)
    model = DDP(model, device_ids=[device])

    dataset = load_training_data()  # 訓練データの読み込み
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(10):
        sampler.set_epoch(epoch)
        for batch_x, batch_y in loader:
            batch_x, batch_y = batch_x.to(device), batch_y.to(device)
            loss = loss_fn(model(batch_x), batch_y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()
    return model

# Sparkクラスタ上で4GPU分散訓練を実行
distributor = TorchDistributor(
    num_processes=4,
    local_mode=False,    # True: ドライバノードのみ、False: ワーカーノードに分散
    use_gpu=True
)
trained_model = distributor.run(train_fn)

local_mode=True runs multi-GPU training on the driver node (useful for debugging), whilelocal_mode=False distributes execution across worker nodes.num_processes specifies the total number of GPUs to use. If your cluster has 2 workers with 2 GPUs each, set num_processes=4.

HorovodRunner

Horovod is a framework-agnostic distributed training library developed by Uber. It supports PyTorch, TensorFlow, and Keras, and aggregates gradients via MPI-based AllReduce communication. Databricks provides a wrapper API called HorovodRunner that launches Horovod processes in barrier mode on a Spark cluster.

import horovod.torch as hvd
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, DistributedSampler

def train_hvd():
    hvd.init()
    rank = hvd.rank()
    torch.cuda.set_device(hvd.local_rank())
    device = torch.device("cuda")

    model = nn.Sequential(
        nn.Linear(784, 512), nn.ReLU(),
        nn.Linear(512, 256), nn.ReLU(),
        nn.Linear(256, 10)
    ).to(device)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3 * hvd.size())
    optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)

    dataset = load_training_data()
    sampler = DistributedSampler(dataset, num_replicas=hvd.size(), rank=hvd.rank())
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(10):
        sampler.set_epoch(epoch)
        for batch_x, batch_y in loader:
            batch_x, batch_y = batch_x.to(device), batch_y.to(device)
            loss = loss_fn(model(batch_x), batch_y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    return model

from sparkdl import HorovodRunner
hr = HorovodRunner(np=4, driver_log_verbosity="all")
trained_model = hr.run(train_hvd)

HorovodRunner's np parameter sets the number of processes. Setting np=-1 automatically detects and uses every GPU in the cluster. The main difference from TorchDistributor is that Horovod requires wrapping the optimizer with DistributedOptimizer and explicitly syncing initial parameters via broadcast_parameters. TorchDistributor uses PyTorch's standard DDP wrapper directly, so existing DDP code can be migrated with minimal changes.

Ray on Databricks

Starting with Ray 2.x, Databricks clusters can also operate as Ray clusters.ray.util.spark.setup_ray_cluster() launches Ray workers on top of Spark workers, letting Ray Train and Ray Tune run distributed PyTorch/TensorFlow training together with hyperparameter tuning.

Ray's strength is the integration of training and tuning. Combining Ray Tune's search algorithms (ASHA, PBT, etc.) with Ray Train's distributed training lets you run multiple hyperparameter configurations as parallel distributed training jobs. For pure distributed training, TorchDistributor is simpler to set up, but Ray wins when you need large-scale hyperparameter search alongside it.

DeepSpeed Integration

Microsoft's DeepSpeed is best known for memory efficiency via the ZeRO optimizer. ZeRO Stage 1 partitions optimizer state, Stage 2 also partitions gradients, and Stage 3 partitions model parameters as well — dramatically reducing GPU memory consumption. On Databricks, you initialize DeepSpeed inside a TorchDistributor training function.

For fine-tuning 70B-parameter LLMs, combining ZeRO Stage 3 with Offload (moving parameters to CPU memory) makes it feasible on clusters as modest as 8x A100 80GB. DeepSpeed controls optimizations via a JSON config file, so you can adjust memory optimization levels and batch sizes without changing your code.

Comparison of the 4 Distributed Frameworks

Item	TorchDistributor	HorovodRunner	Ray Train	DeepSpeed
Supported frameworks	PyTorch only	PyTorch / TensorFlow / Keras	PyTorch / TensorFlow / JAX	PyTorch (mainly Transformers)
Communication backend	NCCL (GPU) / Gloo (CPU)	MPI / NCCL / Gloo	NCCL / Gloo (uses PyTorch DDP internally)	NCCL (with built-in ZeRO communication optimization)
Key advantages	Native Spark integration, minimal setup	Framework-agnostic with a long production track record	Integrates training + tuning	Memory efficiency for very large models
Recommended scenarios	New PyTorch DDP projects	Distributed TensorFlow / existing Horovod code	Combined with hyperparameter search	LLM fine-tuning (10B+ parameters)

Single-Node vs Multi-Node Decision Criteria

Distributed training brings inter-node communication overhead. Going multi-node blindly can make training slower because communication cost becomes the bottleneck. Use the following criteria to decide.

Decision axis	Single-node multi-GPU	Multi-node distributed
Model size	Fits in a single node's combined GPU memory	Does not fit in one node (e.g., 70B LLM)
Data size	Hundreds of GB or less	TB-scale datasets
Training time	Acceptable (hours to a day)	Would take days or more on a single node
Communication	NVLink / NVSwitch (fast intra-node communication)	InfiniBand recommended (inter-node communication is the bottleneck)
Debuggability	High (contained within one node)	Low (must handle node failures and communication timeouts)

GPU Selection and Cluster Configuration

When configuring a GPU cluster on Databricks, pick a GPU instance type that matches the workload.

GPU	Memory	FP16 performance	Use cases	Example AWS instance
A10G	24 GB	125 TFLOPS	Mid-sized model inference, lightweight fine-tuning	g5.xlarge - g5.48xlarge
A100	40 / 80 GB	312 TFLOPS	Large-scale training, LLM fine-tuning	p4d.24xlarge (8x A100 40GB)
H100	80 GB	990 TFLOPS	Very large LLM training, mixed precision via Transformer Engine	p5.48xlarge (8x H100)

A10G offers strong cost-performance and is well suited to fine-tuning and inference for sub-7B models. A100 provides the memory bandwidth and TFLOPS needed for serious LLM training and is the standard for distributed training. H100 ships with an FP8-capable Transformer Engine and delivers roughly 3x the training throughput of A100, but at a correspondingly higher cost.

For cluster configuration, choose Databricks Runtime ML (MLR), which comes preinstalled with GPU drivers, CUDA, cuDNN, NCCL, PyTorch, and TensorFlow. For multi-node distributed training, network bandwidth between workers matters: use EFA networking on AWS or InfiniBand-capable instances on Azure to reduce communication bottlenecks.

MLflow Integration (Tracking Distributed Runs)

In distributed training, having multiple processes write to MLflow simultaneously causes duplicated metrics and conflicting runs. The best practice is to only let the rank 0 process log to MLflow.

import mlflow
import torch.distributed as dist

def train_fn():
    dist.init_process_group("nccl")
    rank = dist.get_rank()

    # ... モデル定義・訓練ループ ...

    # rank 0 のみが MLflow にログを記録
    if rank == 0:
        with mlflow.start_run():
            mlflow.log_param("num_gpus", dist.get_world_size())
            mlflow.log_param("batch_size_per_gpu", 64)
            mlflow.log_param("effective_batch_size", 64 * dist.get_world_size())
            mlflow.log_metric("final_loss", avg_loss)
            mlflow.pytorch.log_model(model.module, "model")

    dist.destroy_process_group()

model.module is the original model with the DDP wrapper stripped off. Saving the model while still wrapped in DDP requires DDP initialization at inference time, so always unwrap with .module before logging to MLflow. effective_batch_size = GPU count x per-GPU batch size — log it as the rationale for learning rate scaling.

Key Points for the Exam

Distributed training is a topic the ML Professional exam covers in depth. Locking down the following points will pay dividends on the exam.

TorchDistributor vs HorovodRunner: TorchDistributor is PyTorch-only with native Spark integration; HorovodRunner is framework-agnostic and MPI-based. Choose TorchDistributor for new PyTorch projects, and HorovodRunner for distributed TensorFlow or migrating existing Horovod code.
Data vs model parallelism: Use data parallelism if the model fits on a single GPU, otherwise model parallelism. The exam often asks decisions like "you want to train a 70B LLM on 4x A100" → model parallelism (ZeRO Stage 3) is required.
Meaning of local_mode: TorchDistributor's local_mode=True runs locally on the driver node (for debugging); False distributes across worker nodes.
MLflow logging best practice: only rank 0 writes logs, and unwrap the DDP wrapper before saving the model
Meaning of np=-1: HorovodRunner's np=-1 automatically uses every GPU in the cluster

Check with a Sample Question

ML Professional

問題 1

An ML engineer wants to fine-tune a 13B-parameter Transformer model on a Databricks cluster (4x A100 80GB). The combined memory for all model parameters and optimizer state requires about 200GB of GPU memory. Which approach is most appropriate?

Configure PyTorch DDP data parallelism with TorchDistributor and place a full copy of the model on each of the 4 GPUs.
Inside a TorchDistributor training function, initialize DeepSpeed ZeRO Stage 3 to partition optimizer state, gradients, and parameters across the 4 GPUs.
Use HorovodRunner with np=-1 and train with data parallelism via Horovod's DistributedOptimizer.
Configure hyperparameter search with Ray Tune and run trials in parallel, allocating one GPU per trial.

正解: B

Model parameters plus optimizer state require about 200GB, which does not fit on a single 80GB A100. Standard data parallelism (A and C) places a full model copy on each GPU, so 80GB per GPU is not enough. DeepSpeed ZeRO Stage 3 partitions all three — optimizer state, gradients, and parameters — across GPUs: 200GB / 4 = 50GB per GPU, which fits within 80GB. D is hyperparameter search and does not address the model memory problem.

Frequently Asked Questions

Should I use TorchDistributor or HorovodRunner?

TorchDistributor is recommended for new projects. It is natively supported in Databricks Runtime 13.0+ and runs PyTorch DDP seamlessly on Spark. HorovodRunner supports multiple frameworks (TensorFlow, MXNet, etc.), but Uber has been scaling back active development of Horovod, so for PyTorch-only workloads TorchDistributor is simpler to set up and better officially supported.

When should I use single-node multi-GPU versus distributed multi-node?

The standard pattern is to start with single-node multi-GPU (e.g., a 4xA100 instance) and only scale out to multi-node when GPU memory or throughput becomes a bottleneck. Single-node training has no inter-node communication overhead and is much easier to debug. You need multi-node when the model does not fit in a single node's combined GPU memory, or when data is large enough that you need more batch parallelism.

Which Databricks certifications cover distributed training?

ML Professional covers it in the most depth: when to choose TorchDistributor vs HorovodRunner, the difference between data and model parallelism, and how to track distributed runs in MLflow. ML Associate touches on the basics (why distributed training is needed and how data parallelism works). The GenAI Engineer exam may include DeepSpeed-based LLM fine-tuning as a related topic.

Check what you learned with practice questions

Practice with certification-focused question sets

無料で問題を解いてみる

Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.

Databricks Distributed Training: Complete Guide