Why Distributed Training?
Modern AI models are too large for single GPUs. A 70B-parameter model needs roughly 140 GB just for weights in FP16, more than fits on a typical 24-80 GB GPU, before counting gradients, optimizer states, and activations. Even smaller models benefit from distributed training through faster iteration.
This guide covers the three main approaches to distributed training in PyTorch:
- DDP (DistributedDataParallel): Replicate model across GPUs
- FSDP (Fully Sharded Data Parallel): Shard model across GPUs
- DeepSpeed: Advanced optimization with ZeRO stages
💡 What you'll learn
How to scale training from 1 GPU to 8+ GPUs, reducing training time from days to hours while handling models that don't fit in single GPU memory.
Setup: Multi-GPU Instance
First, get a multi-GPU instance on GPUBrazil:
- 4x RTX 4090: $1.60/hr — Great for models up to 30B
- 8x A100 80GB: $9.60/hr — Train 70B+ models
- 8x H100: $19.92/hr — Maximum performance
# Verify GPU setup
nvidia-smi
# Check that PyTorch sees all GPUs
python -c "import torch; print(torch.cuda.device_count())"
# Check NCCL version (the GPU communication library)
python -c "import torch; print(torch.cuda.nccl.version())"
# Install dependencies
pip install torch torchvision accelerate deepspeed
Method 1: DistributedDataParallel (DDP)
DDP is the simplest approach: it replicates your model on each GPU and synchronizes gradients with an all-reduce during the backward pass.
When to Use DDP
- Model fits in single GPU memory
- You want near-linear throughput scaling (4 GPUs ≈ 4x effective batch size)
- Simple setup with minimal code changes
DDP Training Script
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler
import os

def setup(rank, world_size):
    """Initialize the distributed process group"""
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def cleanup():
    dist.destroy_process_group()

def train(rank, world_size, epochs=10):
    setup(rank, world_size)

    # Create the model, move it to this process's GPU, and wrap it in DDP
    model = YourModel().to(rank)            # YourModel: placeholder for your nn.Module
    model = DDP(model, device_ids=[rank])

    # DistributedSampler ensures each GPU gets a different shard of the data
    dataset = YourDataset()                 # YourDataset: placeholder for your Dataset
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    dataloader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # Shuffle differently each epoch
        for batch in dataloader:
            inputs, labels = batch
            inputs = inputs.to(rank)
            labels = labels.to(rank)

            optimizer.zero_grad()
            outputs = model(inputs)
            loss = nn.functional.cross_entropy(outputs, labels)
            loss.backward()  # Gradients are synchronized (all-reduced) automatically
            optimizer.step()

        if rank == 0:
            print(f"Epoch {epoch} complete")

    cleanup()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    torch.multiprocessing.spawn(train, args=(world_size,), nprocs=world_size)
Launch with torchrun
Note that torchrun starts one process per GPU itself and exports RANK, LOCAL_RANK, and WORLD_SIZE, so a script launched this way should read those environment variables instead of calling torch.multiprocessing.spawn (see the sketch after the commands below).
# Single node, 4 GPUs
torchrun --nproc_per_node=4 train.py
# Multiple nodes (run on each node)
torchrun --nnodes=2 --nproc_per_node=4 \
--rdzv_id=job1 --rdzv_backend=c10d \
--rdzv_endpoint=node0:29500 train.py
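A minimal torchrun-compatible entry point might look like this sketch; it keeps the same placeholder YourModel/YourDataset and training loop as the script above, but reads the rank information that torchrun injects into the environment:
import os
import torch
import torch.distributed as dist

def main():
    # torchrun exports RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # ... build YourModel, wrap it in DDP(device_ids=[local_rank]), create the
    # DistributedSampler/DataLoader, and run the same loop as in the script above ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()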
Method 2: Fully Sharded Data Parallel (FSDP)
FSDP shards model parameters across GPUs, enabling training of models larger than single GPU memory.
When to Use FSDP
- Model doesn't fit in single GPU
- You want PyTorch-native solution
- Training 7B-70B parameter models
import functools
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import AutoModelForCausalLM, AutoTokenizer

def train_fsdp():
    # Initialize distributed (launch with torchrun so RANK/WORLD_SIZE are set)
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)  # on a single node, the global rank doubles as the GPU index

    # Load model
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",
        torch_dtype=torch.bfloat16,
    )

    # FSDP wrapping policy for transformers: shard at the decoder-layer level
    from transformers.models.llama.modeling_llama import LlamaDecoderLayer
    auto_wrap_policy = functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={LlamaDecoderLayer},
    )

    # Mixed precision for efficiency
    mixed_precision = MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    )

    # Wrap model with FSDP
    model = FSDP(
        model,
        auto_wrap_policy=auto_wrap_policy,
        mixed_precision=mixed_precision,
        sharding_strategy=ShardingStrategy.FULL_SHARD,
        device_id=rank,
    )

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    # Training loop (same shape as DDP; dataloader is a placeholder for your DataLoader)
    for batch in dataloader:
        optimizer.zero_grad()
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()
🚀 FSDP Memory Savings
With FULL_SHARD on 8 GPUs, parameters, gradients, and optimizer states are each split roughly 8 ways. A 7B model that overflows a single 24 GB card trains comfortably on 8x RTX 4090s, and 70B-class models become reachable on 8x 80 GB GPUs once you add CPU offload or parameter-efficient fine-tuning; sharding alone does not cover the optimizer states and activations that sit on top of the 140 GB of BF16 weights.
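As a rough sanity check, the per-GPU cost of the sharded model states can be estimated with a few lines of arithmetic. This is a back-of-envelope sketch only: it assumes BF16 weights and gradients plus FP32 Adam states, and ignores activations, buffers, and fragmentation:
def sharded_model_state_gb(params_billions, num_gpus,
                           bytes_per_param=2,   # BF16 weights
                           bytes_per_grad=2,    # BF16 gradients
                           bytes_per_opt=12):   # FP32 master copy + Adam moments
    """Rough per-GPU memory (GB) for fully sharded model states, ignoring activations."""
    total = params_billions * 1e9 * (bytes_per_param + bytes_per_grad + bytes_per_opt)
    return total / num_gpus / 1e9

print(sharded_model_state_gb(7, 8))   # ~14 GB/GPU: a 7B model fits on 24 GB cards
print(sharded_model_state_gb(70, 8))  # ~140 GB/GPU: a 70B model needs offload or more GPUs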
Method 3: DeepSpeed ZeRO
DeepSpeed offers the most advanced memory optimization through ZeRO (Zero Redundancy Optimizer) stages.
ZeRO Stages Explained
- Stage 1: Shard optimizer states (~4x reduction in model-state memory with mixed-precision Adam)
- Stage 2: + Shard gradients (~8x reduction)
- Stage 3: + Shard parameters (reduction grows with the number of GPUs)
DeepSpeed Config
ds_config.json:
{
  "bf16": {"enabled": true},
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {"device": "cpu"},
    "offload_param": {"device": "none"},
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 5e7,
    "stage3_prefetch_bucket_size": 5e7,
    "stage3_param_persistence_threshold": 1e5
  },
  "gradient_accumulation_steps": 4,
  "gradient_clipping": 1.0,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
Training with DeepSpeed
import deepspeed
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

# Load model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-hf")

# Training arguments with DeepSpeed
training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-5,
    bf16=True,
    deepspeed="ds_config.json",
    logging_steps=10,
    save_steps=500,
)

# Use the HuggingFace Trainer (dataset is a placeholder for your tokenized Dataset)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
Launch DeepSpeed
# Using deepspeed launcher
deepspeed --num_gpus=8 train.py --deepspeed ds_config.json
# Or with torchrun
torchrun --nproc_per_node=8 train.py
Using 🤗 Accelerate (Easiest)
HuggingFace Accelerate abstracts away distributed complexity:
from accelerate import Accelerator

accelerator = Accelerator()

# Prepare model, optimizer, dataloader (build these as usual; Accelerate handles placement)
model, optimizer, dataloader = accelerator.prepare(
    model, optimizer, dataloader
)

for batch in dataloader:
    optimizer.zero_grad()
    outputs = model(**batch)
    loss = outputs.loss
    accelerator.backward(loss)
    optimizer.step()

# Save model (only on main process)
accelerator.wait_for_everyone()
if accelerator.is_main_process:
    accelerator.save_model(model, "checkpoint")
Configure Accelerate
# Interactive setup
accelerate config
# Or use config file
accelerate launch --config_file accelerate_config.yaml train.py
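For reference, a minimal accelerate_config.yaml for one 4-GPU node might look like the file below. The field names mirror what accelerate config writes, but exact keys vary between Accelerate versions, so treat this as an illustrative sketch rather than a canonical file:
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
use_cpu: false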
⚠️ Common Pitfall
Always guard logging and saving with accelerator.is_main_process. Otherwise, every process will try to write the same files simultaneously!
Performance Comparison
| Method | Best For | Memory Efficiency | Speed |
|---|---|---|---|
| DDP | Models that fit on one GPU | 1x (full replica per GPU) | Fastest |
| FSDP | 7B-70B models | Scales with GPU count (~Nx on N GPUs) | Fast |
| DeepSpeed ZeRO-3 | 70B+ models | ~Nx, more with CPU/NVMe offload | Good |
| Accelerate | Any (wrapper over DDP/FSDP/DeepSpeed) | Varies | Varies |
Scaling Tips
1. Linear Scaling Rule
When you increase the global batch size, scale the learning rate proportionally (and consider a short warmup, since large effective batches can be unstable early in training):
base_lr = 1e-4
base_batch = 32  # global batch size the base LR was tuned for
effective_batch = batch_size * num_gpus * grad_accum
lr = base_lr * (effective_batch / base_batch)
2. Gradient Accumulation
Simulate larger batches without more memory:
for i, batch in enumerate(dataloader):
    loss = model(batch).loss / accumulation_steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
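One subtlety under DDP: every backward() normally triggers a gradient all-reduce, which is wasted work on accumulation steps. DDP's no_sync() context manager defers that communication until the step that actually updates the weights. A sketch, assuming model is the DDP-wrapped model and accumulation_steps is defined as above:
import contextlib

for i, batch in enumerate(dataloader):
    is_update_step = (i + 1) % accumulation_steps == 0
    # no_sync() skips the gradient all-reduce; gradients are synced on the update step
    ctx = contextlib.nullcontext() if is_update_step else model.no_sync()
    with ctx:
        loss = model(batch).loss / accumulation_steps
        loss.backward()
    if is_update_step:
        optimizer.step()
        optimizer.zero_grad()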
3. Mixed Precision
Use BF16 or FP16 for roughly 2x memory savings on weights and activations plus faster compute. FP16 needs loss scaling to avoid gradient underflow, which is what GradScaler provides:
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

optimizer.zero_grad()
with autocast():
    outputs = model(inputs)
    loss = criterion(outputs, labels)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
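On Ampere-or-newer GPUs (RTX 4090, A100, H100), BF16 is usually the simpler option: it has the same exponent range as FP32, so no GradScaler is needed. A minimal sketch using the newer torch.autocast API, with model, inputs, labels, and criterion as above:
import torch

optimizer.zero_grad()
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    outputs = model(inputs)
    loss = criterion(outputs, labels)
loss.backward()
optimizer.step()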
Scale Your Training Today
Multi-GPU instances from $1.60/hr. Train models 4-8x faster.
Get $5 Free Credit →
Troubleshooting
NCCL Errors
# Set NCCL debug
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
# Use TCP if InfiniBand fails
export NCCL_IB_DISABLE=1
OOM Errors
- Reduce batch size
- Increase gradient accumulation
- Enable activation checkpointing (see the sketch after this list)
- Use ZeRO Stage 3 with CPU offload
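Activation checkpointing trades compute for memory by recomputing activations during the backward pass instead of storing them. For HuggingFace models it is a one-line switch; for custom modules you can wrap expensive blocks with torch.utils.checkpoint. A minimal sketch (CheckpointedBlock is an illustrative name, not a library class):
import torch
from torch.utils.checkpoint import checkpoint

# HuggingFace models: built-in switch
model.gradient_checkpointing_enable()

# Custom modules: recompute the wrapped block's activations during backward
class CheckpointedBlock(torch.nn.Module):
    def __init__(self, block):
        super().__init__()
        self.block = block

    def forward(self, x):
        # use_reentrant=False is the recommended mode in recent PyTorch releases
        return checkpoint(self.block, x, use_reentrant=False)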
Slow Training
- Profile with torch.profiler (see the sketch after this list)
- Check GPU utilization with nvidia-smi
- Ensure data loading isn't the bottleneck (increase DataLoader num_workers)
- Use NVLink-connected GPUs for faster communication
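A short profiling sketch with torch.profiler: it captures a few steps and writes a trace you can open in TensorBoard (train_step here is a stand-in for your forward/backward/optimizer code):
import torch
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3),
    on_trace_ready=tensorboard_trace_handler("./profiler_logs"),
) as prof:
    for step, batch in enumerate(dataloader):
        train_step(batch)  # placeholder for your training step
        prof.step()        # advance the profiler's schedule
        if step >= 5:
            break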
Conclusion
Distributed training is essential for modern AI development. Start with DDP for simplicity, move to FSDP or DeepSpeed when you need to scale model size.
With GPUBrazil's multi-GPU instances, you can experiment with all these approaches affordably. Start with 4x RTX 4090 at $1.60/hr and scale up as needed.