Why Distributed Training?

Modern AI models are too large for a single GPU. A 70B-parameter model needs roughly 140GB just for its weights in FP16, before counting gradients, optimizer state, and activations, which is far more memory than a single GPU offers. Even smaller models benefit from distributed training through faster iteration.

This guide covers the three main approaches to distributed training in PyTorch: DistributedDataParallel (DDP), Fully Sharded Data Parallel (FSDP), and DeepSpeed ZeRO, plus 🤗 Accelerate as a convenient wrapper over all three.

💡 What you'll learn

How to scale training from 1 GPU to 8+ GPUs, reducing training time from days to hours while handling models that don't fit in single GPU memory.

Setup: Multi-GPU Instance

First, get a multi-GPU instance on GPUBrazil:

# Verify GPU setup
nvidia-smi

# Check that PyTorch sees every GPU and that NCCL (the GPU communication backend) is available
python -c "import torch; print(torch.cuda.device_count())"
python -c "import torch; print(torch.distributed.is_nccl_available())"

# Install dependencies
pip install torch torchvision accelerate deepspeed

Method 1: DistributedDataParallel (DDP)

DDP is the simplest approach: it replicates the full model on every GPU, feeds each replica a different slice of the data, and averages gradients across GPUs during the backward pass.
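
Conceptually, after every backward pass DDP averages each parameter's gradient across all processes. You never write this yourself (DDP buckets gradients and overlaps the communication with the backward pass), but a bare-bones sketch of what it automates looks like this:

import torch.distributed as dist

def average_gradients(model, world_size):
    # Sum each gradient across all ranks, then divide to get the average
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size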

When to Use DDP

Use DDP when the whole model, its gradients, and its optimizer state fit on a single GPU and you mainly want to increase throughput by processing more data in parallel. It is the fastest option because only gradients travel over the interconnect.

DDP Training Script

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler
import torch.multiprocessing as mp
import os

def setup(rank, world_size):
    """Initialize distributed process group"""
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def cleanup():
    dist.destroy_process_group()

def train(rank, world_size, epochs=10):
    setup(rank, world_size)
    
    # Create model and move to GPU
    model = YourModel().to(rank)
    model = DDP(model, device_ids=[rank])
    
    # Distributed sampler ensures each GPU gets different data
    dataset = YourDataset()
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    dataloader = DataLoader(dataset, batch_size=32, sampler=sampler)
    
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    
    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # Shuffle differently each epoch
        
        for batch in dataloader:
            inputs, labels = batch
            inputs = inputs.to(rank)
            labels = labels.to(rank)
            
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = nn.functional.cross_entropy(outputs, labels)
            loss.backward()  # Gradients synchronized automatically
            optimizer.step()
        
        if rank == 0:
            print(f"Epoch {epoch} complete")
    
    cleanup()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)

Launch with torchrun

# Single node, 4 GPUs
torchrun --nproc_per_node=4 train.py

# Multiple nodes (run on each node)
torchrun --nnodes=2 --nproc_per_node=4 \
    --rdzv_id=job1 --rdzv_backend=c10d \
    --rdzv_endpoint=node0:29500 train.py
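
Note that the spawn-based script above launches its own worker processes, while torchrun starts one process per GPU for you and sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment. A minimal torchrun-style entry point (a sketch reusing the placeholder names from the script above) looks like this:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun for each worker
    dist.init_process_group("nccl")              # rank and world size come from the env
    torch.cuda.set_device(local_rank)

    model = YourModel().to(local_rank)           # YourModel as in the script above
    model = DDP(model, device_ids=[local_rank])

    # ... build the DistributedSampler/DataLoader and train exactly as above ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()   # no mp.spawn: torchrun already launched one process per GPU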

Method 2: Fully Sharded Data Parallel (FSDP)

FSDP shards model parameters across GPUs, enabling training of models larger than single GPU memory.

When to Use FSDP

Use FSDP when the model (or its optimizer state) no longer fits on one GPU, roughly the 7B-70B range for full fine-tuning. It trades extra communication for the ability to shard parameters, gradients, and optimizer state across devices.

import functools

import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import AutoModelForCausalLM

def train_fsdp():
    # Initialize distributed (launch with torchrun so rank and world size are set)
    dist.init_process_group("nccl")
    rank = dist.get_rank()          # single node assumed: global rank == local GPU index
    torch.cuda.set_device(rank)
    
    # Load model
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",
        torch_dtype=torch.bfloat16,
    )
    
    # FSDP wrapping policy for transformers
    from transformers.models.llama.modeling_llama import LlamaDecoderLayer
    
    auto_wrap_policy = functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={LlamaDecoderLayer},
    )
    
    # Mixed precision for efficiency
    mixed_precision = MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    )
    
    # Wrap model with FSDP
    model = FSDP(
        model,
        auto_wrap_policy=auto_wrap_policy,
        mixed_precision=mixed_precision,
        sharding_strategy=ShardingStrategy.FULL_SHARD,
        device_id=rank,
    )
    
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    
    # Training loop (same pattern as DDP: use a DataLoader with a DistributedSampler)
    for batch in dataloader:                  # dataloader built as in the DDP example above
        optimizer.zero_grad()
        outputs = model(**batch)              # batch tensors must already be on this GPU
        loss = outputs.loss
        loss.backward()
        optimizer.step()
    
    dist.destroy_process_group()

🚀 FSDP Memory Savings

With 8 GPUs and FULL_SHARD, the per-GPU footprint of parameters, gradients, and optimizer state drops by roughly 8x. A model whose full fine-tuning state overflows a single 24GB card (a 7B model, for example) fits comfortably when sharded across 8 of them; a 70B model still needs CPU offload, more GPUs, or higher-memory cards once gradients and optimizer state are counted.
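
As a rough sanity check (my own back-of-the-envelope estimate, ignoring activations and temporary buffers): mixed-precision AdamW keeps about 16 bytes of state per parameter, and FULL_SHARD divides that across the GPUs.

def per_gpu_model_state_gb(n_params, n_gpus, bytes_per_param=16):
    """~16 bytes/param: BF16 weights + grads plus FP32 master weights and Adam moments."""
    return n_params * bytes_per_param / n_gpus / 1e9

print(per_gpu_model_state_gb(7e9, 8))    # ~14 GB/GPU: fits a 24 GB card with room for activations
print(per_gpu_model_state_gb(70e9, 8))   # ~140 GB/GPU: needs offload, more GPUs, or bigger cards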

Method 3: DeepSpeed ZeRO

DeepSpeed offers the most advanced memory optimization through ZeRO (Zero Redundancy Optimizer) stages.

ZeRO Stages Explained

ZeRO progressively shards more of the training state across GPUs:

Stage 1 shards the optimizer states (FP32 master weights and Adam moments).
Stage 2 additionally shards the gradients.
Stage 3 additionally shards the model parameters themselves, giving the largest savings at the cost of more communication. Optimizer states and parameters can also be offloaded to CPU (or NVMe) for further headroom.

DeepSpeed Config

ds_config.json:
{
    "bf16": {"enabled": true},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "none"},
        "overlap_comm": true,
        "contiguous_gradients": true,
        "reduce_bucket_size": 5e7,
        "stage3_prefetch_bucket_size": 5e7,
        "stage3_param_persistence_threshold": 1e5
    },
    "gradient_accumulation_steps": 4,
    "gradient_clipping": 1.0,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}

Training with DeepSpeed

import deepspeed
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

# Load model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-hf")

# Training arguments with DeepSpeed
training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-5,
    bf16=True,
    deepspeed="ds_config.json",
    logging_steps=10,
    save_steps=500,
)

# Use HuggingFace Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,  # your tokenized dataset
)

trainer.train()

Launch DeepSpeed

# Using deepspeed launcher
deepspeed --num_gpus=8 train.py --deepspeed ds_config.json

# Or with torchrun
torchrun --nproc_per_node=8 train.py

Using 🤗 Accelerate (Easiest)

HuggingFace Accelerate abstracts away distributed complexity:

from accelerate import Accelerator

accelerator = Accelerator()

# Prepare model, optimizer, dataloader
model, optimizer, dataloader = accelerator.prepare(
    model, optimizer, dataloader
)

for batch in dataloader:
    optimizer.zero_grad()
    outputs = model(**batch)
    loss = outputs.loss
    accelerator.backward(loss)
    optimizer.step()

# Save the model: save_model gathers sharded weights, so call it on every process
# (only the main process actually writes the files)
accelerator.wait_for_everyone()
accelerator.save_model(model, "checkpoint")

Configure Accelerate

# Interactive setup
accelerate config

# Or use config file
accelerate launch --config_file accelerate_config.yaml train.py

⚠️ Common Pitfall

Gate logging, printing, and progress bars behind accelerator.is_main_process so every process doesn't write to the console at once. For checkpoints, prefer accelerator.save_model, which coordinates the processes and writes only from the main one.
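
For example, gate per-step logging like this (step and loss are whatever your loop already tracks):

if accelerator.is_main_process:
    print(f"step {step}: loss = {loss.item():.4f}")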

Performance Comparison

Method            Best For              Memory Efficiency   Speed
DDP               Small-medium models   1x                  Fastest
FSDP              7B-70B models         8x (with 8 GPUs)    Fast
DeepSpeed ZeRO-3  70B+ models           10x+                Good
Accelerate        Any (wrapper)         Varies              Varies

Scaling Tips

1. Linear Scaling Rule

When the effective (global) batch size grows, scale the learning rate proportionally, typically ramping up to it over a warmup period:

base_lr = 1e-4
base_batch = 32                                   # the batch size base_lr was tuned for (example value)
effective_batch = batch_size * num_gpus * grad_accum
lr = base_lr * (effective_batch / base_batch)

2. Gradient Accumulation

Simulate larger batches without more memory:

accumulation_steps = 4        # effective batch = per-GPU batch * accumulation_steps * num_gpus

for i, batch in enumerate(dataloader):
    loss = model(batch).loss / accumulation_steps   # scale so accumulated gradients average correctly
    loss.backward()

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
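
One caveat under DDP: every backward() call triggers a gradient all-reduce, even on iterations that don't step the optimizer. DDP's no_sync() context manager skips those redundant syncs; a sketch, assuming model is DDP-wrapped and accumulation_steps as above:

import contextlib

for i, batch in enumerate(dataloader):
    is_update_step = (i + 1) % accumulation_steps == 0
    # Only synchronize gradients on the iteration that actually steps the optimizer
    sync_ctx = contextlib.nullcontext() if is_update_step else model.no_sync()
    with sync_ctx:
        loss = model(batch).loss / accumulation_steps
        loss.backward()

    if is_update_step:
        optimizer.step()
        optimizer.zero_grad()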

3. Mixed Precision

Use BF16 or FP16 for roughly 2x memory savings on weights and activations plus faster compute. FP16 needs a gradient scaler to avoid underflow; BF16 usually does not:

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()          # loss scaling prevents FP16 gradient underflow

optimizer.zero_grad()
with autocast():               # forward pass runs in FP16 where numerically safe
    outputs = model(inputs)
    loss = criterion(outputs, labels)

scaler.scale(loss).backward()  # backward on the scaled loss
scaler.step(optimizer)         # unscales gradients, then steps the optimizer
scaler.update()                # adjusts the scale factor for the next iteration
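
The scaler exists only to prevent FP16 gradient underflow. With BF16 (Ampere or newer GPUs) you can usually drop it:

optimizer.zero_grad()
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    outputs = model(inputs)
    loss = criterion(outputs, labels)

loss.backward()   # no GradScaler needed for BF16
optimizer.step()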

Scale Your Training Today

Multi-GPU instances from $1.60/hr. Train models 4-8x faster.

Get $5 Free Credit →

Troubleshooting

NCCL Errors

# Set NCCL debug
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL

# Use TCP if InfiniBand fails
export NCCL_IB_DISABLE=1

OOM Errors

Lower the per-GPU batch size and recover the effective batch with gradient accumulation, enable activation (gradient) checkpointing, train in BF16, or step up the sharding (FSDP FULL_SHARD, ZeRO stage 3, CPU offload).
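
If the model comes from transformers, activation checkpointing is a one-liner that trades extra compute for a large activation-memory saving:

model.gradient_checkpointing_enable()   # recompute activations during backward instead of storing them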

Slow Training

Watch nvidia-smi during a run. If GPU utilization dips, data loading is usually the bottleneck: raise the DataLoader's num_workers and enable pin_memory. If utilization is high but multi-GPU scaling is poor, communication is dominating: increase the per-GPU batch size and check the NCCL_DEBUG=INFO log to confirm NVLink or InfiniBand is being used rather than plain TCP.

Conclusion

Distributed training is essential for modern AI development. Start with DDP for simplicity, then move to FSDP or DeepSpeed ZeRO when the model itself no longer fits on a single GPU.

With GPUBrazil's multi-GPU instances, you can experiment with all these approaches affordably. Start with 4x RTX 4090 at $1.60/hr and scale up as needed.