Why Distributed Training?
Modern AI models are too large for single GPUs. A 70B-parameter model needs roughly 140 GB just for weights in FP16, more than fits on a typical 24-80 GB GPU, before counting gradients, optimizer states, and activations. Even smaller models benefit from distributed training through faster iteration.
This guide covers the three main approaches to distributed training in PyTorch:
- DDP (DistributedDataParallel): Replicate model across GPUs
- FSDP (Fully Sharded Data Parallel): Shard model across GPUs
- DeepSpeed: Advanced optimization with ZeRO stages
💡 What you'll learn
How to scale training from 1 GPU to 8+ GPUs, reducing training time from days to hours while handling models that don't fit in single GPU memory.
Setup: Multi-GPU Instance
First, get a multi-GPU instance on GPUBrazil:
- 4x RTX 4090: $1.60/hr — Great for models up to 30B
- 8x A100 80GB: $9.60/hr — Train 70B+ models
- 8x H100: $19.92/hr — Maximum performance
# Verify GPU setup
nvidia-smi
# Check that PyTorch sees all GPUs
python -c "import torch; print(torch.cuda.device_count())"
# Check NCCL version (the GPU communication library)
python -c "import torch; print(torch.cuda.nccl.version())"
# Install dependencies
pip install torch torchvision accelerate deepspeed
Method 1: DistributedDataParallel (DDP)
DDP is the simplest approach: it replicates your model on each GPU and synchronizes gradients with an all-reduce during the backward pass.
When to Use DDP
- Model fits in single GPU memory
- You want near-linear throughput scaling (4 GPUs ≈ 4x effective batch size)
- Simple setup with minimal code changes
DDP Training Script
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler
import os

def setup(rank, world_size):
    """Initialize the distributed process group"""
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def cleanup():
    dist.destroy_process_group()

def train(rank, world_size, epochs=10):
    setup(rank, world_size)

    # Create the model, move it to this process's GPU, and wrap it in DDP
    model = YourModel().to(rank)            # YourModel: placeholder for your nn.Module
    model = DDP(model, device_ids=[rank])

    # DistributedSampler ensures each GPU gets a different shard of the data
    dataset = YourDataset()                 # YourDataset: placeholder for your Dataset
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    dataloader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # Shuffle differently each epoch
        for batch in dataloader:
            inputs, labels = batch
            inputs = inputs.to(rank)
            labels = labels.to(rank)

            optimizer.zero_grad()
            outputs = model(inputs)
            loss = nn.functional.cross_entropy(outputs, labels)
            loss.backward()  # Gradients are synchronized (all-reduced) automatically
            optimizer.step()

        if rank == 0:
            print(f"Epoch {epoch} complete")

    cleanup()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    torch.multiprocessing.spawn(train, args=(world_size,), nprocs=world_size)
Launch with torchrun
Note that torchrun starts one process per GPU itself and exports RANK, LOCAL_RANK, and WORLD_SIZE, so a script launched this way should read those environment variables instead of calling torch.multiprocessing.spawn (see the sketch after the commands below).
# Single node, 4 GPUs
torchrun --nproc_per_node=4 train.py
# Multiple nodes (run on each node)
torchrun --nnodes=2 --nproc_per_node=4 \
--rdzv_id=job1 --rdzv_backend=c10d \
--rdzv_endpoint=node0:29500 train.py
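A minimal torchrun-compatible entry point might look like this sketch; it keeps the same placeholder YourModel/YourDataset and training loop as the script above, but reads the rank information that torchrun injects into the environment:
import os
import torch
import torch.distributed as dist

def main():
    # torchrun exports RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # ... build YourModel, wrap it in DDP(device_ids=[local_rank]), create the
    # DistributedSampler/DataLoader, and run the same loop as in the script above ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()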
Method 2: Fully Sharded Data Parallel (FSDP)
FSDP shards model parameters across GPUs, enabling training of models larger than single GPU memory.
When to Use FSDP
- Model doesn't fit in single GPU
- You want PyTorch-native solution
- Training 7B-70B parameter models
import functools
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import AutoModelForCausalLM, AutoTokenizer

def train_fsdp():
    # Initialize distributed (launch with torchrun so RANK/WORLD_SIZE are set)
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)  # on a single node, the global rank doubles as the GPU index

    # Load model
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",
        torch_dtype=torch.bfloat16,
    )

    # FSDP wrapping policy for transformers: shard at the decoder-layer level
    from transformers.models.llama.modeling_llama import LlamaDecoderLayer
    auto_wrap_policy = functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={LlamaDecoderLayer},
    )

    # Mixed precision for efficiency
    mixed_precision = MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    )

    # Wrap model with FSDP
    model = FSDP(
        model,
        auto_wrap_policy=auto_wrap_policy,
        mixed_precision=mixed_precision,
        sharding_strategy=ShardingStrategy.FULL_SHARD,
        device_id=rank,
    )

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    # Training loop (same shape as DDP; dataloader is a placeholder for your DataLoader)
    for batch in dataloader:
        optimizer.zero_grad()
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()
🚀 FSDP Memory Savings
With FULL_SHARD on 8 GPUs, parameters, gradients, and optimizer states are each split roughly 8 ways. A 7B model that overflows a single 24 GB card trains comfortably on 8x RTX 4090s, and 70B-class models become reachable on 8x 80 GB GPUs once you add CPU offload or parameter-efficient fine-tuning; sharding alone does not cover the optimizer states and activations that sit on top of the 140 GB of BF16 weights.
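As a rough sanity check, the per-GPU cost of the sharded model states can be estimated with a few lines of arithmetic. This is a back-of-envelope sketch only: it assumes BF16 weights and gradients plus FP32 Adam states, and ignores activations, buffers, and fragmentation:
def sharded_model_state_gb(params_billions, num_gpus,
                           bytes_per_param=2,   # BF16 weights
                           bytes_per_grad=2,    # BF16 gradients
                           bytes_per_opt=12):   # FP32 master copy + Adam moments
    """Rough per-GPU memory (GB) for fully sharded model states, ignoring activations."""
    total = params_billions * 1e9 * (bytes_per_param + bytes_per_grad + bytes_per_opt)
    return total / num_gpus / 1e9

print(sharded_model_state_gb(7, 8))   # ~14 GB/GPU: a 7B model fits on 24 GB cards
print(sharded_model_state_gb(70, 8))  # ~140 GB/GPU: a 70B model needs offload or more GPUs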
Method 3: DeepSpeed ZeRO
DeepSpeed offers the most advanced memory optimization through ZeRO (Zero Redundancy Optimizer) stages.
ZeRO Stages Explained
- Stage 1: Shard optimizer states (~4x reduction in model-state memory with mixed-precision Adam)
- Stage 2: + Shard gradients (~8x reduction)
- Stage 3: + Shard parameters (reduction grows with the number of GPUs)
DeepSpeed Config
ds_config.json:
{
  "bf16": {"enabled": true},
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {"device": "cpu"},
    "offload_param": {"device": "none"},
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 5e7,
    "stage3_prefetch_bucket_size": 5e7,
    "stage3_param_persistence_threshold": 1e5
  },
  "gradient_accumulation_steps": 4,
  "gradient_clipping": 1.0,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
Training with DeepSpeed
import deepspeed
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

# Load model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-hf")

# Training arguments with DeepSpeed
training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-5,
    bf16=True,
    deepspeed="ds_config.json",
    logging_steps=10,
    save_steps=500,
)

# Use the HuggingFace Trainer (dataset is a placeholder for your tokenized Dataset)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
Launch DeepSpeed
# Using deepspeed launcher
deepspeed --num_gpus=8 train.py --deepspeed ds_config.json
# Or with torchrun
torchrun --nproc_per_node=8 train.py
Using 🤗 Accelerate (Easiest)
HuggingFace Accelerate abstracts away distributed complexity:
from accelerate import Accelerator

accelerator = Accelerator()

# Prepare model, optimizer, dataloader (build these as usual; Accelerate handles placement)
model, optimizer, dataloader = accelerator.prepare(
    model, optimizer, dataloader
)

for batch in dataloader:
    optimizer.zero_grad()
    outputs = model(**batch)
    loss = outputs.loss
    accelerator.backward(loss)
    optimizer.step()

# Save model (only on main process)
accelerator.wait_for_everyone()
if accelerator.is_main_process:
    accelerator.save_model(model, "checkpoint")
Configure Accelerate
# Interactive setup
accelerate config
# Or use config file
accelerate launch --config_file accelerate_config.yaml train.py
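For reference, a minimal accelerate_config.yaml for one 4-GPU node might look like the file below. The field names mirror what accelerate config writes, but exact keys vary between Accelerate versions, so treat this as an illustrative sketch rather than a canonical file:
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
use_cpu: false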
⚠️ Common Pitfall
Always guard logging and saving with accelerator.is_main_process. Otherwise, every process will try to write the same files simultaneously!
Performance Comparison
| Method | Best For | Memory Efficiency | Speed |
|---|---|---|---|
| DDP | Models that fit on one GPU | 1x (full replica per GPU) | Fastest |
| FSDP | 7B-70B models | Scales with GPU count (~Nx on N GPUs) | Fast |
| DeepSpeed ZeRO-3 | 70B+ models | ~Nx, more with CPU/NVMe offload | Good |
| Accelerate | Any (wrapper over DDP/FSDP/DeepSpeed) | Varies | Varies |
Scaling Tips
1. Linear Scaling Rule
When you increase the global batch size, scale the learning rate proportionally (and consider a short warmup, since large effective batches can be unstable early in training):
base_lr = 1e-4
base_batch = 32  # global batch size the base LR was tuned for
effective_batch = batch_size * num_gpus * grad_accum
lr = base_lr * (effective_batch / base_batch)
2. Gradient Accumulation
Simulate larger batches without more memory:
for i, batch in enumerate(dataloader):
    loss = model(batch).loss / accumulation_steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
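One subtlety under DDP: every backward() normally triggers a gradient all-reduce, which is wasted work on accumulation steps. DDP's no_sync() context manager defers that communication until the step that actually updates the weights. A sketch, assuming model is the DDP-wrapped model and accumulation_steps is defined as above:
import contextlib

for i, batch in enumerate(dataloader):
    is_update_step = (i + 1) % accumulation_steps == 0
    # no_sync() skips the gradient all-reduce; gradients are synced on the update step
    ctx = contextlib.nullcontext() if is_update_step else model.no_sync()
    with ctx:
        loss = model(batch).loss / accumulation_steps
        loss.backward()
    if is_update_step:
        optimizer.step()
        optimizer.zero_grad()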
3. Mixed Precision
Use BF16 or FP16 for roughly 2x memory savings on weights and activations plus faster compute. FP16 needs loss scaling to avoid gradient underflow, which is what GradScaler provides:
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

optimizer.zero_grad()
with autocast():
    outputs = model(inputs)
    loss = criterion(outputs, labels)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
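On Ampere-or-newer GPUs (RTX 4090, A100, H100), BF16 is usually the simpler option: it has the same exponent range as FP32, so no GradScaler is needed. A minimal sketch using the newer torch.autocast API, with model, inputs, labels, and criterion as above:
import torch

optimizer.zero_grad()
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    outputs = model(inputs)
    loss = criterion(outputs, labels)
loss.backward()
optimizer.step()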
Scale Your Training Today
Multi-GPU instances from $1.60/hr. Train models 4-8x faster.
Get $5 Free Credit →
Troubleshooting
NCCL Errors
# Set NCCL debug
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
# Use TCP if InfiniBand fails
export NCCL_IB_DISABLE=1
OOM Errors
- Reduce batch size
- Increase gradient accumulation
- Enable activation checkpointing (see the sketch after this list)
- Use ZeRO Stage 3 with CPU offload
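Activation checkpointing trades compute for memory by recomputing activations during the backward pass instead of storing them. For HuggingFace models it is a one-line switch; for custom modules you can wrap expensive blocks with torch.utils.checkpoint. A minimal sketch (CheckpointedBlock is an illustrative name, not a library class):
import torch
from torch.utils.checkpoint import checkpoint

# HuggingFace models: built-in switch
model.gradient_checkpointing_enable()

# Custom modules: recompute the wrapped block's activations during backward
class CheckpointedBlock(torch.nn.Module):
    def __init__(self, block):
        super().__init__()
        self.block = block

    def forward(self, x):
        # use_reentrant=False is the recommended mode in recent PyTorch releases
        return checkpoint(self.block, x, use_reentrant=False)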
Slow Training
- Profile with torch.profiler (see the sketch after this list)
- Check GPU utilization with nvidia-smi
- Ensure data loading isn't the bottleneck (increase DataLoader num_workers)
- Use NVLink-connected GPUs for faster communication
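A short profiling sketch with torch.profiler: it captures a few steps and writes a trace you can open in TensorBoard (train_step here is a stand-in for your forward/backward/optimizer code):
import torch
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3),
    on_trace_ready=tensorboard_trace_handler("./profiler_logs"),
) as prof:
    for step, batch in enumerate(dataloader):
        train_step(batch)  # placeholder for your training step
        prof.step()        # advance the profiler's schedule
        if step >= 5:
            break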
Conclusion
Distributed training is essential for modern AI development. Start with DDP for simplicity, move to FSDP or DeepSpeed when you need to scale model size.
With GPUBrazil's multi-GPU instances, you can experiment with all these approaches affordably. Start with 4x RTX 4090 at $1.60/hr and scale up as needed.