Why vLLM?

vLLM is the gold standard for LLM inference. Its PagedAttention algorithm delivers 2-4x higher throughput than naive implementations, and it's used by companies serving billions of tokens daily.

But default settings leave performance on the table. This guide shows you how to squeeze every last token/second from your setup.

Baseline: Default vLLM Performance

Running Llama 3 8B with defaults on an H100:

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct

Typical throughput: ~500-800 tokens/second (batched)

After optimization: 1,500-2,000+ tokens/second
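
Once the server is up, you can sanity-check it with any OpenAI-compatible client. A minimal sketch using the openai Python package (assuming the server is listening on localhost:8000 with no --api-key configured, so any placeholder key works):

from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; point the client at the local server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)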

Optimization 1: GPU Memory Utilization

By default, vLLM caps itself at 90% of GPU memory (--gpu-memory-utilization defaults to 0.9), keeping the rest as headroom. For dedicated inference servers, push it higher:

--gpu-memory-utilization 0.95

This gives vLLM more space for KV cache, enabling more concurrent requests.
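
To see what that buys you, here is a rough back-of-the-envelope sketch. The per-token figure assumes Llama 3 8B's architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128) with a bf16 KV cache, and the 80 GB figure assumes an H100:

# Rough KV-cache sizing for Llama 3 8B (grouped-query attention).
layers, kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2  # bf16 = 2 bytes

# Keys + values for every layer, per token.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")  # ~128 KiB

# Raising --gpu-memory-utilization from 0.90 to 0.95 on an 80 GB H100
# hands roughly 4 GB more to the KV cache.
extra_bytes = (0.95 - 0.90) * 80e9
print(f"Extra KV capacity: ~{extra_bytes / kv_bytes_per_token:,.0f} tokens")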

โš ๏ธ Caution

Don't go above 0.95. Leave some room for CUDA operations. Going to 0.99 can cause OOM errors.

Optimization 2: Tune Max Model Length

If your use case doesn't need full context length, reduce it:

# Default Llama 3 context: 8192
# If you only need 2048:
--max-model-len 2048

Shorter context = more KV cache space = more concurrent requests = higher throughput.

Max Length | Concurrent Requests | Throughput (H100)
8192       | ~16                 | ~800 tok/s
4096       | ~32                 | ~1,200 tok/s
2048       | ~60                 | ~1,800 tok/s
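
The exact concurrency depends on your workload, but the trend falls straight out of KV-cache arithmetic: halving the maximum length halves each sequence's worst-case cache footprint. A quick sketch reusing the ~128 KiB/token figure from the earlier example:

# Worst-case KV-cache footprint of a single sequence at each max length.
kv_bytes_per_token = 128 * 1024   # Llama 3 8B, bf16 (see the sketch above)

for max_len in (8192, 4096, 2048):
    gib = max_len * kv_bytes_per_token / 2**30
    print(f"max-model-len {max_len}: up to {gib:.2f} GiB of KV cache per sequence")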

Optimization 3: Tensor Parallelism

For larger models or multi-GPU setups:

# For 8x GPU setup
--tensor-parallel-size 8

# For 2x GPU setup
--tensor-parallel-size 2

Match tensor parallelism to your GPU count for even memory distribution.
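
The same knob is exposed in vLLM's offline Python API, which can be convenient for quick experiments. A minimal sketch, assuming an 8-GPU node like the one above:

from vllm import LLM, SamplingParams

# Shard Llama 3 70B across 8 GPUs; each GPU holds roughly 1/8 of the weights.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=8,
)

outputs = llm.generate(
    ["Explain tensor parallelism in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)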

Optimization 4: Quantization

FP8 quantization on H100 gives ~1.5x speedup with minimal quality loss:

--quantization fp8
--dtype float16

For A100/older GPUs, use AWQ or GPTQ quantized models:

--model TheBloke/Llama-2-70B-AWQ
--quantization awq
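
The offline API accepts the same quantization setting. A minimal sketch with the AWQ checkpoint from the command above:

from vllm import LLM, SamplingParams

# Load a pre-quantized AWQ checkpoint (4-bit weights, fp16 activations).
llm = LLM(model="TheBloke/Llama-2-70B-AWQ", quantization="awq")

out = llm.generate(["Summarize AWQ in one sentence."], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)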

Optimization 5: Speculative Decoding

Use a smaller draft model to speed up generation:

--speculative-model meta-llama/Llama-3.2-1B
--num-speculative-tokens 5

This can give 1.5-2x speedup for long-form generation tasks.

Optimization 6: Prefix Caching

If many requests share the same system prompt, enable prefix caching:

--enable-prefix-caching

This caches KV values for common prefixes, reducing computation for repeated prompts.
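
Prefix caching pays off when many requests share a long, identical prefix, typically the system prompt. A sketch of the request pattern that benefits (SYSTEM_PROMPT is a hypothetical stand-in for whatever long instructions you reuse):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# The long, identical system prompt is the shared prefix; with
# --enable-prefix-caching its KV cache is computed once and reused.
SYSTEM_PROMPT = "You are a support assistant for Acme Corp. ..."  # hypothetical

for question in ["How do I reset my password?", "What is your refund policy?"]:
    resp = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
        max_tokens=128,
    )
    print(resp.choices[0].message.content)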

Optimization 7: Chunked Prefill

For long prompts, chunked prefill keeps a single large prefill from stalling decode steps for other requests:

--enable-chunked-prefill
--max-num-batched-tokens 8192

Complete Optimized Configuration

Here's the full optimized command for Llama 3 8B on a single H100:

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --gpu-memory-utilization 0.95 \
    --max-model-len 4096 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --max-num-batched-tokens 8192 \
    --dtype bfloat16 \
    --port 8000

For 70B on 8x H100:

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-70B-Instruct \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.95 \
    --max-model-len 4096 \
    --enable-prefix-caching \
    --quantization fp8 \
    --dtype float16 \
    --port 8000

Benchmarking Your Setup

Use the benchmark_serving script that ships in the benchmarks/ directory of the vLLM repository:

python -m vllm.entrypoints.openai.api_server ... &

# In another terminal, from a clone of the vLLM repo:
python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --num-prompts 1000 \
    --request-rate inf

(Depending on your vLLM version, you may also need to supply a prompt dataset via --dataset-name or --dataset-path.)

Key metrics to watch: output token throughput (tok/s), time to first token (TTFT), and time per output token (TPOT).
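
If you just want a rough throughput number without the full harness, a handful of concurrent requests through the async OpenAI client will do. A sketch (the prompts and concurrency level are arbitrary):

import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"

async def one_request(i: int) -> int:
    # Each request asks for up to 256 output tokens.
    resp = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": f"Write a short paragraph about topic {i}."}],
        max_tokens=256,
    )
    return resp.usage.completion_tokens

async def main(concurrency: int = 64) -> None:
    start = time.perf_counter()
    tokens = await asyncio.gather(*(one_request(i) for i in range(concurrency)))
    elapsed = time.perf_counter() - start
    print(f"{sum(tokens)} output tokens in {elapsed:.1f}s "
          f"= {sum(tokens) / elapsed:.0f} tok/s")

asyncio.run(main())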

Results: Before vs After

Configuration       | Throughput    | Improvement
vLLM defaults       | ~750 tok/s    | baseline
+ gpu-memory 0.95   | ~900 tok/s    | +20%
+ max-len 4096      | ~1,200 tok/s  | +60%
+ prefix caching    | ~1,400 tok/s  | +87%
+ chunked prefill   | ~1,600 tok/s  | +113%
+ FP8 quant         | ~2,100 tok/s  | +180%

💡 Cost Impact

2.8x throughput means the same workload needs roughly 1/2.8 of the GPU hours. If you're spending $1,000/month on inference, these optimizations save you roughly $640/month!
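
The arithmetic, if you want to plug in your own numbers (a sketch using the figures above):

monthly_spend = 1000.0   # USD, example figure from above
speedup = 2.8            # throughput multiple after optimization

# The same workload now needs 1/2.8 of the GPU hours.
new_spend = monthly_spend / speedup
print(f"New spend: ${new_spend:.0f}/month, saving ${monthly_spend - new_spend:.0f}/month")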

Get Maximum Performance from Your GPUs

Run vLLM on H100s for $2.80/hour with pre-configured optimizations.

Get $5 Free Credit →

Common Issues

OOM Errors

Reduce --gpu-memory-utilization or --max-model-len

Low Throughput Despite Optimizations

Check your request patterns. A single request can't saturate the GPU; the throughput numbers above assume concurrent load.

High Latency for Single Requests

Consider speculative decoding or use a smaller model for low-latency needs.

Conclusion

With proper optimization, vLLM can deliver 2-3x the throughput of its default settings. The key optimizations:

  1. Max out GPU memory utilization (0.95)
  2. Reduce max context if possible
  3. Enable prefix caching for repeated prompts
  4. Use FP8 quantization on H100s
  5. Enable chunked prefill for long prompts

These changes can cut your inference costs by 60%+ without any code changes to your application.

Get started with GPUBrazil and deploy optimized vLLM servers today!