Why vLLM?
vLLM has become the go-to engine for LLM inference. Its PagedAttention KV-cache management delivers 2-4x higher throughput than serving stacks without it, and it's used by companies serving billions of tokens daily.
But default settings leave performance on the table. This guide shows you how to squeeze every last token/second from your setup.
Baseline: Default vLLM Performance
Running Llama 3 8B with defaults on an H100:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-8B-Instruct
Typical throughput: ~500-800 tokens/second (batched)
After optimization: 1,500-2,000+ tokens/second
Optimization 1: GPU Memory Utilization
By default, vLLM caps itself at 90% of GPU memory (gpu-memory-utilization defaults to 0.9). For a dedicated inference server, push it higher:
--gpu-memory-utilization 0.95
This gives vLLM more space for KV cache, enabling more concurrent requests.
⚠️ Caution
Don't go above 0.95. Leave some room for CUDA operations. Going to 0.99 can cause OOM errors.
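If you use vLLM's offline Python API instead of the server, the same knob is available as a constructor argument. A minimal sketch (the model and prompt are just placeholders):

```python
# Minimal offline-inference sketch; gpu_memory_utilization mirrors the
# --gpu-memory-utilization server flag. Model and prompt are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    gpu_memory_utilization=0.95,  # default is 0.90
)
params = SamplingParams(max_tokens=64, temperature=0.7)
outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
print(outputs[0].outputs[0].text)
```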
Optimization 2: Tune Max Model Length
If your use case doesn't need full context length, reduce it:
# Default Llama 3 context: 8192
# If you only need 2048:
--max-model-len 2048
Shorter context = more KV cache space = more concurrent requests = higher throughput.
| Max Length | Concurrent Requests | Throughput (H100) |
|---|---|---|
| 8192 | ~16 | ~800 tok/s |
| 4096 | ~32 | ~1,200 tok/s |
| 2048 | ~60 | ~1,800 tok/s |
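To see why this works, here's a rough back-of-envelope for the KV-cache footprint of one full-length request, assuming Llama 3 8B's published shape (32 layers, 8 KV heads, head dim 128) and a 16-bit cache; exact numbers will vary with your setup:

```python
# Back-of-envelope KV-cache footprint per request (illustrative, not measured).
# Assumed Llama 3 8B shape: 32 layers, 8 KV heads (GQA), head_dim 128, fp16 cache.
LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_VALUE = 32, 8, 128, 2

kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE  # K and V
for max_len in (8192, 4096, 2048):
    per_request_gb = kv_bytes_per_token * max_len / 1024**3
    print(f"max-model-len {max_len}: ~{per_request_gb:.2f} GB KV cache per full-length request")
```

Halving the max length halves the per-request cache footprint, which is where the extra concurrency in the table comes from.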
Optimization 3: Tensor Parallelism
For larger models or multi-GPU setups:
# For 8x GPU setup
--tensor-parallel-size 8
# For 2x GPU setup
--tensor-parallel-size 2
Match tensor parallelism to your GPU count for even memory distribution.
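A small helper like this sketch (assuming PyTorch is installed) can build the launch command from whatever GPUs are visible:

```python
# Sketch: derive --tensor-parallel-size from the visible GPU count.
import torch

tp = max(torch.cuda.device_count(), 1)
cmd = (
    "python -m vllm.entrypoints.openai.api_server "
    "--model meta-llama/Meta-Llama-3-70B-Instruct "
    f"--tensor-parallel-size {tp}"
)
print(cmd)
```

Note that the tensor parallel size has to divide the model's attention-head count evenly, so 2, 4, or 8 are the safe choices for the Llama models above.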
Optimization 4: Quantization
FP8 quantization on H100 gives ~1.5x speedup with minimal quality loss:
--quantization fp8
--dtype float16
For A100/older GPUs, use AWQ or GPTQ quantized models:
--model TheBloke/Llama-2-70B-AWQ
--quantization awq
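If you're not sure which path applies to your hardware, a quick capability check helps: hardware FP8 needs Hopper- or Ada-class GPUs (compute capability 8.9+), while AWQ/GPTQ checkpoints work on A100s and older. A sketch, assuming PyTorch:

```python
# Sketch: pick a quantization strategy based on GPU compute capability.
import torch

major, minor = torch.cuda.get_device_capability(0)
if (major, minor) >= (8, 9):   # H100 is (9, 0); L40S/Ada is (8, 9)
    print("Hardware FP8 supported: use --quantization fp8")
else:
    print("No FP8 support: use an AWQ/GPTQ checkpoint, e.g. --quantization awq")
```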
Optimization 5: Speculative Decoding
Use a smaller draft model to speed up generation:
--speculative-model meta-llama/Llama-3.2-1B
--num-speculative-tokens 5
This can give 1.5-2x speedup for long-form generation tasks.
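The same settings are exposed through the Python API in vLLM versions that use these flag names. The speculative-decoding interface has changed across releases, so treat this as a sketch rather than a stable recipe:

```python
# Sketch only: keyword names mirror the CLI flags above and have changed
# across vLLM releases; check the docs for your installed version.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    speculative_model="meta-llama/Llama-3.2-1B",  # small draft model
    num_speculative_tokens=5,                     # draft tokens proposed per step
)
```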
Optimization 6: Prefix Caching
If many requests share the same system prompt, enable prefix caching:
--enable-prefix-caching
This caches KV values for common prefixes, reducing computation for repeated prompts.
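The effect is easiest to see when a batch of requests shares one long system prompt. A sketch with the offline API (prompts are illustrative):

```python
# Sketch: requests sharing one system prompt benefit most from prefix caching,
# since the shared prefix's KV values are computed once and reused.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_prefix_caching=True,  # same effect as --enable-prefix-caching
)
system = "You are a support assistant for ACME Corp. Always answer politely. "
questions = ["How do I reset my password?", "Where is my invoice?", "Cancel my plan."]
outputs = llm.generate([system + q for q in questions], SamplingParams(max_tokens=64))
for out in outputs:
    print(out.outputs[0].text[:60])
```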
Optimization 7: Chunked Prefill
For long prompts, chunked prefill prevents stalling other requests:
--enable-chunked-prefill
--max-num-batched-tokens 8192
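The intuition: --max-num-batched-tokens is a per-step token budget shared by all requests, so a long prompt's prefill is split into chunks that fit alongside ongoing decode tokens instead of monopolizing whole steps. A rough, hypothetical illustration:

```python
# Rough illustration of chunked prefill scheduling (numbers are hypothetical).
import math

budget = 8192           # --max-num-batched-tokens: per-step token budget
decode_tokens = 60      # one token each for 60 in-flight decoding requests
prompt_tokens = 32_000  # a long incoming prompt (e.g. on a 32K-context model)

per_step_prefill = budget - decode_tokens
steps = math.ceil(prompt_tokens / per_step_prefill)
print(f"Prefill spread over ~{steps} steps; decodes keep running every step")
```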
Complete Optimized Configuration
Here's the full optimized command for Llama 3 8B on a single H100:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--gpu-memory-utilization 0.95 \
--max-model-len 4096 \
--enable-prefix-caching \
--enable-chunked-prefill \
--max-num-batched-tokens 8192 \
--dtype bfloat16 \
--port 8000
For 70B on 8x H100:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-70B-Instruct \
--tensor-parallel-size 8 \
--gpu-memory-utilization 0.95 \
--max-model-len 4096 \
--enable-prefix-caching \
--quantization fp8 \
--dtype float16 \
--port 8000
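Once the server is up, a quick smoke test against the OpenAI-compatible endpoint confirms everything is wired correctly (this assumes the openai Python package and the server above listening on port 8000):

```python
# Smoke test for the OpenAI-compatible server started above (assumed on :8000).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # must match the --model you launched
    messages=[{"role": "user", "content": "Say hello in five words."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```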
Benchmarking Your Setup
Use vLLM's built-in benchmark tool:
python -m vllm.entrypoints.openai.api_server ... &
# In another terminal:
python -m vllm.benchmarks.benchmark_serving \
--backend vllm \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--num-prompts 1000 \
--request-rate inf
Key metrics to watch:
- Throughput: tokens/second (higher is better)
- TTFT: Time to first token (lower is better)
- ITL: Inter-token latency (lower is better)
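If you just want a quick TTFT reading without the full benchmark script, a small streaming request against the completions endpoint is enough. A sketch, assuming the requests package and the server on :8000:

```python
# Sketch: measure time-to-first-token (TTFT) via a streaming completion request.
import time
import requests

payload = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "prompt": "Write a haiku about GPUs.",
    "max_tokens": 64,
    "stream": True,
}
start = time.time()
ttft = None
with requests.post("http://localhost:8000/v1/completions", json=payload, stream=True) as r:
    for line in r.iter_lines():
        # The server streams SSE lines like "data: {...}" and ends with "data: [DONE]".
        if line and line.startswith(b"data: ") and line != b"data: [DONE]":
            if ttft is None:
                ttft = time.time() - start
print(f"TTFT: {ttft:.3f}s" if ttft is not None else "no tokens received")
```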
Results: Before vs After
| Configuration | Throughput | Improvement |
|---|---|---|
| vLLM defaults | ~750 tok/s | baseline |
| + gpu-memory 0.95 | ~900 tok/s | +20% |
| + max-len 4096 | ~1,200 tok/s | +60% |
| + prefix caching | ~1,400 tok/s | +87% |
| + chunked prefill | ~1,600 tok/s | +113% |
| + FP8 quant | ~2,100 tok/s | +180% |
💡 Cost Impact
A 2.8x throughput gain means the same workload needs roughly 1/2.8 ≈ 36% of the GPU hours. If you're spending $1,000/month on inference, these optimizations save you roughly $640/month.
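The arithmetic, for anyone double-checking (numbers taken from the table above):

```python
# Sanity check on the savings math using the throughput table above.
baseline, optimized = 750, 2100        # tok/s before and after
monthly_spend = 1000.0                 # example $/month at baseline throughput
speedup = optimized / baseline         # ~2.8x
new_spend = monthly_spend / speedup    # same workload, fewer GPU-hours
print(f"~${monthly_spend - new_spend:,.0f}/month saved")  # ~$643
```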
Common Issues
OOM Errors
Reduce --gpu-memory-utilization or --max-model-len
Low Throughput Despite Optimizations
Check your request patterns. A single request can't saturate the GPU; you need concurrent load, as the sketch below shows.
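A quick way to generate concurrent load from Python (a sketch, assuming the requests package and the 8B server on :8000):

```python
# Sketch: fire concurrent requests so continuous batching has work to do.
from concurrent.futures import ThreadPoolExecutor
import requests

def one_request(i: int) -> int:
    resp = requests.post(
        "http://localhost:8000/v1/completions",
        json={
            "model": "meta-llama/Meta-Llama-3-8B-Instruct",
            "prompt": f"Summarize request number {i} in one sentence.",
            "max_tokens": 64,
        },
    )
    return resp.json()["usage"]["completion_tokens"]

with ThreadPoolExecutor(max_workers=64) as pool:
    total = sum(pool.map(one_request, range(256)))
print(f"Generated {total} tokens across 256 concurrent requests")
```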
High Latency for Single Requests
Consider speculative decoding or use a smaller model for low-latency needs.
Conclusion
With proper optimization, vLLM can deliver 2-3x more throughput than default settings. The key optimizations:
- Max out GPU memory utilization (0.95)
- Reduce max context if possible
- Enable prefix caching for repeated prompts
- Use FP8 quantization on H100s
- Enable chunked prefill for long prompts
These changes can cut your inference costs by 60%+ without any code changes to your application.
Get started with GPUBrazil and deploy optimized vLLM servers today!