Why Latency Matters
In user-facing AI applications, every millisecond counts:
- 100ms: Feels instant
- 300ms: Noticeable delay
- 1 second: User frustration begins
- 3+ seconds: Users abandon
This guide covers techniques to achieve real-time inference for LLMs, vision models, and audio processing.
Latency Breakdown
Total latency = Network + Queue + Processing + Response. Optimize all four components.
Understanding Latency Components
Typical latency breakdown for an LLM request (total: 850ms):
| Component | Time | What happens |
|---|---|---|
| Network | 50ms | Client → Server |
| Queue | 100ms | Wait |
| Processing | 600ms | Tokenize + Generate |
| Response | 100ms | Server → Client |
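The breakdown can be turned into a quick calculation to see which component dominates. This is a minimal sketch using the example figures from the breakdown; the variable names are illustrative.

```python
# Example figures from the latency breakdown (milliseconds)
breakdown_ms = {"Network": 50, "Queue": 100, "Processing": 600, "Response": 100}

total_ms = sum(breakdown_ms.values())
bottleneck = max(breakdown_ms, key=breakdown_ms.get)

# Print each component with its share of the total
for name, ms in breakdown_ms.items():
    print(f"{name:<10} {ms:>4}ms  {ms / total_ms:.0%}")
print(f"Total      {total_ms:>4}ms (largest component: {bottleneck})")
```

Whichever component takes the largest share is usually the first place to look.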
Optimization 1: Streaming Responses
Instead of waiting for full generation, stream tokens as they're produced:
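# Server-side streaming with FastAPI
# vLLM's offline LLM.generate() returns only completed outputs,
# so streaming goes through the async engine instead.
import uuid
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

app = FastAPI()
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="meta-llama/Llama-3.1-8B-Instruct")
)

class GenerateRequest(BaseModel):
    prompt: str  # Read from the JSON request body

@app.post("/generate/stream")
async def generate_stream(request: GenerateRequest):
    async def generate():
        sampling_params = SamplingParams(
            max_tokens=512,
            temperature=0.7
        )
        sent = 0  # Characters already sent to the client
        # Each output carries the full text so far; send only the new part
        async for output in engine.generate(
            request.prompt,
            sampling_params,
            request_id=str(uuid.uuid4())
        ):
            text = output.outputs[0].text
            delta, sent = text[sent:], len(text)
            if delta:
                yield f"data: {delta}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(
        generate(),
        media_type="text/event-stream"
    )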
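# Client-side consumption
import asyncio
import aiohttp

async def consume_stream():
    async with aiohttp.ClientSession() as session:
        async with session.post(
            "http://api/generate/stream",
            json={"prompt": "Hello"}
        ) as response:
            # Iterating response.content yields one line at a time
            async for line in response.content:
                if line.startswith(b"data: "):
                    # Drop the trailing newline so "[DONE]" compares equal
                    token = line[6:].decode().rstrip("\n")
                    if token != "[DONE]":
                        print(token, end="", flush=True)

asyncio.run(consume_stream())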
Time to First Token (TTFT)
With streaming, users see output within 50-200ms instead of waiting seconds:
| Metric | Without Streaming | With Streaming |
|---|---|---|
| Time to first visible output | 3-5 seconds | 50-200ms |
| Perceived responsiveness | Slow | Instant |
| User can cancel early | No | Yes |
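Streaming does not shorten the total generation time; it changes when the first output becomes visible. The sketch below makes that arithmetic explicit. The function name and the TTFT, per-token, and length figures are assumptions chosen for illustration, not measurements.

```python
def time_to_first_output_ms(ttft_ms: float, tpot_ms: float,
                            n_tokens: int, streaming: bool) -> float:
    """Time until the user sees anything on screen."""
    # Full generation time: first token, then one step per remaining token
    total_ms = ttft_ms + (n_tokens - 1) * tpot_ms
    return ttft_ms if streaming else total_ms

# Assumed figures: 100ms TTFT, 20ms per token, 200 output tokens
print(time_to_first_output_ms(100, 20, 200, streaming=False))  # 4080
print(time_to_first_output_ms(100, 20, 200, streaming=True))   # 100
```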
Optimization 2: Speculative Decoding
A smaller draft model proposes several tokens ahead, and the large model then verifies them together in a single forward pass:
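# Speculative decoding in vLLM
# These argument names follow older vLLM releases; newer versions
# configure drafting through a speculative_config dict, so check the
# documentation for your installed version.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_model="meta-llama/Llama-3.1-8B-Instruct",
    num_speculative_tokens=5,  # Draft 5 tokens at a time
    use_v2_block_manager=True
)
# Typically 1.5-2x speedup with the same output quality
outputs = llm.generate(
    ["Explain quantum computing"],
    SamplingParams(max_tokens=256)
)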
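To see why the output quality is unchanged, here is a toy sketch of the draft-then-verify loop for greedy decoding. Both "models" are made-up deterministic functions, and the helper names are hypothetical. Real engines verify all drafted positions in one batched forward pass and use a probabilistic acceptance rule when sampling.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # maps a token sequence to the next token

def target_model(tokens: List[int]) -> int:
    # Toy "large" model: a deterministic function of the context
    return (sum(tokens) * 31 + len(tokens)) % 50

def draft_model(tokens: List[int]) -> int:
    # Toy "small" model: wrong whenever the context length is a multiple of 4
    guess = target_model(tokens)
    return guess if len(tokens) % 4 else (guess + 1) % 50

def greedy_decode(model: Model, prompt: List[int], max_new: int) -> List[int]:
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(model(tokens))
    return tokens[len(prompt):]

def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       max_new: int, k: int = 5) -> Tuple[List[int], int]:
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new:
        # 1. Draft k tokens with the cheap model
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. Verify (counted as one target pass, as in a real engine)
        target_passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            accepted.append(expected)  # always keep the target's choice
            if tok != expected:
                break  # first mismatch: discard the rest of the draft
        else:
            # Every draft token matched, so the target adds one bonus token
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[len(prompt):len(prompt) + max_new], target_passes

output, passes = speculative_decode(target_model, draft_model, [1, 2, 3], 20)
print(output == greedy_decode(target_model, [1, 2, 3], 20), passes)
```

Every kept token is the one the target model would have chosen, so the result matches plain greedy decoding while needing fewer target passes than tokens generated.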
Optimization 3: KV Cache Optimization
Prefix Caching
Cache KV values for common prefixes (system prompts):
# vLLM automatic prefix caching
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_prefix_caching=True  # Cache common prefixes
)
# System prompt is cached after first request
system_prompt = """You are a helpful assistant. You provide clear,
accurate answers. You always cite sources when possible."""
user_queries = ["What is a KV cache?", "Why does TTFT matter?"]  # Example queries
# Subsequent requests with same prefix are faster
for user_query in user_queries:
    prompt = f"{system_prompt}\n\nUser: {user_query}\nAssistant:"
    outputs = llm.generate([prompt], SamplingParams(max_tokens=128))
# First request: ~200ms
# Subsequent: ~50ms (prefix cached)
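The saving comes from skipping the prefill work for the shared prefix. The toy model below counts how many tokens each request has to process. The class name and the 200-token system prompt are assumptions for illustration. A real engine stores KV tensors and matches prefixes at block granularity rather than as one whole key.

```python
from typing import Dict, List, Tuple

class ToyPrefixCache:
    """Counts prefill tokens with and without a cached prefix."""

    def __init__(self) -> None:
        self._cached: Dict[Tuple[int, ...], str] = {}
        self.tokens_computed = 0

    def prefill(self, prefix: List[int], suffix: List[int]) -> int:
        """Return the number of tokens computed for this request."""
        computed = len(suffix)  # the user-specific part is always computed
        key = tuple(prefix)
        if key not in self._cached:
            computed += len(prefix)  # first time: compute the shared prefix too
            self._cached[key] = "kv-placeholder"  # stand-in for real KV tensors
        self.tokens_computed += computed
        return computed

cache = ToyPrefixCache()
system_prompt = list(range(200))  # pretend the system prompt is 200 tokens
first = cache.prefill(system_prompt, [1, 2, 3, 4, 5])
second = cache.prefill(system_prompt, [6, 7, 8])
print(first, second)  # 205 3
```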
PagedAttention
# vLLM uses PagedAttention by default
# It manages KV cache memory in fixed-size blocks instead of
# reserving one contiguous buffer per request
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.9,  # Use 90% of VRAM
    max_num_seqs=256,  # Max concurrent requests
)
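To get a feel for how much memory the KV cache needs, the per-token size can be estimated from the model shape. The sketch below uses the published Llama 3.1 8B configuration (32 layers, 8 KV heads, head dimension 128) and assumes a block size of 16 tokens. The helper names are illustrative.

```python
import math

def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    # Factor of 2: one key vector and one value vector per layer and KV head
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def blocks_needed(num_tokens: int, block_size: int = 16) -> int:
    # Paged allocation hands out whole blocks, so round up
    return math.ceil(num_tokens / block_size)

# Llama 3.1 8B in FP16 (2 bytes per value)
per_token = kv_bytes_per_token(32, 8, 128, 2)
print(per_token // 1024, "KiB per token")           # 128 KiB per token
print(blocks_needed(100), "blocks for 100 tokens")  # 7 blocks for 100 tokens
```

Because blocks are allocated on demand, a short request only holds the blocks it has actually filled.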
Optimization 4: Model Optimization
Quantization
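# Weight quantization for faster decoding
# AWQ and GPTQ checkpoints are typically 4-bit (INT4) weight-only formats
from vllm import LLM

llm = LLM(
    # Replace with a checkpoint already quantized with the chosen method
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantization="awq",  # or "gptq", "squeezellm"
    dtype="half"
)
# Latency comparison (A100):
# FP16: 45ms/token
# INT8: 25ms/token
# INT4: 18ms/token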
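Much of the gain comes from moving fewer bytes: token-by-token decoding reads the model weights on every step, so smaller weights mean less memory traffic. Here is a rough estimate of weight size at each precision. It assumes about 8 billion parameters and ignores the small overhead of quantization scales.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    # bits -> bytes -> gigabytes
    return num_params * bits_per_weight / 8 / 1e9

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(8e9, bits):.0f} GB")
```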
TensorRT-LLM
# Maximum performance with TensorRT-LLM
# See our TensorRT guide for details
# Build optimized engine
trtllm-build \
--checkpoint_dir ./checkpoint \
--output_dir ./engine \
--gemm_plugin float16 \
--max_batch_size 64 \
--paged_kv_cache enable
# 2-3x faster than vanilla PyTorch
Need Low-Latency GPUs?
H100 GPUs on GPUBrazil deliver 2x faster inference than A100. Perfect for real-time applications.
Optimization 5: Infrastructure
Geographic Proximity
# Deploy inference close to users
#
# User Location → Nearest Inference Region
# US East → Virginia
# US West → California
# Europe → Frankfurt
# Asia → Singapore
# Network latency savings: 50-200ms per request
Connection Pooling
# Reuse connections to avoid handshake overhead
import aiohttp

async def create_session() -> aiohttp.ClientSession:
    # Create persistent session (inside a coroutine so it binds to the running event loop)
    connector = aiohttp.TCPConnector(
        limit=100,
        keepalive_timeout=30,
        enable_cleanup_closed=True
    )
    return aiohttp.ClientSession(connector=connector)

# Reuse for all requests
# Saves ~50ms per request (TCP + TLS handshake)
GPU Warmup
# Warm up GPU before serving traffic
from vllm import SamplingParams

def warmup_model(llm, num_warmup=10):
    """Run warmup requests so one-time CUDA initialization happens before real traffic"""
    warmup_prompt = "Hello, how are you?"
    for _ in range(num_warmup):
        llm.generate([warmup_prompt], SamplingParams(max_tokens=10))
    print("Model warmed up!")

# First request after cold start: ~500ms
# After warmup: ~50ms
Optimization 6: Batching Strategies
Continuous Batching
# vLLM handles this automatically
# New requests join batch as slots become available
[Diagram: timeline of Req 1 to Req 4 drawn as bars of different lengths sharing one batch over time]
# Requests complete at different times
# No waiting for batch to fill
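The difference from static batching is easiest to see in a small simulation. This is a toy model, not vLLM's scheduler: each step generates one token per running request, and the function names, request lengths, and slot count are assumptions for illustration.

```python
from collections import deque
from typing import List

def static_batching(lengths: List[int], slots: int) -> List[int]:
    """Each batch runs until its longest request finishes."""
    finish, clock = [], 0
    for i in range(0, len(lengths), slots):
        batch = lengths[i:i + slots]
        clock += max(batch)
        finish.extend([clock] * len(batch))  # results return when the batch ends
    return finish

def continuous_batching(lengths: List[int], slots: int) -> List[int]:
    """A freed slot is refilled from the queue before the next step."""
    queue = deque(enumerate(lengths))
    running = {}  # request index -> tokens still to generate
    finish = [0] * len(lengths)
    clock = 0
    while queue or running:
        while queue and len(running) < slots:
            idx, n = queue.popleft()
            running[idx] = n
        clock += 1  # one decode step for every running request
        for idx in list(running):
            running[idx] -= 1
            if running[idx] == 0:
                finish[idx] = clock
                del running[idx]
    return finish

lengths = [4, 10, 3, 8]  # tokens to generate per request
print(static_batching(lengths, slots=2))      # [10, 10, 18, 18]
print(continuous_batching(lengths, slots=2))  # [4, 10, 7, 15]
```

Short requests finish as soon as their own tokens are done, and the queued requests start earlier.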
Dynamic Batching with Timeout
# Triton dynamic batching config
dynamic_batching {
  preferred_batch_size: [ 4, 8, 16 ]
  max_queue_delay_microseconds: 100000  # 100ms max wait
}
# Trade-off:
# - Lower delay = faster response, lower throughput
# - Higher delay = slower response, higher throughput
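The trade-off can be explored with a simplified model of the policy: close a batch when it is full or when its oldest request has waited past the maximum delay. This sketch is an illustration of the idea, not Triton's actual scheduler, and the function name and arrival times are made up.

```python
from typing import List

def form_batches(arrivals_ms: List[int], max_batch_size: int,
                 max_delay_ms: int) -> List[List[int]]:
    """Group request arrival times into batches."""
    batches: List[List[int]] = []
    current: List[int] = []
    deadline = 0
    for t in arrivals_ms:
        if current and t > deadline:
            batches.append(current)  # oldest request waited long enough
            current = []
        if not current:
            deadline = t + max_delay_ms  # clock starts with the first request
        current.append(t)
        if len(current) == max_batch_size:
            batches.append(current)  # batch is full, send it right away
            current = []
    if current:
        batches.append(current)
    return batches

arrivals = [0, 20, 40, 150, 160, 400]
print(form_batches(arrivals, max_batch_size=4, max_delay_ms=100))
# [[0, 20, 40], [150, 160], [400]]
print(form_batches(arrivals, max_batch_size=2, max_delay_ms=100))
# [[0, 20], [40], [150, 160], [400]]
```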
Profiling and Measurement
import time
from dataclasses import dataclass
from typing import Iterable

@dataclass
class LatencyMetrics:
    total_ms: float
    ttft_ms: float  # Time to first token
    tpot_ms: float  # Time per output token
    tokens_generated: int

def measure_inference(token_stream: Iterable[str]) -> LatencyMetrics:
    """Time any iterator that yields tokens as they are generated."""
    start = time.perf_counter()
    first_token_time = None
    tokens = 0
    for _ in token_stream:
        if first_token_time is None:
            first_token_time = time.perf_counter()
        tokens += 1
    end = time.perf_counter()
    if first_token_time is None:
        raise ValueError("stream produced no tokens")
    total_ms = (end - start) * 1000
    ttft_ms = (first_token_time - start) * 1000
    # TPOT covers the tokens after the first, so divide by tokens - 1
    generation_time = end - first_token_time
    tpot_ms = generation_time / max(tokens - 1, 1) * 1000
    return LatencyMetrics(
        total_ms=total_ms,
        ttft_ms=ttft_ms,
        tpot_ms=tpot_ms,
        tokens_generated=tokens
    )

# Profile your setup
# stream_tokens is a placeholder for your own streaming client, since
# vLLM's offline llm.generate() does not yield tokens incrementally
metrics = measure_inference(stream_tokens("Explain AI in one paragraph"))
print(f"TTFT: {metrics.ttft_ms:.0f}ms")
print(f"TPOT: {metrics.tpot_ms:.1f}ms")
print(f"Total: {metrics.total_ms:.0f}ms")
Real-Time Vision Models
import time
import torch
from torchvision import models

# Example model and input; substitute your own
model = models.resnet50(weights=None).eval()
input_tensor = torch.randn(1, 3, 224, 224).cuda().half()

# Optimize for inference
model = torch.jit.script(model)  # JIT compile
model = model.cuda().half()  # FP16
# Warm up
with torch.no_grad():
    for _ in range(10):
        _ = model(input_tensor)
# Benchmark (synchronize so queued GPU work is included in the timing)
torch.cuda.synchronize()
start = time.perf_counter()
with torch.no_grad():
    for _ in range(100):
        output = model(input_tensor)
torch.cuda.synchronize()
end = time.perf_counter()
avg_ms = (end - start) / 100 * 1000
print(f"Average latency: {avg_ms:.2f}ms")
# Further optimizations:
# - TensorRT: 2-3x faster
# - INT8 quantization: 2x faster
# - Smaller input resolution: cost scales roughly with pixel count
Audio Processing Latency
# Whisper streaming transcription
import numpy as np
from faster_whisper import WhisperModel

model = WhisperModel(
    "large-v3",
    device="cuda",
    compute_type="float16"
)

def transcribe_chunk(audio_chunk: np.ndarray) -> str:
    """Process audio in chunks for lower latency"""
    segments, _ = model.transcribe(
        audio_chunk,
        vad_filter=True,
        vad_parameters=dict(min_silence_duration_ms=500)
    )
    return " ".join(s.text for s in segments)
# Chunk-based processing:
# - Process 5-second chunks
# - Overlap by 0.5 seconds
# - Latency: ~1-2 seconds behind real-time
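The chunking scheme described above can be sketched as a function that returns sample ranges. It assumes 16 kHz audio, which is the rate Whisper models expect. The function name is illustrative, and each range could be used to slice the audio array before transcribing it.

```python
from typing import List, Tuple

def chunk_bounds(n_samples: int, sample_rate: int = 16000,
                 chunk_s: float = 5.0, overlap_s: float = 0.5) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices for overlapping chunks."""
    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)  # advance less than a full chunk
    bounds = []
    start = 0
    while start < n_samples:
        end = min(start + chunk, n_samples)
        bounds.append((start, end))
        if end == n_samples:
            break  # last chunk reached the end of the audio
        start += step
    return bounds

# 10 seconds of audio at 16 kHz
print(chunk_bounds(160000))
# [(0, 80000), (72000, 152000), (144000, 160000)]
```

The overlap gives words that straddle a chunk boundary a chance to appear whole in at least one chunk.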
Latency Checklist
- ✓ Enable response streaming
- ✓ Use quantized models (AWQ/GPTQ)
- ✓ Enable prefix caching
- ✓ Deploy geographically close to users
- ✓ Use connection pooling
- ✓ Warm up models before serving
- ✓ Profile and measure regularly
- ✓ Consider speculative decoding for LLMs
- ✓ Use TensorRT for vision models
Benchmarks by GPU
| GPU | LLaMA 8B TTFT | Tokens/sec | Cost/hr |
|---|---|---|---|
| RTX 4090 | ~80ms | ~90 | $0.40 |
| A100 40GB | ~60ms | ~120 | $1.50 |
| A100 80GB | ~55ms | ~130 | $2.00 |
| H100 80GB | ~35ms | ~200 | $3.50 |
Measured with vLLM, FP16, single GPU
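The table can also be read as cost per generated token. The sketch below derives that from the tokens/sec and cost/hr columns. It assumes a single request stream running at the listed rate; serving many requests in a batch raises throughput and lowers the cost per token.

```python
# Figures copied from the table: (tokens per second, cost per hour in USD)
gpus = {
    "RTX 4090": (90, 0.40),
    "A100 40GB": (120, 1.50),
    "A100 80GB": (130, 2.00),
    "H100 80GB": (200, 3.50),
}

def cost_per_million_tokens(tokens_per_sec: float, cost_per_hour: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return cost_per_hour / tokens_per_hour * 1_000_000

for name, (tps, cost) in gpus.items():
    print(f"{name}: ${cost_per_million_tokens(tps, cost):.2f} per million tokens")
```

On these figures the fastest GPU is not the cheapest per token, so the choice depends on whether latency or cost matters more.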
Conclusion
Achieving sub-100ms latency requires optimization at every layer:
- Model level: Quantization, speculative decoding
- Serving level: Streaming, caching, batching
- Infrastructure: Geographic placement, connection reuse
Start with streaming: it has the biggest impact on perceived latency. Then profile and optimize the slowest components.
Deploy on GPUBrazil for low-latency GPU infrastructure optimized for real-time AI.