Why Latency Matters

In user-facing AI applications, every millisecond counts.

This guide covers techniques to achieve real-time inference for LLMs, vision models, and audio processing.

💡 Latency Breakdown

Total latency = Network + Queue + Processing + Response. Optimize all four components.

Understanding Latency Components

# Typical latency breakdown for LLM request
┌────────────────────────────────────────────────────────────┐
│                        Total: 850ms                        │
├───────────────┬────────┬───────────────────┬───────────────┤
│    Network    │ Queue  │    Processing     │   Response    │
│     50ms      │ 100ms  │       600ms       │     100ms     │
├───────────────┼────────┼───────────────────┼───────────────┤
│ Client→Server │  Wait  │ Tokenize+Generate │ Server→Client │
└───────────────┴────────┴───────────────────┴───────────────┘
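To see where the optimization effort should go, the figures in the diagram can be turned into shares of the total. A minimal sketch using only the numbers above (the variable names are illustrative):

```python
# Latency figures from the breakdown above, in milliseconds.
budget_ms = {"network": 50, "queue": 100, "processing": 600, "response": 100}

total_ms = sum(budget_ms.values())
shares = {name: ms / total_ms for name, ms in budget_ms.items()}
bottleneck = max(budget_ms, key=budget_ms.get)

for name, ms in budget_ms.items():
    print(f"{name:<10} {ms:>4} ms  {shares[name]:6.1%}")
print(f"total      {total_ms:>4} ms  (largest component: {bottleneck})")
```

Processing accounts for about 71% of this example request, which is why most of the techniques in this guide target generation itself.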

Optimization 1: Streaming Responses

Instead of waiting for full generation, stream tokens as they're produced:

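# Server-side streaming with FastAPI
import uuid

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

app = FastAPI()
# LLM.generate() blocks until generation finishes, so token streaming
# goes through the async engine (API details vary by vLLM version).
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="meta-llama/Llama-3.1-8B-Instruct")
)

class GenerateRequest(BaseModel):
    prompt: str  # Read from the JSON body that the client sends

@app.post("/generate/stream")
async def generate_stream(request: GenerateRequest):
    async def generate():
        sampling_params = SamplingParams(max_tokens=512, temperature=0.7)
        sent = 0  # Characters already sent to the client

        # Each output carries the full text so far; send only the new part
        async for output in engine.generate(
            request.prompt, sampling_params, request_id=str(uuid.uuid4())
        ):
            text = output.outputs[0].text
            delta, sent = text[sent:], len(text)
            if delta:
                # Note: a delta containing a newline needs escaping
                # (for example JSON encoding) to remain valid SSE
                yield f"data: {delta}\n\n"

        yield "data: [DONE]\n\n"

    return StreamingResponse(
        generate(),
        media_type="text/event-stream"
    )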

# Client-side consumption
import aiohttp

async def consume_stream():
    async with aiohttp.ClientSession() as session:
        async with session.post(
            "http://api/generate/stream",
            json={"prompt": "Hello"}
        ) as response:
            async for line in response.content:
                if line.startswith(b"data: "):
                    # Strip only the line terminator so spaces inside tokens survive
                    token = line[6:].rstrip(b"\r\n").decode()
                    if token == "[DONE]":
                        break
                    print(token, end="", flush=True)
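The parsing step on the client is easy to get subtly wrong, because each line arrives with its terminator attached and tokens often begin with a meaningful space. One way to isolate that logic is a small helper like the sketch below; `parse_sse_data` is a hypothetical name, and it handles only single-line `data:` fields, not the full event-stream format.

```python
from typing import Optional


def parse_sse_data(line: bytes) -> Optional[str]:
    """Return the payload of an SSE `data:` line, or None for any other line."""
    if not line.startswith(b"data:"):
        return None  # blank separators, comments, and other fields
    payload = line[len(b"data:"):]
    if payload.startswith(b" "):
        payload = payload[1:]  # the format drops exactly one leading space
    # Remove only the line terminator; spaces inside the token are content.
    return payload.rstrip(b"\r\n").decode("utf-8")
```

Comparing the returned string against `"[DONE]"` then works as intended, since the trailing newline is already gone.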

Time to First Token (TTFT)

With streaming, users see output within 50-200ms instead of waiting seconds:

Metric                         Without Streaming   With Streaming
Time to first visible output   3-5 seconds         50-200ms
Perceived responsiveness       Slow                Instant
User can cancel early          No                  Yes

Optimization 2: Speculative Decoding

Use a smaller model to draft several tokens, then verify them with the large model:

# Speculative decoding in vLLM
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_model="meta-llama/Llama-3.1-8B-Instruct",
    num_speculative_tokens=5,  # Draft 5 tokens at a time
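    use_v2_block_manager=True
)
# Argument names vary by vLLM version; newer releases take a
# speculative_config dict instead of these keyword arguments.

# Reported speedup: 1.5-2x, with the same output quality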
outputs = llm.generate(
    ["Explain quantum computing"],
    SamplingParams(max_tokens=256)
)
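The draft-then-verify loop itself is simple to picture. The sketch below is a toy greedy version, not vLLM's implementation: `draft` and `target` are hypothetical functions that return the next token ID for a context, and the verification runs in a Python loop, whereas a real engine scores all drafted positions in one forward pass of the large model.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # hypothetical greedy next-token function


def speculative_step(draft: NextToken, target: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Run one draft-and-verify round and return the tokens it produced."""
    # 1. The small model proposes k tokens autoregressively.
    proposed, ctx = [], list(context)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The large model checks each proposal in order.
    accepted, ctx = [], list(context)
    for tok in proposed:
        expected = target(ctx)
        if expected != tok:
            accepted.append(expected)  # keep the target's token and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every proposal matched, so the target's next token comes for free.
    accepted.append(target(ctx))
    return accepted


def generate(draft: NextToken, target: NextToken, prompt: List[int],
             max_new_tokens: int, k: int = 5) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(draft, target, out, k))
    return out[: len(prompt) + max_new_tokens]
```

Under greedy decoding the result matches what the large model would have produced on its own; the gain comes from accepting several tokens per large-model pass whenever the draft guesses correctly.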

Optimization 3: KV Cache Optimization

Prefix Caching

Cache KV values for common prefixes (system prompts):

# vLLM automatic prefix caching
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_prefix_caching=True  # Cache common prefixes
)

# System prompt is cached after first request
system_prompt = """You are a helpful assistant. You provide clear,
accurate answers. You always cite sources when possible."""

# Example queries; replace with real traffic
user_queries = ["What is a KV cache?", "Why does prefix caching help?"]

# Subsequent requests with same prefix are faster
for user_query in user_queries:
    prompt = f"{system_prompt}\n\nUser: {user_query}\nAssistant:"
    outputs = llm.generate([prompt], SamplingParams(max_tokens=128))
    # First request: ~200ms
    # Subsequent: ~50ms (prefix cached)
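Why a shared system prompt helps can be shown with a toy cache that remembers which block-aligned prefixes it has already seen. This is a simplified picture under stated assumptions, not vLLM's implementation: `ToyPrefixCache` is a hypothetical class, token IDs stand in for real tokenizer output, and nothing is ever evicted.

```python
from typing import List, Set, Tuple


class ToyPrefixCache:
    """Tracks which block-aligned token prefixes already have cached KV entries."""

    def __init__(self, block_size: int = 4) -> None:
        self.block_size = block_size
        self.cached: Set[Tuple[int, ...]] = set()

    def lookup_and_store(self, tokens: List[int]) -> int:
        """Return how many leading tokens can reuse cached KV, then cache this prompt."""
        reused = 0
        matching = True
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            # A block's KV values depend on every token before it,
            # so the key is the whole prefix, not the block alone.
            prefix = tuple(tokens[:end])
            if matching and prefix in self.cached:
                reused = end
            else:
                matching = False
                self.cached.add(prefix)
        return reused
```

Only whole blocks are reusable, so a 10-token system prompt with a block size of 4 yields 8 reused tokens on the second request; the remaining tokens are prefilled as usual.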

PagedAttention

# vLLM uses PagedAttention by default
# Efficiently manages KV cache memory

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.9,  # Use 90% of VRAM
    max_num_seqs=256,  # Max concurrent requests
)
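How many tokens fit in the KV cache follows from simple arithmetic, which helps when choosing `gpu_memory_utilization` and `max_num_seqs`. The sketch below is a back-of-the-envelope estimate under assumptions: it takes Llama-3.1-8B's published shape (32 layers, 8 KV heads, head dimension 128), FP16 values, roughly 15 GiB of weights, and it ignores activation memory and any other runtime overhead. The function names are illustrative.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_value: int = 2) -> int:
    """KV cache bytes needed per token (the factor 2 covers keys and values)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(vram_gib: float, gpu_memory_utilization: float,
                      weight_gib: float, per_token_bytes: int) -> int:
    """Tokens that fit in the memory left after the weights are loaded."""
    budget_gib = vram_gib * gpu_memory_utilization - weight_gib
    return int(budget_gib * 1024**3 // per_token_bytes)


# Assumed model shape and weight size; adjust for your model.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
capacity = max_cached_tokens(80, 0.9, 15, per_token)
print(f"{per_token / 1024:.0f} KiB per token, about {capacity:,} cacheable tokens")
```

Under these assumptions each token costs 128 KiB, so an 80 GB card at 90% utilization holds on the order of a few hundred thousand tokens shared across all concurrent sequences.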

Optimization 4: Model Optimization

Quantization

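# Weight quantization (AWQ and GPTQ checkpoints are typically 4-bit)
from vllm import LLM

llm = LLM(
    # Point this at a checkpoint already quantized with the chosen method;
    # quantization="awq" does not quantize FP16 weights on the fly.
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantization="awq",  # or "gptq", "squeezellm"
    dtype="half"
)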

# Latency comparison (A100):
# FP16:  45ms/token
# INT8:  25ms/token
# INT4:  18ms/token
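The basic idea behind these formats is to store weights as small integers plus a scale factor. The sketch below shows plain symmetric round-to-nearest INT8 quantization on a list of floats; it is a teaching example only, and is not how AWQ or GPTQ choose their scales.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, f"scale={scale:.4f}", f"max error={max_error:.4f}")
```

Each weight now takes one byte instead of two, which halves the memory read per generated token; the price is a rounding error of at most half the scale per weight.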

TensorRT-LLM

# Maximum performance with TensorRT-LLM
# See our TensorRT guide for details

# Build optimized engine
trtllm-build \
    --checkpoint_dir ./checkpoint \
    --output_dir ./engine \
    --gemm_plugin float16 \
    --max_batch_size 64 \
    --paged_kv_cache enable

# 2-3x faster than vanilla PyTorch

Optimization 5: Infrastructure

Geographic Proximity

# Deploy inference close to users
#
# User Location → Nearest Inference Region
# US East       → Virginia
# US West       → California
# Europe        → Frankfurt
# Asia          → Singapore

# Network latency savings: 50-200ms per request

Connection Pooling

# Reuse connections to avoid handshake overhead
import aiohttp

# Create persistent session
connector = aiohttp.TCPConnector(
    limit=100,
    keepalive_timeout=30,
    enable_cleanup_closed=True
)

session = aiohttp.ClientSession(connector=connector)

# Reuse for all requests
# Saves ~50ms per request (TCP + TLS handshake)

GPU Warmup

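# Warm up GPU before serving traffic
from vllm import SamplingParams

def warmup_model(llm, num_warmup=10):
    """Run a few short requests so one-time setup (CUDA context, kernel
    selection, memory allocation) happens before real traffic arrives."""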
    warmup_prompt = "Hello, how are you?"
    
    for _ in range(num_warmup):
        llm.generate([warmup_prompt], SamplingParams(max_tokens=10))
    
    print("Model warmed up!")

# First request after cold start: ~500ms
# After warmup: ~50ms

Optimization 6: Batching Strategies

Continuous Batching

# vLLM handles this automatically
# New requests join batch as slots become available

┌─────────────────────────────────────────┐
│ Time →                                  │
├─────────────────────────────────────────┤
│ Req 1: ████████████████                 │
│ Req 2:     ████████████████████         │
│ Req 3:         ████████████             │
│ Req 4:             ████████████████████ │
└─────────────────────────────────────────┘

# Requests complete at different times
# No waiting for batch to fill
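The effect on completion times can be simulated in a few lines. This is a toy model with assumptions spelled out: all requests are queued at time zero, every decode step costs the same regardless of how many slots are busy, and request length is measured in decode steps. Both function names are illustrative.

```python
import heapq
from typing import List


def static_batching(lengths: List[int], slots: int) -> List[int]:
    """Finish step per request when each batch runs until its longest member ends."""
    finish, now = [], 0
    for i in range(0, len(lengths), slots):
        batch = lengths[i:i + slots]
        now += max(batch)
        finish.extend([now] * len(batch))
    return finish


def continuous_batching(lengths: List[int], slots: int) -> List[int]:
    """Finish step per request when a freed slot is refilled immediately."""
    free_at = [0] * slots  # min-heap of the steps at which slots become free
    heapq.heapify(free_at)
    finish = []
    for length in lengths:
        start = heapq.heappop(free_at)
        finish.append(start + length)
        heapq.heappush(free_at, start + length)
    return finish
```

With lengths `[8, 2, 2, 2]` and two slots, the static schedule finishes at steps `[8, 8, 10, 10]` while the continuous one finishes at `[8, 2, 4, 6]`: short requests no longer wait behind a long neighbour.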

Dynamic Batching with Timeout

# Triton dynamic batching config
dynamic_batching {
    preferred_batch_size: [ 4, 8, 16 ]
    max_queue_delay_microseconds: 100000  # 100ms max wait
}

# Trade-off:
# - Lower delay = faster response, lower throughput
# - Higher delay = slower response, higher throughput
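The same size-or-timeout rule can be sketched in application code with `asyncio`. This is an illustration of the idea rather than Triton's scheduler: `DynamicBatcher` and `run_batch` are hypothetical names, `run_batch` is assumed to be a synchronous function that returns one result per input, and error handling is left out.

```python
import asyncio
from typing import Any, Callable, List


class DynamicBatcher:
    """Collects requests until the batch is full or the oldest has waited too long."""

    def __init__(self, run_batch: Callable[[List[Any]], List[Any]],
                 max_batch_size: int = 8, max_delay_s: float = 0.1) -> None:
        self.run_batch = run_batch
        self.max_batch_size = max_batch_size
        self.max_delay_s = max_delay_s
        self.queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, item: Any) -> Any:
        """Enqueue one request and wait for its result."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((item, future))
        return await future

    async def worker(self) -> None:
        """Run forever, forming and executing batches."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]  # block until the first request
            deadline = loop.time() + self.max_delay_s
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            results = self.run_batch([item for item, _ in batch])
            for (_, future), result in zip(batch, results):
                future.set_result(result)
```

Create the batcher inside a running event loop, start `worker()` as a background task, and call `await batcher.submit(x)` from request handlers. Lowering `max_delay_s` trades throughput for latency, exactly as in the Triton setting above.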

Profiling and Measurement

import time
from dataclasses import dataclass
from typing import Iterable

@dataclass
class LatencyMetrics:
    total_ms: float
    ttft_ms: float  # Time to first token
    tpot_ms: float  # Time per output token
    tokens_generated: int

def measure_inference(token_stream: Iterable) -> LatencyMetrics:
    """Time any iterable that yields one item per generated token.

    The offline LLM.generate() call returns only finished outputs, so pass
    a streaming source here, such as tokens read from the SSE endpoint.
    """
    start = time.perf_counter()
    first_token_time = None
    tokens = 0

    for _ in token_stream:
        if first_token_time is None:
            first_token_time = time.perf_counter()
        tokens += 1

    end = time.perf_counter()
    if first_token_time is None:
        raise ValueError("stream produced no tokens")

    total_ms = (end - start) * 1000
    ttft_ms = (first_token_time - start) * 1000
    # TPOT covers the tokens after the first one
    tpot_ms = (end - first_token_time) / max(tokens - 1, 1) * 1000

    return LatencyMetrics(
        total_ms=total_ms,
        ttft_ms=ttft_ms,
        tpot_ms=tpot_ms,
        tokens_generated=tokens
    )

# Profile your setup (stream_tokens is a placeholder for your streaming client)
metrics = measure_inference(stream_tokens("Explain AI in one paragraph"))
print(f"TTFT: {metrics.ttft_ms:.0f}ms")
print(f"TPOT: {metrics.tpot_ms:.1f}ms")
print(f"Total: {metrics.total_ms:.0f}ms")
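A single measurement says little about what users experience, since tail latency is usually what hurts. A common follow-up is to collect many samples and report percentiles. The sketch below uses the nearest-rank definition; `percentile` and `summarize` are illustrative helper names, and the sample values are made up for the demonstration.

```python
import math
from typing import Dict, List


def percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples_ms: List[float]) -> Dict[str, float]:
    """Report the median and the two tail percentiles most often tracked."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}


# Made-up TTFT samples in milliseconds, with one slow outlier.
print(summarize([10, 20, 30, 40, 1000]))
```

The median here stays at 30 ms while p95 and p99 jump to the outlier, which is the kind of gap an average would hide.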

Real-Time Vision Models

import time

import torch
from torchvision import models

# Example model and input; substitute your own
model = models.resnet50(weights=None).eval()
input_tensor = torch.randn(1, 3, 224, 224).cuda().half()

# Optimize for inference
model = torch.jit.script(model)  # JIT compile
model = model.cuda().half()      # FP16

# Warm up
for _ in range(10):
    with torch.no_grad():
        _ = model(input_tensor)

# Benchmark
torch.cuda.synchronize()
start = time.perf_counter()

for _ in range(100):
    with torch.no_grad():
        output = model(input_tensor)

torch.cuda.synchronize()
end = time.perf_counter()

avg_ms = (end - start) / 100 * 1000
print(f"Average latency: {avg_ms:.2f}ms")

# Further optimizations:
# - TensorRT: 2-3x faster
# - INT8 quantization: 2x faster
# - Smaller input resolution: compute scales roughly with pixel count

Audio Processing Latency

# Whisper streaming transcription
import numpy as np
from faster_whisper import WhisperModel

model = WhisperModel(
    "large-v3",
    device="cuda",
    compute_type="float16"
)

def transcribe_chunk(audio_chunk: np.ndarray) -> str:
    """Process audio in chunks for lower latency"""
    segments, _ = model.transcribe(
        audio_chunk,
        vad_filter=True,
        vad_parameters=dict(min_silence_duration_ms=500)
    )
    return " ".join(s.text for s in segments)

# Chunk-based processing:
# - Process 5-second chunks
# - Overlap by 0.5 seconds
# - Latency: ~1-2 seconds behind real-time
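The chunking scheme in the comments above comes down to computing overlapping sample ranges. A minimal sketch, assuming 16 kHz mono audio held in a one-dimensional array; `chunk_bounds` is a hypothetical helper, and merging the duplicated words in the overlap region is left out.

```python
from typing import Iterator, Tuple


def chunk_bounds(num_samples: int, sample_rate: int = 16000,
                 chunk_s: float = 5.0, overlap_s: float = 0.5
                 ) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) sample indices for overlapping fixed-length chunks."""
    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)
    if step <= 0:
        raise ValueError("overlap must be shorter than the chunk")
    start = 0
    while start < num_samples:
        yield start, min(start + chunk, num_samples)
        if start + chunk >= num_samples:
            break  # this chunk already reached the end of the audio
        start += step


# 12 seconds of audio at 16 kHz
for start, end in chunk_bounds(12 * 16000):
    print(f"samples {start}-{end}")
```

Each range can be sliced from the audio array and handed to `transcribe_chunk`. Shorter chunks reduce the delay behind real time but give the model less context per call.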

Latency Checklist

Benchmarks by GPU

GPU         LLaMA 8B TTFT   Tokens/sec   Cost/hr
RTX 4090    ~80ms           ~90          $0.40
A100 40GB   ~60ms           ~120         $1.50
A100 80GB   ~55ms           ~130         $2.00
H100 80GB   ~35ms           ~200         $3.50

Measured with vLLM, FP16, single GPU
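The table lists speed and price separately; dividing one by the other gives a cost per million generated tokens. The sketch below uses only the figures from the table and assumes the Tokens/sec column is the throughput the GPU actually sustains for the whole hour, which will not hold once batching or idle time enters the picture.

```python
# (tokens per second, cost per hour in USD), copied from the table above
benchmarks = {
    "RTX 4090": (90, 0.40),
    "A100 40GB": (120, 1.50),
    "A100 80GB": (130, 2.00),
    "H100 80GB": (200, 3.50),
}


def cost_per_million_tokens(tokens_per_sec: float, cost_per_hour: float) -> float:
    """USD per one million generated tokens at a sustained rate."""
    tokens_per_hour = tokens_per_sec * 3600
    return cost_per_hour / tokens_per_hour * 1_000_000


for gpu, (tps, price) in benchmarks.items():
    print(f"{gpu:<10} ${cost_per_million_tokens(tps, price):.2f} per 1M tokens")
```

On these numbers the RTX 4090 comes to roughly $1.23 per million tokens and the H100 to roughly $4.86, so the faster cards buy lower latency rather than lower cost per token.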

Conclusion

Achieving sub-100ms latency requires optimization at every layer.

Start with streaming: it has the biggest impact on perceived latency. Then profile and optimize the slowest components.

Deploy on GPUBrazil for low-latency GPU infrastructure optimized for real-time AI.