Real-Time AI Inference: Achieving Sub-100ms Latency

Why Latency Matters

In user-facing AI applications, every millisecond counts:

100ms: Feels instant
300ms: Noticeable delay
1 second: User frustration begins
3+ seconds: Users abandon

This guide covers techniques to achieve real-time inference for LLMs, vision models, and audio processing.

💡 Latency Breakdown

Total latency = Network + Queue + Processing + Response. Optimize all four components.

Understanding Latency Components

# Typical latency breakdown for LLM request
┌────────────────────────────────────────────────────────┐
│                    Total: 850ms                         │
├──────────┬──────────┬──────────────────┬───────────────┤
│ Network  │  Queue   │    Processing    │   Response    │
│  50ms    │  100ms   │      600ms       │    100ms      │
├──────────┴──────────┴──────────────────┴───────────────┤
│ Client→Server │ Wait │ Tokenize+Generate │ Server→Client │
└────────────────────────────────────────────────────────┘
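
One way to use this breakdown is to compute each component's share of the total and start with the largest. A minimal sketch using the example figures from the diagram above (the function name is illustrative, not from any library):

```python
# Example figures from the breakdown above, in milliseconds
breakdown_ms = {"network": 50, "queue": 100, "processing": 600, "response": 100}

def latency_shares(components):
    """Return (total_ms, {name: fraction of total}) for a latency breakdown."""
    total = sum(components.values())
    return total, {name: ms / total for name, ms in components.items()}

total_ms, shares = latency_shares(breakdown_ms)
bottleneck = max(shares, key=shares.get)

print(f"Total: {total_ms}ms")
for name, share in shares.items():
    print(f"{name:<12}{share:6.1%}")
print(f"Optimize first: {bottleneck}")
```

In this example processing dominates, so model-level and serving-level optimizations would likely pay off before network tuning.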

Optimization 1: Streaming Responses

Instead of waiting for full generation, stream tokens as they're produced:

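# Server-side streaming with FastAPI
# vLLM's offline LLM.generate() is synchronous and has no stream flag;
# token streaming goes through AsyncLLMEngine (check the API for your vLLM version)
import json
import uuid

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

app = FastAPI()
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="meta-llama/Llama-3.1-8B-Instruct")
)

class GenerateRequest(BaseModel):
    prompt: str  # Read from the JSON body, matching the client below

@app.post("/generate/stream")
async def generate_stream(request: GenerateRequest):
    async def generate():
        sampling_params = SamplingParams(
            max_tokens=512,
            temperature=0.7
        )

        # Each output carries the cumulative text so far,
        # so send only the newly generated suffix
        sent = 0
        async for output in engine.generate(
            request.prompt,
            sampling_params,
            request_id=str(uuid.uuid4())
        ):
            text = output.outputs[0].text
            delta, sent = text[sent:], len(text)
            if delta:
                # JSON-encode so newlines in the text don't break SSE framing
                yield f"data: {json.dumps(delta)}\n\n"

        yield "data: [DONE]\n\n"

    return StreamingResponse(
        generate(),
        media_type="text/event-stream"
    )

# Client-side consumption
import aiohttp

async def consume_stream():
    async with aiohttp.ClientSession() as session:
        async with session.post(
            "http://api/generate/stream",
            json={"prompt": "Hello"}
        ) as response:
            async for line in response.content:
                if line.startswith(b"data: "):
                    # Strip the trailing newline before comparing
                    payload = line[6:].decode().strip()
                    if payload == "[DONE]":
                        break
                    print(json.loads(payload), end="", flush=True)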

Time to First Token (TTFT)

Streaming does not shorten total generation time, but users see the first output within 50-200ms instead of waiting seconds for the complete response:

Metric	Without Streaming	With Streaming
Time to first visible output	3-5 seconds	50-200ms
Perceived responsiveness	Slow	Instant
User can cancel early	No	Yes
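
The gap between the two columns follows from a simple approximation: total time is roughly TTFT + (tokens - 1) x time per output token. A rough sketch with purely illustrative numbers (not measurements from this guide):

```python
def total_generation_ms(ttft_ms: float, tpot_ms: float, num_tokens: int) -> float:
    """Approximate time until the last token arrives."""
    return ttft_ms + max(num_tokens - 1, 0) * tpot_ms

# Illustrative values only: 100ms TTFT, 10ms per token, 300-token answer
ttft_ms, tpot_ms, num_tokens = 100.0, 10.0, 300
full_ms = total_generation_ms(ttft_ms, tpot_ms, num_tokens)

print(f"Without streaming, first visible output: {full_ms:.0f}ms")
print(f"With streaming, first visible output:    {ttft_ms:.0f}ms")
```

Without streaming the user waits for the whole response; with streaming the wait is only the TTFT, while the total stays the same.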

Optimization 2: Speculative Decoding

Use a smaller draft model to propose several tokens, then verify them with the large model in a single forward pass:

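# Speculative decoding in vLLM
# Argument names follow older vLLM releases; newer versions take a
# speculative_config dict instead, so check the docs for your version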
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_model="meta-llama/Llama-3.1-8B-Instruct",
    num_speculative_tokens=5,  # Draft 5 tokens at a time
    use_v2_block_manager=True
)

# Typically 1.5-2x faster decoding with the same output quality;
# the gain depends on how often the draft's tokens are accepted
outputs = llm.generate(
    ["Explain quantum computing"],
    SamplingParams(max_tokens=256)
)
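
To make the draft-then-verify loop concrete, here is a toy sketch using greedy verification. The two "models" are hypothetical stand-in functions over integer tokens, not real LLMs, and production systems generally use a sampling-based acceptance rule rather than exact matching:

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]

def greedy_decode(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    seq = list(prompt)
    for _ in range(max_new):
        seq.append(model(seq))
    return seq

def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 5) -> List[int]:
    """Draft k tokens, keep the prefix the target agrees with."""
    seq = list(prompt)
    goal = len(prompt) + max_new
    while len(seq) < goal:
        # 1) The draft model proposes up to k tokens autoregressively
        proposal: List[int] = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft(seq + proposal))
        # 2) The target checks every position (a single batched pass on a GPU)
        base = list(seq)
        for i, tok in enumerate(proposal):
            expected = target(base + proposal[:i])
            seq.append(expected)  # equals tok when the draft was right
            if expected != tok:
                break  # first mismatch: discard the rest of the draft
    return seq

# Toy stand-ins: the draft agrees with the target except after token 5
def target(seq: List[int]) -> int:
    return (seq[-1] * 3 + 1) % 7

def draft(seq: List[int]) -> int:
    return 0 if seq[-1] == 5 else target(seq)

result = speculative_decode(target, draft, [1], max_new=20)
print(result == greedy_decode(target, [1], max_new=20))
```

Because every kept token is the one the target would have chosen, the output matches target-only greedy decoding; the speedup comes from verifying several positions per target pass.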

Optimization 3: KV Cache Optimization

Prefix Caching

Cache KV values for common prefixes (system prompts):

# vLLM automatic prefix caching
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_prefix_caching=True  # Cache common prefixes
)

# System prompt is cached after first request
system_prompt = """You are a helpful assistant. You provide clear,
accurate answers. You always cite sources when possible."""

# Example queries; replace with real user input
user_queries = ["What is a KV cache?", "Why does batching help?"]
sampling_params = SamplingParams(max_tokens=128)

# Subsequent requests with same prefix are faster
for user_query in user_queries:
    prompt = f"{system_prompt}\n\nUser: {user_query}\nAssistant:"
    outputs = llm.generate([prompt], sampling_params)
    # First request: ~200ms
    # Subsequent: ~50ms (prefix cached)
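
The idea behind prefix caching can be sketched with a toy block cache: token sequences are split into fixed-size blocks, each block is keyed by the full prefix up to its end, and a later request recomputes only the blocks it does not share. This is a simplified illustration with a hypothetical class and block size, not vLLM's actual implementation:

```python
from typing import List, Set, Tuple

BLOCK_SIZE = 4  # tokens per cache block (illustrative)

class ToyPrefixCache:
    def __init__(self) -> None:
        self.blocks: Set[Tuple[int, ...]] = set()

    def process(self, tokens: List[int]) -> Tuple[int, int]:
        """Return (cached_tokens, computed_tokens) for one request."""
        cached = computed = 0
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
            key = tuple(tokens[:end])  # a block is reusable only if its whole prefix matches
            if key in self.blocks:
                cached += BLOCK_SIZE
            else:
                self.blocks.add(key)
                computed += BLOCK_SIZE
        computed += len(tokens) % BLOCK_SIZE  # partial last block is always computed
        return cached, computed

cache = ToyPrefixCache()
system_tokens = list(range(8))  # stand-in for a tokenized system prompt
first = cache.process(system_tokens + [100, 101, 102])
second = cache.process(system_tokens + [200, 201])
print(first, second)
```

The second request reuses the eight system-prompt tokens and computes only its own suffix, which is why a long shared prefix benefits most.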

PagedAttention

# vLLM uses PagedAttention by default
# It stores the KV cache in fixed-size blocks, which reduces memory fragmentation
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.9,  # Let vLLM use up to 90% of VRAM
    max_num_seqs=256,  # Max sequences batched per scheduling step
)
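
To see why KV cache memory limits concurrency, it helps to estimate the cache size per token. The sketch below assumes the published Llama-3.1-8B configuration (32 layers, 8 KV heads, head dimension 128) with FP16 values, and a hypothetical 10 GiB left over for the cache; verify the numbers against your model's config:

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

# Assumed Llama-3.1-8B shape, FP16 cache
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

budget_bytes = 10 * 1024**3  # hypothetical 10 GiB available for the KV cache
max_cached_tokens = budget_bytes // per_token

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"Tokens that fit in the budget: {max_cached_tokens}")
```

That token budget is shared by all concurrent sequences, so longer contexts mean fewer requests can be served at once.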

Optimization 4: Model Optimization

Quantization

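# 4-bit AWQ weight quantization (INT4 weights, FP16 activations)
from vllm import LLM

llm = LLM(
    # Must point to a checkpoint already quantized with AWQ (placeholder path)
    model="path/to/Llama-3.1-8B-Instruct-AWQ",
    quantization="awq",  # or "gptq" for a GPTQ-quantized checkpoint
    dtype="half"
)

# Latency comparison (A100, illustrative; measure on your own workload):
# FP16:  45ms/token
# INT8:  25ms/token
# INT4:  18ms/token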
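
Quantization helps latency largely because token-by-token decoding tends to be limited by how fast the weights can be read from GPU memory, so fewer bytes per weight means less data to move per token. A back-of-the-envelope sketch, assuming a model with roughly 8 billion parameters and ignoring quantization metadata such as scales:

```python
def weight_gigabytes(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight memory in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

NUM_PARAMS = 8e9  # assumed parameter count for an 8B model

for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: ~{weight_gigabytes(NUM_PARAMS, bits):.0f} GB of weights")
```

Real speedups are usually smaller than the memory ratio suggests, since dequantization and attention over the KV cache add their own cost.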

TensorRT-LLM

# Maximum performance with TensorRT-LLM
# See our TensorRT guide for details

# Build optimized engine
trtllm-build \
    --checkpoint_dir ./checkpoint \
    --output_dir ./engine \
    --gemm_plugin float16 \
    --max_batch_size 64 \
    --paged_kv_cache enable

# 2-3x faster than vanilla PyTorch

Optimization 5: Infrastructure

Geographic Proximity

# Deploy inference close to users
# 
# User Location → Nearest Inference Region
# US East       → Virginia
# US West       → California  
# Europe        → Frankfurt
# Asia          → Singapore

# Network latency savings: 50-200ms per request
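
Rather than hard-coding the mapping, a client or router can pick the region with the lowest measured round-trip time. A minimal sketch; the region names and latency figures are hypothetical, and in practice the measurements would come from your own health checks:

```python
from typing import Dict

def nearest_region(rtt_ms_by_region: Dict[str, float]) -> str:
    """Pick the region with the lowest measured round-trip time."""
    return min(rtt_ms_by_region, key=rtt_ms_by_region.get)

# Hypothetical measurements from a client in Europe
measured_rtt_ms = {
    "virginia": 95.0,
    "california": 150.0,
    "frankfurt": 18.0,
    "singapore": 170.0,
}

chosen = nearest_region(measured_rtt_ms)
print(f"Route requests to: {chosen}")
```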

Connection Pooling

# Reuse connections to avoid handshake overhead
import aiohttp

async def create_session() -> aiohttp.ClientSession:
    # Create the session inside a coroutine so it binds to the running event loop
    connector = aiohttp.TCPConnector(
        limit=100,
        keepalive_timeout=30,
        enable_cleanup_closed=True
    )
    return aiohttp.ClientSession(connector=connector)

# Create once at startup, reuse for all requests, and close on shutdown
# Saves ~50ms per request (TCP + TLS handshake)

GPU Warmup

# Warm up GPU before serving traffic
from vllm import SamplingParams

def warmup_model(llm, num_warmup=10):
    """Run warmup requests so one-time initialization (kernel loading,
    memory allocation) happens before real traffic arrives"""
    warmup_prompt = "Hello, how are you?"

    for _ in range(num_warmup):
        llm.generate([warmup_prompt], SamplingParams(max_tokens=10))

    print("Model warmed up!")

# First request after cold start: ~500ms
# After warmup: ~50ms

Optimization 6: Batching Strategies

Continuous Batching

# vLLM handles this automatically
# New requests join batch as slots become available

┌─────────────────────────────────────────┐
│ Time →                                   │
├─────────────────────────────────────────┤
│ Req 1: ████████████████                 │
│ Req 2:     ████████████████████         │
│ Req 3:         ████████████             │
│ Req 4:             ████████████████████ │
└─────────────────────────────────────────┘

# Requests complete at different times
# No waiting for batch to fill

Dynamic Batching with Timeout

# Triton dynamic batching config
dynamic_batching {
    preferred_batch_size: [ 4, 8, 16 ]
    max_queue_delay_microseconds: 100000  # 100ms max wait
}

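# Trade-off:
# - Lower delay = faster response, lower throughput
# - Higher delay = slower response, higher throughput
# A 100ms queue delay uses up a sub-100ms budget on its own;
# for real-time serving, keep the delay to a few milliseconds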
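
The batching-with-timeout policy itself is simple to express. Below is a minimal asyncio sketch of the idea: wait for the first request, then keep collecting until the batch is full or the delay budget runs out. It illustrates the policy only and is not how Triton implements it; the function names and parameters are illustrative:

```python
import asyncio
from typing import Any, List, Tuple

async def collect_batch(queue: asyncio.Queue, max_batch: int,
                        max_delay_s: float) -> List[Any]:
    """Wait for one item, then gather more until the batch is full
    or the delay budget has elapsed."""
    batch = [await queue.get()]
    loop = asyncio.get_running_loop()
    deadline = loop.time() + max_delay_s
    while len(batch) < max_batch:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break  # delay budget spent: run with what we have
    return batch

async def demo() -> Tuple[List[int], List[int]]:
    queue: asyncio.Queue = asyncio.Queue()
    for i in range(5):
        queue.put_nowait(i)
    # A full batch returns immediately; the leftover waits out the 10ms delay
    first = await collect_batch(queue, max_batch=4, max_delay_s=0.01)
    second = await collect_batch(queue, max_batch=4, max_delay_s=0.01)
    return first, second

if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The second batch shows the cost of the policy: a lone request pays the full queue delay before it runs.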

Profiling and Measurement

import time
from dataclasses import dataclass
from typing import Iterable

@dataclass
class LatencyMetrics:
    total_ms: float
    ttft_ms: float  # Time to first token
    tpot_ms: float  # Time per output token (after the first)
    tokens_generated: int

def measure_inference(token_stream: Iterable[str]) -> LatencyMetrics:
    """Time any iterator that yields tokens as they are generated.

    vLLM's offline LLM.generate() returns only finished outputs, so pass
    a streaming source here (for example, a client reading the SSE endpoint).
    """
    start = time.perf_counter()
    first_token_time = None
    tokens = 0

    for _ in token_stream:
        if first_token_time is None:
            first_token_time = time.perf_counter()
        tokens += 1

    end = time.perf_counter()
    if first_token_time is None:
        raise ValueError("stream produced no tokens")

    total_ms = (end - start) * 1000
    ttft_ms = (first_token_time - start) * 1000
    # TPOT covers the tokens after the first; avoid dividing by zero
    tpot_ms = (end - first_token_time) * 1000 / max(tokens - 1, 1)

    return LatencyMetrics(
        total_ms=total_ms,
        ttft_ms=ttft_ms,
        tpot_ms=tpot_ms,
        tokens_generated=tokens
    )

# Profile your setup; stream_tokens() is a placeholder for your streaming client
metrics = measure_inference(stream_tokens("Explain AI in one paragraph"))
print(f"TTFT: {metrics.ttft_ms:.0f}ms")
print(f"TPOT: {metrics.tpot_ms:.1f}ms")
print(f"Total: {metrics.total_ms:.0f}ms")

Real-Time Vision Models

import time

import torch
from torchvision import models

# Example model; substitute your own
model = models.resnet50(weights=None).eval()

# Optimize for inference
model = torch.jit.script(model)  # Compile to TorchScript
model = model.cuda().half()      # FP16

input_tensor = torch.randn(1, 3, 224, 224).cuda().half()

# Warm up
with torch.no_grad():
    for _ in range(10):
        _ = model(input_tensor)

# Benchmark (synchronize so queued GPU work is included in the timing)
torch.cuda.synchronize()
start = time.perf_counter()

with torch.no_grad():
    for _ in range(100):
        output = model(input_tensor)

torch.cuda.synchronize()
end = time.perf_counter()

avg_ms = (end - start) / 100 * 1000
print(f"Average latency: {avg_ms:.2f}ms")

# Further optimizations:
# - TensorRT: 2-3x faster
# - INT8 quantization: 2x faster
# - Smaller input resolution: speedup roughly proportional to pixel count

Audio Processing Latency

# Chunked Whisper transcription (near-real-time, not true streaming)
import numpy as np
from faster_whisper import WhisperModel

model = WhisperModel(
    "large-v3",
    device="cuda",
    compute_type="float16"
)

def transcribe_chunk(audio_chunk: np.ndarray) -> str:
    """Process audio in chunks for lower latency.

    Expects 16 kHz mono float32 samples.
    """
    segments, _ = model.transcribe(
        audio_chunk,
        vad_filter=True,
        vad_parameters=dict(min_silence_duration_ms=500)
    )
    # Segment text usually starts with a space, so join without a separator
    return "".join(s.text for s in segments).strip()

# Chunk-based processing:
# - Process 5-second chunks
# - Overlap by 0.5 seconds
# - Latency: ~1-2 seconds behind real-time
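
The chunking scheme described above (5-second chunks with a 0.5-second overlap) can be sketched as a helper that returns sample index ranges. The function name is illustrative and a 16 kHz sample rate is assumed; each range can then be sliced from the audio array and passed to the transcription function:

```python
from typing import List, Tuple

def chunk_bounds(num_samples: int, sample_rate: int = 16000,
                 chunk_s: float = 5.0, overlap_s: float = 0.5) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices for overlapping chunks."""
    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)  # advance less than a full chunk
    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the last chunk reached the end of the audio
        start += step
    return bounds

# 12 seconds of audio at 16 kHz
bounds = chunk_bounds(12 * 16000)
for start, end in bounds:
    print(f"{start / 16000:.1f}s - {end / 16000:.1f}s")
```

The overlap means words cut at a chunk boundary appear whole in the next chunk, at the cost of having to de-duplicate text where chunks meet.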

Latency Checklist

☐ Enable response streaming
☐ Use quantized models (AWQ/GPTQ)
☐ Enable prefix caching
☐ Deploy geographically close to users
☐ Use connection pooling
☐ Warm up models before serving
☐ Profile and measure regularly
☐ Consider speculative decoding for LLMs
☐ Use TensorRT for vision models

Benchmarks by GPU

GPU	LLaMA 8B TTFT	Tokens/sec	Cost/hr
RTX 4090	~80ms	~90	$0.40
A100 40GB	~60ms	~120	$1.50
A100 80GB	~55ms	~130	$2.00
H100 80GB	~35ms	~200	$3.50

Measured with vLLM, FP16, single GPU
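
The table can also be read in terms of cost per generated token. The sketch below converts the figures above into cost per million tokens, assuming the listed tokens/sec is sustained for a single request stream and the GPU is fully utilized; batching many concurrent requests would lower the real cost considerably:

```python
def cost_per_million_tokens(cost_per_hour: float, tokens_per_sec: float) -> float:
    """Dollars per one million generated tokens at a sustained rate."""
    tokens_per_hour = tokens_per_sec * 3600
    return cost_per_hour / tokens_per_hour * 1_000_000

# (tokens/sec, cost/hr) taken from the table above
gpus = {
    "RTX 4090": (90, 0.40),
    "A100 40GB": (120, 1.50),
    "A100 80GB": (130, 2.00),
    "H100 80GB": (200, 3.50),
}

costs = {
    name: cost_per_million_tokens(cost, tps)
    for name, (tps, cost) in gpus.items()
}
for name, dollars in costs.items():
    print(f"{name}: ${dollars:.2f} per 1M tokens")
```

By this measure the faster GPUs buy latency rather than cheaper tokens, which is the relevant trade-off for real-time workloads.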

Conclusion

Achieving sub-100ms latency requires optimization at every layer:

Model level: Quantization, speculative decoding
Serving level: Streaming, caching, batching
Infrastructure: Geographic placement, connection reuse

Start with streaming, which has the biggest impact on perceived latency. Then profile and optimize the slowest components.

Deploy on GPUBrazil for low-latency GPU infrastructure optimized for real-time AI.