Why Monitor ML Infrastructure?
ML workloads are expensive. Without proper monitoring, you're flying blind:
- GPU utilization: Are you paying for idle GPUs?
- Inference latency: Are users experiencing delays?
- Error rates: Are requests failing silently?
- Cost tracking: How much does each request cost?
💡 Stack Overview
Prometheus collects metrics, Grafana visualizes them, Alertmanager sends notifications. All open-source and battle-tested.
Architecture
┌─────────────────────────────────────────────────────────┐
│                      Your ML Stack                      │
├────────────┬────────────┬────────────┬──────────────────┤
│    vLLM    │   Triton   │  GPU Node  │   Training Job   │
│  /metrics  │  /metrics  │    DCGM    │  Custom metrics  │
└─────┬──────┴─────┬──────┴─────┬──────┴────────┬─────────┘
      │            │            │               │
      └────────────┴─────────┬──┴───────────────┘
                             │
                             ▼
                 ┌───────────────────────┐
                 │      Prometheus       │
                 │   (Scrape & Store)    │
                 └───────────┬───────────┘
                             │
         ┌───────────────────┼───────────────────┐
         │                   │                   │
         ▼                   ▼                   ▼
   ┌───────────┐     ┌───────────────┐    ┌─────────────┐
   │  Grafana  │     │ Alertmanager  │    │ Prometheus  │
   │ Dashboard │     │    Alerts     │    │   Queries   │
   └───────────┘     └───────────────┘    └─────────────┘
Setup: Docker Compose
# docker-compose.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      # rule_files paths resolve relative to the config file,
      # so the alert rules are mounted next to it
      - ./alerts:/etc/prometheus/alerts
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
    restart: always

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      # Change the default password before exposing Grafana
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    restart: always

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    restart: always

  # GPU metrics exporter
  dcgm-exporter:
    image: nvcr.io/nvidia/k8s/dcgm-exporter:latest
    runtime: nvidia
    ports:
      - "9400:9400"
    environment:
      - DCGM_EXPORTER_LISTEN=:9400
    restart: always

volumes:
  prometheus_data:
  grafana_data:
Prometheus Configuration
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - "alerts/*.yml"

scrape_configs:
  # GPU metrics from DCGM
  - job_name: 'dcgm'
    static_configs:
      - targets: ['dcgm-exporter:9400']
        labels:
          cluster: 'production'

  # vLLM inference metrics
  - job_name: 'vllm'
    static_configs:
      - targets: ['vllm-server:8000']
    metrics_path: /metrics

  # Triton inference metrics
  - job_name: 'triton'
    static_configs:
      - targets: ['triton-server:8002']
    metrics_path: /metrics

  # Node metrics (CPU, memory, disk)
  # Requires a node-exporter service, which the Compose file above does not define
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
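Once Prometheus is running, it helps to confirm that every scrape job is actually reachable. The sketch below is one way to do that, assuming the Prometheus HTTP API endpoint `/api/v1/targets` on `localhost:9090`; the helper names and the `SAMPLE` payload are illustrative and hand-written, not output from a real server.

```python
import json
import urllib.request


def unhealthy_targets(targets_response: dict) -> list[tuple[str, str, str]]:
    """Return (job, scrape_url, last_error) for each active target whose health is not 'up'."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    return [
        (t.get("labels", {}).get("job", "?"), t.get("scrapeUrl", "?"), t.get("lastError", ""))
        for t in active
        if t.get("health") != "up"
    ]


def fetch_targets(base_url: str = "http://localhost:9090") -> dict:
    """Fetch the live target list (needs a reachable Prometheus; not called below)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


# Hand-written sample shaped like the /api/v1/targets response
SAMPLE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "dcgm"}, "scrapeUrl": "http://dcgm-exporter:9400/metrics",
             "health": "up", "lastError": ""},
            {"labels": {"job": "triton"}, "scrapeUrl": "http://triton-server:8002/metrics",
             "health": "down", "lastError": "connection refused"},
        ]
    },
}

for job, url, error in unhealthy_targets(SAMPLE):
    print(f"{job}: {url} is not up ({error})")
```

Against a live server, `unhealthy_targets(fetch_targets())` would run the same check on real data.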
Essential GPU Metrics
DCGM Metrics
# Key GPU metrics from DCGM Exporter
# GPU Utilization (0-100%)
DCGM_FI_DEV_GPU_UTIL
# GPU Memory Used (MiB)
DCGM_FI_DEV_FB_USED
# GPU Memory Total (MiB), derived from free + used
DCGM_FI_DEV_FB_FREE + DCGM_FI_DEV_FB_USED
# GPU Temperature (Celsius)
DCGM_FI_DEV_GPU_TEMP
# Power Usage (Watts)
DCGM_FI_DEV_POWER_USAGE
# SM Clock Speed (MHz)
DCGM_FI_DEV_SM_CLOCK
# Tensor Core Utilization (H100/A100), reported as a ratio from 0 to 1
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE
Grafana Dashboard Queries
# GPU Utilization Gauge
avg(DCGM_FI_DEV_GPU_UTIL{gpu=~"$gpu"})
# GPU Memory Usage %
(DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) * 100
# GPU Temperature Heatmap
DCGM_FI_DEV_GPU_TEMP
# Power Consumption Over Time
sum(DCGM_FI_DEV_POWER_USAGE) by (instance)
# Cost Estimation (assuming $2/hr per GPU)
count(DCGM_FI_DEV_GPU_UTIL > 0) * 2 / 3600 # Per second cost
vLLM Metrics
# Key vLLM metrics
# Metric names vary between vLLM versions; confirm them against your server's /metrics output
# Requests in progress
vllm:num_requests_running
# Requests waiting in queue
vllm:num_requests_waiting
# End-to-end request latency histogram
vllm:e2e_request_latency_seconds_bucket
# Tokens generated per second
rate(vllm:generation_tokens_total[5m])
# Prompt tokens processed per second
rate(vllm:prompt_tokens_total[5m])
# KV Cache utilization
vllm:gpu_cache_usage_perc
# Request success rate
# Check that your vLLM version exposes both counters; some expose only vllm:request_success_total
sum(rate(vllm:request_success_total[5m])) /
sum(rate(vllm:request_total[5m]))
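Several of the queries above wrap a counter in `rate()`. As a rough illustration of what that computes, here is a simplified sketch: it sums the increases between consecutive samples, treats any drop as a counter reset, and divides by the elapsed time. Prometheus additionally extrapolates to the window boundaries, which this sketch leaves out, and the function name and sample values are made up for the example.

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Approximate per-second rate of a counter.

    samples: (unix_timestamp, counter_value) pairs sorted by time.
    Prometheus returns no value with fewer than two samples; this sketch returns 0.0.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur >= prev:
            increase += cur - prev
        else:
            # The counter dropped, so the process restarted and counted up again from zero
            increase += cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Scrapes every 15s; the counter resets between t=30 and t=45
generation_tokens = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(f"{simple_rate(generation_tokens):.3f} tokens/s")
```

The reset handling is why counters such as `vllm:generation_tokens_total` stay usable across server restarts, and why they should be queried through `rate()` instead of being read as raw values.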
Custom Application Metrics
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time

# Define metrics
REQUEST_COUNT = Counter(
    'inference_requests_total',
    'Total inference requests',
    ['model', 'status']
)
REQUEST_LATENCY = Histogram(
    'inference_latency_seconds',
    'Inference request latency',
    ['model'],
    buckets=[.05, .1, .25, .5, 1, 2.5, 5, 10]
)
TOKENS_GENERATED = Counter(
    'tokens_generated_total',
    'Total tokens generated',
    ['model']
)
ACTIVE_REQUESTS = Gauge(
    'active_inference_requests',
    'Currently processing requests',
    ['model']
)
# prometheus_client appends _total to counter names, so this series
# is exposed as inference_cost_dollars_total either way
COST_ACCUMULATED = Counter(
    'inference_cost_dollars_total',
    'Accumulated inference cost',
    ['model']
)

# Instrument your inference code
def inference(model_name: str, prompt: str):
    ACTIVE_REQUESTS.labels(model=model_name).inc()
    start = time.perf_counter()
    try:
        # run_model is a placeholder for your own inference call
        result = run_model(prompt)
        # Record metrics
        REQUEST_COUNT.labels(model=model_name, status='success').inc()
        TOKENS_GENERATED.labels(model=model_name).inc(result.token_count)
        # Estimate cost ($0.001 per 1K tokens)
        cost = result.token_count * 0.000001
        COST_ACCUMULATED.labels(model=model_name).inc(cost)
        return result
    except Exception:
        REQUEST_COUNT.labels(model=model_name, status='error').inc()
        raise
    finally:
        # Runs on success and on error, so failed requests are timed too
        REQUEST_LATENCY.labels(model=model_name).observe(time.perf_counter() - start)
        ACTIVE_REQUESTS.labels(model=model_name).dec()

# Start metrics server
start_http_server(8000)  # Prometheus scrapes :8000/metrics
Alerting Rules
# alerts/ml-alerts.yml
groups:
  - name: gpu-alerts
    rules:
      # GPU utilization too low (wasting money)
      - alert: GPUUnderutilized
        expr: avg(DCGM_FI_DEV_GPU_UTIL) < 20
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "GPU utilization below 20% for 30 minutes"
          description: "Consider scaling down or consolidating workloads"

      # GPU memory almost full
      - alert: GPUMemoryHigh
        expr: (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) > 0.95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "GPU memory above 95%"
          description: "Risk of OOM errors. Consider smaller batch size."

      # GPU temperature critical
      - alert: GPUTemperatureHigh
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "GPU temperature above 85°C"
          description: "Check cooling. Throttling may occur."

  - name: inference-alerts
    rules:
      # High latency
      - alert: InferenceLatencyHigh
        expr: histogram_quantile(0.99, rate(inference_latency_seconds_bucket[5m])) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P99 inference latency above 5 seconds"
          description: "Users may be experiencing slow responses"

      # High error rate
      - alert: InferenceErrorRateHigh
        expr: |
          sum(rate(inference_requests_total{status="error"}[5m])) /
          sum(rate(inference_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Inference error rate above 5%"
          description: "Check model health and logs"

      # Queue building up
      - alert: InferenceQueueBacklog
        expr: vllm:num_requests_waiting > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Inference queue has 50+ waiting requests"
          description: "Consider scaling up inference replicas"
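The latency alert relies on `histogram_quantile`, which estimates a percentile from cumulative bucket counts instead of reading it from raw observations. The sketch below shows the general idea (find the bucket containing the target rank, then interpolate linearly inside it) under simplifying assumptions. `estimate_quantile` and the bucket counts are invented for illustration; only the bucket bounds match the histogram defined earlier.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) sorted by upper_bound,
    ending with the (math.inf, total_count) bucket.
    """
    if not 0 < q < 1:
        raise ValueError("q must be between 0 and 1 (exclusive)")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls past the last finite bucket: report the highest finite bound
                return buckets[-2][0]
            # Assume observations are spread evenly inside the bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan


# Made-up cumulative counts for the buckets [.05, .1, .25, .5, 1, 2.5, 5, 10, +Inf]
BUCKETS = [(0.05, 10), (0.1, 40), (0.25, 120), (0.5, 300), (1, 600),
           (2.5, 900), (5, 980), (10, 998), (math.inf, 1000)]

p99 = estimate_quantile(0.99, BUCKETS)
print(f"estimated P99: {p99:.2f}s, alert threshold exceeded: {p99 > 5}")
```

Because the estimate is interpolated, its accuracy depends on bucket boundaries: with a 5s threshold, having bucket edges at 5 and 10 means any P99 in that range is only known to within that bucket.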
Alertmanager Configuration
# alertmanager.yml
global:
  slack_api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-notifications'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
    - match:
        severity: warning
      receiver: 'slack-notifications'

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#ml-alerts'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'your-pagerduty-key'
        severity: critical
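To make the routing above concrete: an alert is compared against the child routes in order, the first route whose labels all match wins, and anything unmatched falls back to the top-level receiver. The sketch below mimics that behavior for this specific config only. It is not Alertmanager's implementation and ignores features such as `continue`, nested routes, and regex matchers.

```python
# Child routes from the config above, in order: (labels to match, receiver)
ROUTES = [
    ({"severity": "critical"}, "pagerduty-critical"),
    ({"severity": "warning"}, "slack-notifications"),
]
DEFAULT_RECEIVER = "slack-notifications"


def pick_receiver(alert_labels: dict[str, str]) -> str:
    """Return the receiver of the first route whose match labels all equal the alert's labels."""
    for match, receiver in ROUTES:
        if all(alert_labels.get(name) == value for name, value in match.items()):
            return receiver
    return DEFAULT_RECEIVER


for labels in (
    {"alertname": "GPUMemoryHigh", "severity": "critical"},
    {"alertname": "GPUUnderutilized", "severity": "warning"},
    {"alertname": "Unlabeled"},
):
    print(labels["alertname"], "->", pick_receiver(labels))
```

One consequence worth noticing: an alert with no `severity` label still reaches Slack through the default receiver, so nothing is silently dropped.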
Grafana Dashboard
// grafana/provisioning/dashboards/ml-overview.json
{
  "title": "ML Infrastructure Overview",
  "panels": [
    {
      "title": "GPU Utilization",
      "type": "gauge",
      "targets": [{
        "expr": "avg(DCGM_FI_DEV_GPU_UTIL)",
        "legendFormat": "Utilization %"
      }],
      "fieldConfig": {
        "defaults": {
          "max": 100,
          "thresholds": {
            "steps": [
              {"color": "red", "value": 0},
              {"color": "yellow", "value": 50},
              {"color": "green", "value": 80}
            ]
          }
        }
      }
    },
    {
      "title": "Inference Latency P99",
      "type": "timeseries",
      "targets": [{
        "expr": "histogram_quantile(0.99, rate(inference_latency_seconds_bucket[5m]))",
        "legendFormat": "P99 Latency"
      }]
    },
    {
      "title": "Requests per Second",
      "type": "timeseries",
      "targets": [{
        "expr": "sum(rate(inference_requests_total[1m]))",
        "legendFormat": "RPS"
      }]
    },
    {
      "title": "Estimated Hourly Cost",
      "type": "stat",
      "targets": [{
        "expr": "sum(rate(inference_cost_dollars_total[1h])) * 3600",
        "legendFormat": "$/hour"
      }]
    }
  ]
}
Cost Monitoring
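# alerts/cost-rules.yml
# Track and alert on costs with recording rules
groups:
  - name: cost-rules
    rules:
      # Estimated hourly GPU cost
      # Assumes a gpu_type label added through relabeling; DCGM Exporter does not set it by default
      # "or vector(0)" keeps the sum defined when one GPU type has no active GPUs
      - record: gpu_hourly_cost
        expr: |
          (count(DCGM_FI_DEV_GPU_UTIL{gpu_type="a100"} > 0) or vector(0)) * 1.50 +
          (count(DCGM_FI_DEV_GPU_UTIL{gpu_type="h100"} > 0) or vector(0)) * 3.50

      # Cost per 1000 requests: (dollars/sec) / (requests/sec) * 1000
      - record: cost_per_1k_requests
        expr: |
          sum(rate(inference_cost_dollars_total[1h])) /
          sum(rate(inference_requests_total[1h])) * 1000

      # Daily cost projection
      - record: daily_cost_projection
        expr: gpu_hourly_cost * 24

      # Alert if daily cost exceeds budget ($500/day)
      - alert: DailyCostExceedsBudget
        expr: (sum(gpu_hourly_cost) * 24) > 500
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Projected daily cost exceeds $500"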
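The arithmetic behind these cost figures is easy to check by hand. The sketch below reproduces it in plain Python using the article's example prices ($1.50/hr for A100, $3.50/hr for H100) and the $500/day budget; the GPU counts and rates are made-up inputs, and the function names are illustrative.

```python
# Example hourly prices from the article; substitute your provider's rates
HOURLY_PRICE = {"a100": 1.50, "h100": 3.50}
DAILY_BUDGET = 500.0


def gpu_hourly_cost(active_gpus: dict[str, int]) -> float:
    """Sum of (active GPU count * hourly price) over GPU types."""
    return sum(HOURLY_PRICE[gpu_type] * count for gpu_type, count in active_gpus.items())


def cost_per_1k_requests(dollars_per_second: float, requests_per_second: float) -> float:
    """Dollars/sec divided by requests/sec gives dollars per request; scale to 1000 requests."""
    if requests_per_second == 0:
        return 0.0
    return dollars_per_second / requests_per_second * 1000


def daily_projection(hourly_cost: float) -> float:
    return hourly_cost * 24


hourly = gpu_hourly_cost({"a100": 4, "h100": 2})  # 4 * 1.50 + 2 * 3.50
daily = daily_projection(hourly)
print(f"hourly=${hourly:.2f} daily=${daily:.2f} over_budget={daily > DAILY_BUDGET}")
print(f"cost per 1k requests=${cost_per_1k_requests(0.002, 4.0):.2f}")
```

Note that both rates in `cost_per_1k_requests` are per-second values, so the time units cancel and no hourly conversion factor is needed.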
Training Job Monitoring
from prometheus_client import Gauge, Counter, start_http_server
import time

# Training metrics
TRAINING_LOSS = Gauge('training_loss', 'Current training loss', ['experiment'])
TRAINING_STEP = Counter('training_steps_total', 'Total training steps', ['experiment'])
LEARNING_RATE = Gauge('learning_rate', 'Current learning rate', ['experiment'])
GRADIENT_NORM = Gauge('gradient_norm', 'Gradient L2 norm', ['experiment'])
SAMPLES_PER_SEC = Gauge('training_samples_per_second', 'Training throughput', ['experiment'])

class PrometheusCallback:
    def __init__(self, experiment_name: str):
        self.experiment = experiment_name

    def on_step(self, loss, lr, grad_norm, samples_per_sec):
        TRAINING_LOSS.labels(experiment=self.experiment).set(loss)
        LEARNING_RATE.labels(experiment=self.experiment).set(lr)
        GRADIENT_NORM.labels(experiment=self.experiment).set(grad_norm)
        SAMPLES_PER_SEC.labels(experiment=self.experiment).set(samples_per_sec)
        TRAINING_STEP.labels(experiment=self.experiment).inc()

# Expose /metrics so Prometheus can scrape the training job (port is an example)
start_http_server(8001)

# Use in training loop
# dataloader, train_step, scheduler, model, batch_size and get_gradient_norm
# come from your own training code
callback = PrometheusCallback("llama-finetune-v1")
for step, batch in enumerate(dataloader):
    step_start = time.perf_counter()
    loss = train_step(batch)
    step_time = time.perf_counter() - step_start
    callback.on_step(
        loss=loss.item(),
        lr=scheduler.get_last_lr()[0],
        grad_norm=get_gradient_norm(model),
        samples_per_sec=batch_size / step_time
    )
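The training loop calls `get_gradient_norm` and divides batch size by step time without showing either computation. As a hedged illustration of what those two values mean, the sketch below computes a global L2 norm and a throughput figure using plain Python lists in place of framework tensors. The function names and numbers are hypothetical, and a real training job would read gradients from its framework's parameter objects.

```python
import math


def global_l2_norm(gradients: list[list[float]]) -> float:
    """L2 norm over all gradient entries of all parameters, treated as one long vector."""
    return math.sqrt(sum(g * g for param_grad in gradients for g in param_grad))


def samples_per_second(batch_size: int, step_seconds: float) -> float:
    """Training throughput for one step; 0.0 if the step time is not positive."""
    return batch_size / step_seconds if step_seconds > 0 else 0.0


# Two "parameters" with toy gradient values
grads = [[3.0, 0.0], [0.0, 4.0]]
print("gradient norm:", global_l2_norm(grads))
print("throughput:", samples_per_second(batch_size=32, step_seconds=0.5), "samples/s")
```

A sudden jump in the gradient norm gauge is often an early sign of training instability, which is the main reason to export it alongside the loss.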
Best Practices
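- Retention: Keep 30 days of high-resolution data in Prometheus; keeping a year of downsampled history needs a long-term store such as Thanos, since Prometheus does not downsample on its own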
- Labels: Use consistent labels (model, gpu_type, environment)
- Cardinality: Avoid high-cardinality labels (user IDs, request IDs)
- Alerts: Start with few critical alerts, expand gradually
- Dashboards: Create role-specific dashboards (ops, ML team, business)
⚠️ Metrics Overhead
Scraping too frequently or tracking too many metrics can impact inference performance. Start with 15s intervals and essential metrics.
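To see why high-cardinality labels and metric count matter, it helps to estimate how many time series a single metric produces. The sketch below multiplies the number of distinct values per label, then multiplies by the series each label combination emits. The figure of 11 for the latency histogram assumes 8 finite buckets, the `+Inf` bucket, `_sum`, and `_count`; client libraries may add more. All label counts are invented for the example.

```python
import math


def series_count(label_cardinalities: dict[str, int], series_per_label_set: int = 1) -> int:
    """Worst-case series for one metric: product of label cardinalities times series per combination."""
    return math.prod(label_cardinalities.values()) * series_per_label_set


def samples_per_second(series: int, scrape_interval_seconds: float) -> float:
    """Ingestion load: each series yields one sample per scrape."""
    return series / scrape_interval_seconds


HISTOGRAM_SERIES = 11  # 8 buckets + +Inf + _sum + _count (assumption, see above)

by_model = series_count({"model": 5}, HISTOGRAM_SERIES)
by_model_and_user = series_count({"model": 5, "user_id": 10_000}, HISTOGRAM_SERIES)

print(f"model label only: {by_model} series")
print(f"with a user_id label: {by_model_and_user} series, "
      f"{samples_per_second(by_model_and_user, 15):.0f} samples/s at a 15s interval")
```

Adding one per-user label turns 55 series into 550,000 for a single histogram, which is the kind of growth the cardinality guideline is meant to prevent.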
Conclusion
Effective ML monitoring requires tracking:
- GPU metrics: Utilization, memory, temperature
- Inference metrics: Latency, throughput, errors
- Business metrics: Cost per request, daily spend
Start with the basics (GPU utilization and request latency), then expand as you learn your system's behavior.
Monitor your ML workloads on GPUBrazil with built-in metrics endpoints and easy Prometheus integration.