Why Monitor ML Infrastructure?

ML workloads are expensive. Without proper monitoring, you're flying blind.

💡 Stack Overview

Prometheus collects metrics, Grafana visualizes them, Alertmanager sends notifications. All open-source and battle-tested.

Architecture

┌────────────────────────────────────────────────────────┐
│                     Your ML Stack                      │
├────────────┬────────────┬────────────┬─────────────────┤
│ vLLM       │ Triton     │ GPU Node   │ Training Job    │
│ /metrics   │ /metrics   │ DCGM       │ Custom metrics  │
└─────┬──────┴─────┬──────┴─────┬──────┴────────┬────────┘
      │            │            │               │
      └────────────┴──────┬─────┴───────────────┘
                          │
                          ▼
              ┌───────────────────────┐
              │      Prometheus       │
              │   (Scrape & Store)    │
              └───────────┬───────────┘
                          │
          ┌───────────────┼───────────────┐
          │               │               │
          ▼               ▼               ▼
    ┌──────────┐   ┌────────────┐   ┌────────────┐
    │ Grafana  │   │Alertmanager│   │ Prometheus │
    │Dashboard │   │   Alerts   │   │  Queries   │
    └──────────┘   └────────────┘   └────────────┘

Setup: Docker Compose

# docker-compose.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
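    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      # Mount the rule files that prometheus.yml loads via rule_files: "alerts/*.yml"
      - ./alerts:/etc/prometheus/alerts
      - prometheus_data:/prometheus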
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
    restart: always

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      # Default password for local testing only; change it before exposing Grafana
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    restart: always

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    restart: always

  # GPU metrics exporter
  dcgm-exporter:
    image: nvcr.io/nvidia/k8s/dcgm-exporter:latest
    runtime: nvidia
    ports:
      - "9400:9400"
    environment:
      - DCGM_EXPORTER_LISTEN=:9400
    restart: always

volumes:
  prometheus_data:
  grafana_data:

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - "alerts/*.yml"

scrape_configs:
  # GPU metrics from DCGM
  - job_name: 'dcgm'
    static_configs:
      - targets: ['dcgm-exporter:9400']
        labels:
          cluster: 'production'

  # vLLM inference metrics
  - job_name: 'vllm'
    static_configs:
      - targets: ['vllm-server:8000']
    metrics_path: /metrics

  # Triton inference metrics
  - job_name: 'triton'
    static_configs:
      - targets: ['triton-server:8002']
    metrics_path: /metrics

  # Node metrics (CPU, memory, disk). Requires a node-exporter service,
  # which the Compose file above does not define.
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

Essential GPU Metrics

DCGM Metrics

# Key GPU metrics from DCGM Exporter

# GPU Utilization (0-100%)
DCGM_FI_DEV_GPU_UTIL

# GPU Memory Used (MiB)
DCGM_FI_DEV_FB_USED

# GPU Memory Total (MiB), derived as free + used
DCGM_FI_DEV_FB_FREE + DCGM_FI_DEV_FB_USED

# GPU Temperature (Celsius)
DCGM_FI_DEV_GPU_TEMP

# Power Usage (Watts)
DCGM_FI_DEV_POWER_USAGE

# SM Clock Speed (MHz)
DCGM_FI_DEV_SM_CLOCK

# Tensor Core Utilization (H100/A100)
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE

Grafana Dashboard Queries

# GPU Utilization Gauge
avg(DCGM_FI_DEV_GPU_UTIL{gpu=~"$gpu"})

# GPU Memory Usage %
(DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) * 100

# GPU Temperature Heatmap
DCGM_FI_DEV_GPU_TEMP

# Power Consumption Over Time
sum(DCGM_FI_DEV_POWER_USAGE) by (instance)

# Cost Estimation (assuming $2/hr per GPU)
count(DCGM_FI_DEV_GPU_UTIL > 0) * 2 / 3600 # Per second cost

vLLM Metrics

# Key vLLM metrics

# Requests in progress
vllm:num_requests_running

# Requests waiting in queue
vllm:num_requests_waiting

# End-to-end request latency histogram
# (metric names vary by vLLM version; confirm against your server's /metrics)
vllm:e2e_request_latency_seconds_bucket

# Tokens generated per second
rate(vllm:generation_tokens_total[5m])

# Prompt tokens processed per second
rate(vllm:prompt_tokens_total[5m])

# KV Cache utilization
vllm:gpu_cache_usage_perc

# Request success rate
# (vllm:request_total may not be exported by your vLLM version; confirm against /metrics)
sum(rate(vllm:request_success_total[5m])) /
sum(rate(vllm:request_total[5m]))
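Several of the queries above wrap a counter in rate(...[5m]). As a rough illustration of what that computes, here is a minimal Python sketch. The names simple_rate and success_ratio are hypothetical, and the sketch is a simplification: it handles counter resets but skips the window-boundary extrapolation that Prometheus applies.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def simple_rate(samples: List[Sample]) -> float:
    """Per-second increase of a counter across the given samples.

    Simplified model of rate(): counter resets are handled, but the
    boundary extrapolation done by Prometheus is not reproduced.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the process restarted and the counter began again at 0,
        # so the whole current value counts as new increase.
        increase += cur if cur < prev else cur - prev
    return increase / elapsed


def success_ratio(success: List[Sample], total: List[Sample]) -> float:
    """Ratio of two rates, like the success-rate query above."""
    return simple_rate(success) / simple_rate(total)
```

For example, a token counter that moves from 100 to 160 over 30 seconds gives simple_rate of 2.0 tokens per second.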

Monitor Your GPU Infrastructure

GPUBrazil instances come with monitoring endpoints pre-configured. Just connect your Prometheus.

Custom Application Metrics

from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time

# Define metrics
REQUEST_COUNT = Counter(
    'inference_requests_total',
    'Total inference requests',
    ['model', 'status']
)

REQUEST_LATENCY = Histogram(
    'inference_latency_seconds',
    'Inference request latency',
    ['model'],
    buckets=[.05, .1, .25, .5, 1, 2.5, 5, 10]
)

TOKENS_GENERATED = Counter(
    'tokens_generated_total',
    'Total tokens generated',
    ['model']
)

ACTIVE_REQUESTS = Gauge(
    'active_inference_requests',
    'Currently processing requests',
    ['model']
)

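# prometheus_client appends "_total" to counter names, so this one is
# exposed (and must be queried) as inference_cost_dollars_total
COST_ACCUMULATED = Counter(
    'inference_cost_dollars',
    'Accumulated inference cost',
    ['model']
)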

# Instrument your inference code
def inference(model_name: str, prompt: str):
    ACTIVE_REQUESTS.labels(model=model_name).inc()
    
    start = time.time()
    try:
        result = run_model(prompt)
        
        # Record metrics
        REQUEST_COUNT.labels(model=model_name, status='success').inc()
        TOKENS_GENERATED.labels(model=model_name).inc(result.token_count)
        
        # Estimate cost ($0.001 per 1K tokens)
        cost = result.token_count * 0.000001
        COST_ACCUMULATED.labels(model=model_name).inc(cost)
        
        return result
    except Exception:
        REQUEST_COUNT.labels(model=model_name, status='error').inc()
        raise
    finally:
        REQUEST_LATENCY.labels(model=model_name).observe(time.time() - start)
        ACTIVE_REQUESTS.labels(model=model_name).dec()

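# Start the metrics server once at process startup. Pick a port that is free
# on the host (vLLM's API server also defaults to 8000) and add it to scrape_configs.
start_http_server(8000)  # Prometheus scrapes :8000/metrics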
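The Histogram above stores observations in cumulative buckets: each le bucket counts every observation less than or equal to its boundary. The following sketch, using only the standard library, shows how a handful of latencies would be counted against the same bucket boundaries. The function name cumulative_bucket_counts is hypothetical, and the sketch mimics the bookkeeping rather than the client library's actual implementation.

```python
from bisect import bisect_left
from typing import Dict, Iterable, List

# Bucket boundaries copied from the REQUEST_LATENCY histogram above
BUCKETS: List[float] = [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]


def cumulative_bucket_counts(latencies: Iterable[float]) -> Dict[str, int]:
    """Return {le: count}, where each count includes all observations <= le."""
    per_bucket = [0] * (len(BUCKETS) + 1)  # the final slot is the +Inf bucket
    for value in latencies:
        # bisect_left returns the index of the first boundary >= value,
        # which is the smallest bucket the observation fits into.
        per_bucket[bisect_left(BUCKETS, value)] += 1

    counts: Dict[str, int] = {}
    running = 0
    for bound, n in zip([*BUCKETS, float("inf")], per_bucket):
        running += n  # make the counts cumulative
        counts["+Inf" if bound == float("inf") else str(bound)] = running
    return counts
```

Because the largest finite boundary here is 10 seconds, any slower request lands only in +Inf, which limits how precisely tail latency can be estimated beyond that point.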

Alerting Rules

# alerts/ml-alerts.yml
groups:
  - name: gpu-alerts
    rules:
      # GPU utilization too low (wasting money)
      - alert: GPUUnderutilized
        expr: avg(DCGM_FI_DEV_GPU_UTIL) < 20
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "GPU utilization below 20% for 30 minutes"
          description: "Consider scaling down or consolidating workloads"

      # GPU memory almost full
      - alert: GPUMemoryHigh
        expr: (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) > 0.95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "GPU memory above 95%"
          description: "Risk of OOM errors. Consider smaller batch size."

      # GPU temperature critical
      - alert: GPUTemperatureHigh
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "GPU temperature above 85°C"
          description: "Check cooling. Throttling may occur."

  - name: inference-alerts
    rules:
      # High latency
      - alert: InferenceLatencyHigh
        expr: histogram_quantile(0.99, rate(inference_latency_seconds_bucket[5m])) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P99 inference latency above 5 seconds"
          description: "Users may be experiencing slow responses"

      # High error rate
      - alert: InferenceErrorRateHigh
        expr: |
          sum(rate(inference_requests_total{status="error"}[5m])) /
          sum(rate(inference_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Inference error rate above 5%"
          description: "Check model health and logs"

      # Queue building up
      - alert: InferenceQueueBacklog
        expr: vllm:num_requests_waiting > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Inference queue has 50+ waiting requests"
          description: "Consider scaling up inference replicas"
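The InferenceLatencyHigh rule relies on histogram_quantile, which estimates a quantile by linear interpolation inside the bucket that contains the target rank. The sketch below is a simplified Python rendering of that idea, not the Prometheus source; it assumes classic cumulative buckets that include +Inf and a quantile in the range (0, 1].

```python
from typing import List, Tuple

Bucket = Tuple[float, float]  # (upper bound "le", cumulative count or rate)


def histogram_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or buckets[-1][0] != float("inf"):
        raise ValueError("need at least one finite bucket plus the +Inf bucket")

    total = buckets[-1][1]
    if total == 0:
        return float("nan")  # no observations in the window

    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]

    if upper == float("inf"):
        # The quantile lies beyond the last finite boundary; report that boundary.
        return buckets[-2][0]

    lower = buckets[index - 1][0] if index > 0 else 0.0
    below = buckets[index - 1][1] if index > 0 else 0.0
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)
```

One consequence worth noting: the result can never exceed the largest finite bucket boundary, so a threshold such as "> 5" only works if the histogram has boundaries above 5.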

Alertmanager Configuration

# alertmanager.yml
global:
  slack_api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-notifications'
  
  routes:
    # "matchers" replaces the deprecated "match" field in current Alertmanager releases
    - matchers:
        - severity="critical"
      receiver: 'pagerduty-critical'
    - matchers:
        - severity="warning"
      receiver: 'slack-notifications'

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#ml-alerts'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'your-pagerduty-key'
        severity: critical
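To make the routing above easier to reason about, here is a small Python sketch of how an alert's labels select a receiver: child routes are tried in order, the first match wins (Alertmanager's continue option defaults to false), and alerts that match no child fall back to the top-level receiver. The names CHILD_ROUTES and pick_receiver are hypothetical, and the sketch covers only exact label equality.

```python
from typing import Dict, List, Tuple

# (labels to match, receiver) pairs, in the order they appear under "routes:"
CHILD_ROUTES: List[Tuple[Dict[str, str], str]] = [
    ({"severity": "critical"}, "pagerduty-critical"),
    ({"severity": "warning"}, "slack-notifications"),
]
DEFAULT_RECEIVER = "slack-notifications"  # the top-level "receiver:"


def pick_receiver(alert_labels: Dict[str, str]) -> str:
    """Return the receiver that would handle an alert with these labels."""
    for match, receiver in CHILD_ROUTES:
        if all(alert_labels.get(name) == value for name, value in match.items()):
            return receiver  # first matching child route wins
    return DEFAULT_RECEIVER
```

With this configuration, GPUMemoryHigh (severity critical) pages through PagerDuty, while GPUUnderutilized (severity warning) goes to Slack.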

Grafana Dashboard

// grafana/provisioning/dashboards/ml-overview.json
{
  "title": "ML Infrastructure Overview",
  "panels": [
    {
      "title": "GPU Utilization",
      "type": "gauge",
      "targets": [{
        "expr": "avg(DCGM_FI_DEV_GPU_UTIL)",
        "legendFormat": "Utilization %"
      }],
      "fieldConfig": {
        "defaults": {
          "max": 100,
          "thresholds": {
            "steps": [
              {"color": "red", "value": 0},
              {"color": "yellow", "value": 50},
              {"color": "green", "value": 80}
            ]
          }
        }
      }
    },
    {
      "title": "Inference Latency P99",
      "type": "graph",
      "targets": [{
        "expr": "histogram_quantile(0.99, rate(inference_latency_seconds_bucket[5m]))",
        "legendFormat": "P99 Latency"
      }]
    },
    {
      "title": "Requests per Second",
      "type": "graph", 
      "targets": [{
        "expr": "sum(rate(inference_requests_total[1m]))",
        "legendFormat": "RPS"
      }]
    },
    {
      "title": "Estimated Hourly Cost",
      "type": "stat",
      "targets": [{
        "expr": "sum(rate(inference_cost_dollars_total[1h])) * 3600",
        "legendFormat": "$/hour"
      }]
    }
  ]
}
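Hand-written dashboard JSON is easy to break, and strict JSON does not allow comments, so the path line at the top of the listing is a label rather than part of the file. One option is to generate the file from code. The sketch below is a minimal example using only the standard library; make_panel, build_dashboard, and render_dashboard are hypothetical helper names, and only two of the panels are reproduced.

```python
import json
from typing import Any, Dict, List


def make_panel(title: str, panel_type: str, expr: str, legend: str) -> Dict[str, Any]:
    """Build one panel dict with a single Prometheus query target."""
    return {
        "title": title,
        "type": panel_type,
        "targets": [{"expr": expr, "legendFormat": legend}],
    }


def build_dashboard() -> Dict[str, Any]:
    """Assemble the dashboard structure used in the listing above."""
    panels: List[Dict[str, Any]] = [
        make_panel(
            "Inference Latency P99",
            "graph",
            "histogram_quantile(0.99, rate(inference_latency_seconds_bucket[5m]))",
            "P99 Latency",
        ),
        make_panel(
            "Requests per Second",
            "graph",
            "sum(rate(inference_requests_total[1m]))",
            "RPS",
        ),
    ]
    return {"title": "ML Infrastructure Overview", "panels": panels}


def render_dashboard() -> str:
    """Serialize to strict JSON, ready to write into the provisioning directory."""
    return json.dumps(build_dashboard(), indent=2)
```

The returned string can be written to a file under grafana/provisioning/dashboards/.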

Cost Monitoring

# Track and alert on costs (Prometheus rules file).
# Assumes a custom gpu_type label on the DCGM series (for example, added
# through relabeling) and the hourly prices shown below.
groups:
  - name: cost-rules
    rules:
      # Estimated hourly GPU cost
      - record: gpu_hourly_cost
        expr: |
          (count(DCGM_FI_DEV_GPU_UTIL{gpu_type="a100"} > 0) or vector(0)) * 1.50 +
          (count(DCGM_FI_DEV_GPU_UTIL{gpu_type="h100"} > 0) or vector(0)) * 3.50

      # Cost per 1000 requests (dollars per second / requests per second * 1000)
      - record: cost_per_1k_requests
        expr: |
          sum(rate(inference_cost_dollars_total[1h])) /
          sum(rate(inference_requests_total[1h])) * 1000

      # Daily cost projection
      - record: daily_cost_projection
        expr: gpu_hourly_cost * 24

      # Alert if daily cost exceeds budget
      - alert: DailyCostExceedsBudget
        expr: gpu_hourly_cost * 24 > 500  # $500/day budget
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Projected daily cost exceeds $500"
Training Job Monitoring

import time

import torch
from prometheus_client import Counter, Gauge, start_http_server

# Training metrics
TRAINING_LOSS = Gauge('training_loss', 'Current training loss', ['experiment'])
TRAINING_STEP = Counter('training_steps_total', 'Total training steps', ['experiment'])
LEARNING_RATE = Gauge('learning_rate', 'Current learning rate', ['experiment'])
GRADIENT_NORM = Gauge('gradient_norm', 'Gradient L2 norm', ['experiment'])
SAMPLES_PER_SEC = Gauge('training_samples_per_second', 'Training throughput', ['experiment'])

class PrometheusCallback:
    def __init__(self, experiment_name: str):
        self.experiment = experiment_name

    def on_step(self, loss, lr, grad_norm, samples_per_sec):
        TRAINING_LOSS.labels(experiment=self.experiment).set(loss)
        LEARNING_RATE.labels(experiment=self.experiment).set(lr)
        GRADIENT_NORM.labels(experiment=self.experiment).set(grad_norm)
        SAMPLES_PER_SEC.labels(experiment=self.experiment).set(samples_per_sec)
        TRAINING_STEP.labels(experiment=self.experiment).inc()

def get_gradient_norm(model: torch.nn.Module) -> float:
    """Global L2 norm over all parameter gradients (0.0 if none are set)."""
    norms = [p.grad.detach().norm(2) for p in model.parameters() if p.grad is not None]
    return torch.stack(norms).norm(2).item() if norms else 0.0

# Expose /metrics for the training process (port 8001 is an arbitrary choice)
start_http_server(8001)

# Use in training loop. dataloader, train_step, scheduler, model, and
# batch_size come from your own training script.
callback = PrometheusCallback("llama-finetune-v1")

for step, batch in enumerate(dataloader):
    step_start = time.perf_counter()
    loss = train_step(batch)
    step_time = time.perf_counter() - step_start

    callback.on_step(
        loss=loss.item(),
        lr=scheduler.get_last_lr()[0],
        grad_norm=get_gradient_norm(model),  # read before gradients are zeroed
        samples_per_sec=batch_size / step_time
    )

Best Practices

⚠️ Metrics Overhead

Scraping too frequently or tracking too many metrics can impact inference performance. Start with 15s intervals and essential metrics.
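A quick way to judge overhead is to estimate how many time series a metric creates and how many samples per second that implies at a given scrape interval. The sketch below is a back-of-the-envelope calculation with hypothetical function names; it assumes every label combination actually occurs, and it ignores extra series some client libraries may add (such as creation timestamps).

```python
from math import prod
from typing import Dict


def series_count(label_values: Dict[str, int], series_per_combination: int = 1) -> int:
    """Time series produced by one metric.

    label_values maps each label name to its number of distinct values.
    A histogram emits several series per label combination: one per bucket
    (including +Inf) plus _sum and _count.
    """
    return series_per_combination * prod(label_values.values())


def samples_per_second(total_series: int, scrape_interval_s: float) -> float:
    """Ingestion rate implied by scraping every series once per interval."""
    return total_series / scrape_interval_s


# Example: the latency histogram above has 8 finite buckets, so each model
# contributes 8 + 1 (+Inf) + 2 (_sum, _count) = 11 series.
HISTOGRAM_SERIES = 8 + 1 + 2
latency_series = series_count({"model": 5}, HISTOGRAM_SERIES)
```

With five models that is 55 series for a single histogram, or under four samples per second at a 15s interval. Adding a high-cardinality label such as a user or request ID multiplies that figure, which is usually where overhead comes from.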

Conclusion

Effective ML monitoring requires tracking GPU health, inference performance, training progress, and cost.

Start with the basics, GPU utilization and request latency, then expand as you learn your system's behavior.

Monitor your ML workloads on GPUBrazil with built-in metrics endpoints and easy Prometheus integration.