Why Kubernetes for ML?

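Kubernetes has become the de facto standard for deploying ML workloads in production: it schedules GPUs as a node resource, scales inference replicas on demand, and works with standard tooling for staged rollouts and monitoring.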

💡 This Guide Covers

GPU setup, model serving with Triton/vLLM, autoscaling, monitoring, and production best practices.

Setting Up GPU Nodes

NVIDIA Device Plugin

# Install NVIDIA device plugin
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml

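# Verify GPU nodes (this label is typically applied by NVIDIA GPU Feature Discovery
# or the GPU Operator, not by the device plugin manifest alone)
kubectl get nodes -l nvidia.com/gpu.present=true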

# Check GPU allocations
kubectl describe node gpu-node-1 | grep -A 10 "Allocated resources"
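
Once the plugin is running, a throwaway pod is a quick way to confirm that a container can actually see a GPU. The manifest below is a minimal sketch: the file name, pod name, and CUDA image tag are assumptions, and the toleration only matters if your GPU nodes are tainted with nvidia.com/gpu.

```yaml
# gpu-smoke-test.yaml (hypothetical file name)
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never          # Run nvidia-smi once and stop
  containers:
  - name: cuda
    image: nvidia/cuda:12.1.0-base-ubuntu22.04  # Assumed tag; any CUDA base image should work
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1       # Resource advertised by the device plugin
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```

If scheduling succeeds, kubectl logs gpu-smoke-test should show the usual nvidia-smi device table. A pod stuck in Pending usually means no node is advertising nvidia.com/gpu.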

Node Labels for GPU Types

# Label nodes by GPU type
kubectl label nodes gpu-node-1 gpu-type=a100
kubectl label nodes gpu-node-2 gpu-type=h100
kubectl label nodes gpu-node-3 gpu-type=rtx4090

# Label by VRAM
kubectl label nodes gpu-node-1 gpu-memory=80gb
kubectl label nodes gpu-node-2 gpu-memory=80gb
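
A nodeSelector can only match one exact value per label. When a workload can run on more than one of the GPU types labeled above, node affinity with the In operator is one way to express that. The fragment below is a sketch of a pod template that accepts either A100 or H100 nodes; which types are acceptable is an assumption about your workload.

```yaml
# Pod template fragment: accept either A100 or H100 nodes
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu-type         # Label applied with kubectl label above
            operator: In
            values: ["a100", "h100"]
```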

Basic ML Deployment

Simple Inference Server

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
  labels:
    app: llm-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
          - "--model"
          - "meta-llama/Llama-3.1-8B-Instruct"
          - "--tensor-parallel-size"
          - "1"
          - "--max-model-len"
          - "8192"
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            memory: "32Gi"
            cpu: "8"
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
        volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 180
          periodSeconds: 30
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache-pvc
      nodeSelector:
        gpu-type: a100
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
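
The Deployment above references a Secret named hf-token and a PersistentVolumeClaim named model-cache-pvc without defining them. A minimal sketch of both follows. The storage class name and size are assumptions, and the token value is a placeholder. Because two replicas mount the same claim, the sketch assumes a storage backend that supports ReadWriteMany; with ReadWriteOnce, both replicas would have to land on the same node.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: hf-token
type: Opaque
stringData:
  token: "<your-hugging-face-token>"  # Placeholder; never commit a real token
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-pvc
spec:
  accessModes:
  - ReadWriteMany                 # Shared by both replicas
  storageClassName: shared-fs     # Hypothetical storage class name
  resources:
    requests:
      storage: 100Gi              # Assumed size; must hold the model weights
```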

Service and Ingress

# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: llm-inference-svc
spec:
  selector:
    app: llm-inference
  ports:
  - port: 80
    targetPort: 8000
  type: ClusterIP
---
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llm-ingress
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
spec:
  ingressClassName: nginx
  rules:
  - host: api.yourdomain.com
    http:
      paths:
      - path: /v1
        pathType: Prefix
        backend:
          service:
            name: llm-inference-svc
            port:
              number: 80

Multi-GPU Models with Tensor Parallelism

# large-model-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-70b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-70b
  template:
    metadata:
      labels:
        app: llama-70b
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
          - "--model"
          - "meta-llama/Llama-3.1-70B-Instruct"
          - "--tensor-parallel-size"
          - "4"  # Spread across 4 GPUs
          - "--max-model-len"
          - "8192"
          - "--gpu-memory-utilization"
          - "0.9"
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 4  # Request 4 GPUs
          requests:
            memory: "320Gi"
            cpu: "32"
      nodeSelector:
        gpu-memory: 80gb  # Need high VRAM GPUs

โš ๏ธ Multi-GPU Scheduling

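Tensor parallelism here runs inside a single pod, and a pod's GPUs always come from one node. The pod stays Pending unless some node has 4 unallocated GPUs, so make sure your GPU nodes are large enough.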
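
Tensor-parallel workers in vLLM exchange data through shared memory, and the default /dev/shm in a container is small. A common workaround is to mount a memory-backed emptyDir at /dev/shm. The fragment below is a sketch of the additions to the llama-70b pod template; the volume name and size limit are assumptions.

```yaml
# Pod template fragment for the llama-70b Deployment
spec:
  containers:
  - name: vllm
    volumeMounts:
    - name: shm
      mountPath: /dev/shm
  volumes:
  - name: shm
    emptyDir:
      medium: Memory      # Backed by RAM, counts against the pod's memory
      sizeLimit: 16Gi     # Assumed size; tune for your model
```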

GPU-Aware Autoscaling

Horizontal Pod Autoscaler

# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Pods
        value: 2
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scale down
      policies:
      - type: Pods
        value: 1
        periodSeconds: 120

Custom Metrics with Prometheus

# prometheus-adapter-config.yaml
rules:
- seriesQuery: 'vllm:num_requests_running{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)$"
    as: "vllm_active_requests"
  metricsQuery: 'sum(vllm:num_requests_running{<<.LabelMatchers>>}) by (<<.GroupBy>>)'

- seriesQuery: 'vllm:gpu_cache_usage_perc{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)$"
    as: "vllm_gpu_cache_usage"
  metricsQuery: 'avg(vllm:gpu_cache_usage_perc{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
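
Once the adapter exposes these series through the custom metrics API, an HPA can scale on them directly. The manifest below is a sketch of a variant of the earlier HPA that targets the vllm_active_requests metric defined in the rules above. The file name and the target value of 20 are assumptions, and it is meant to replace the earlier HPA rather than run alongside it, since two HPAs on one Deployment conflict.

```yaml
# hpa-custom-metric.yaml (hypothetical file name)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_active_requests   # Name produced by the adapter rule
      target:
        type: AverageValue
        averageValue: "20"           # Assumed target; tune against measured latency
```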

NVIDIA Triton Inference Server

Model Repository Structure

# models/
# └── llama_8b/
#     ├── config.pbtxt
#     └── 1/
#         └── model.py

# config.pbtxt
name: "llama_8b"
backend: "vllm"
max_batch_size: 64

input [
  {
    name: "text_input"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]

output [
  {
    name: "text_output"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]

instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

parameters: {
  key: "model"
  value: {string_value: "meta-llama/Llama-3.1-8B-Instruct"}
}

parameters: {
  key: "max_tokens"
  value: {string_value: "2048"}
}

Triton Deployment

# triton-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: triton-server
  template:
    metadata:
      labels:
        app: triton-server
    spec:
      containers:
      - name: triton
        image: nvcr.io/nvidia/tritonserver:24.01-vllm-python-py3
        args:
          - tritonserver
          - --model-repository=/models
          - --strict-model-config=false
          - --log-verbose=1
        ports:
        - containerPort: 8000  # HTTP
        - containerPort: 8001  # gRPC
        - containerPort: 8002  # Metrics
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            memory: "32Gi"
            cpu: "8"
        volumeMounts:
        - name: models
          mountPath: /models
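        readinessProbe:
          httpGet:
            path: /v2/health/ready  # Ready only after models are loaded
            port: 8000
          initialDelaySeconds: 60
        livenessProbe:
          httpGet:
            path: /v2/health/live
            port: 8000
          initialDelaySeconds: 60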
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: model-repository-pvc
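
The Triton Deployment opens three ports but nothing routes traffic to them yet. A Service along the following lines would expose the HTTP, gRPC, and metrics endpoints inside the cluster. This is a sketch; the Service name and the port names are assumptions.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: triton-server-svc   # Hypothetical name
  labels:
    app: triton-server
spec:
  selector:
    app: triton-server
  ports:
  - name: http
    port: 8000
    targetPort: 8000
  - name: grpc
    port: 8001
    targetPort: 8001
  - name: metrics
    port: 8002
    targetPort: 8002
  type: ClusterIP
```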

Model Versioning with Argo Rollouts

# rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: llm-inference
spec:
  replicas: 4
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.3.0
        # ... container spec
  strategy:
    canary:
      steps:
      - setWeight: 10      # Send 10% traffic to new version
      - pause: {duration: 5m}
      - analysis:
          templates:
          - templateName: latency-check
      - setWeight: 50      # If analysis passes, 50%
      - pause: {duration: 10m}
      - analysis:
          templates:
          - templateName: latency-check
---
# analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: latency-check
spec:
  metrics:
  - name: p99-latency
    interval: 1m
    successCondition: result[0] < 2.0  # P99 under 2 seconds (Prometheus returns a vector)
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          histogram_quantile(0.99,
            sum(rate(vllm:e2e_request_latency_seconds_bucket{app="llm-inference"}[5m]))
            by (le)
          )

Monitoring Stack

Prometheus ServiceMonitor

# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: llm-inference-monitor
spec:
  selector:
    matchLabels:
      app: llm-inference
  endpoints:
  - port: metrics   # Must be the name of a port on a Service labeled app: llm-inference
    interval: 15s
    path: /metrics
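
A ServiceMonitor selects Services by label and scrapes a port by name, so it needs a Service that carries the app: llm-inference label and has a port called metrics. The sketch below is one way to provide that. The Service name is an assumption, and it relies on vLLM serving /metrics on the same port as its API (8000).

```yaml
apiVersion: v1
kind: Service
metadata:
  name: llm-inference-metrics   # Hypothetical name
  labels:
    app: llm-inference          # Matched by the ServiceMonitor selector
spec:
  selector:
    app: llm-inference
  ports:
  - name: metrics               # Matched by the ServiceMonitor endpoint port
    port: 8000
    targetPort: 8000
```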

Key Metrics to Monitor

# Grafana Dashboard Queries

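# GPU metric names depend on the exporter in use. With NVIDIA's DCGM exporter,
# utilization is DCGM_FI_DEV_GPU_UTIL rather than the names below.

# GPU Utilization
nvidia_gpu_duty_cycle{pod=~"llm-.*"}

# GPU Memory Usage
nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes * 100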

# Request Latency P99
histogram_quantile(0.99, sum(rate(vllm:e2e_request_latency_seconds_bucket[5m])) by (le))

# Tokens per Second
rate(vllm:generation_tokens_total[5m])

# Queue Depth
vllm:num_requests_waiting

# Active Requests
vllm:num_requests_running

Cost Optimization

Spot/Preemptible Nodes for Training

# training-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: fine-tuning-job
spec:
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-spot: "true"  # GKE Spot
        # OR
        # kubernetes.azure.com/scalesetpriority: spot  # AKS Spot
      tolerations:
      - key: cloud.google.com/gke-spot
        operator: Equal
        value: "true"
        effect: NoSchedule
      containers:
      - name: trainer
        image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel
        resources:
          limits:
            nvidia.com/gpu: 8
      restartPolicy: OnFailure
  backoffLimit: 3

GPU Sharing with MPS

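# For smaller models, share GPUs
apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-pod
spec:
  containers:
  - name: model-a
    image: vllm/vllm-openai:latest
    resources:
      limits:
        # With MPS sharing enabled in the device plugin, each nvidia.com/gpu
        # unit is a slice of a physical GPU. The slice size comes from the
        # plugin's replica count; the stock NVIDIA plugin has no per-pod
        # percentage resource.
        nvidia.com/gpu: 1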
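
In the NVIDIA device plugin, MPS sharing is switched on in the plugin's own configuration file rather than in the pod spec. The fragment below is a sketch of that configuration with an assumed replica count of 2, so each physical GPU is advertised as two schedulable nvidia.com/gpu units. MPS support was added in a plugin release later than the v0.14.0 pinned earlier in this guide, so check the release notes for the version you deploy.

```yaml
# Device plugin configuration file (a sketch)
version: v1
sharing:
  mps:
    resources:
    - name: nvidia.com/gpu
      replicas: 2   # Assumed value: 2 pods can share each physical GPU
```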

Security Best Practices

# network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: llm-inference-policy
spec:
  podSelector:
    matchLabels:
      app: llm-inference
  policyTypes:
  - Ingress
  - Egress
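  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: ingress-nginx  # Label set automatically on every namespace
    ports:
    - port: 8000
  egress:
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0  # External HTTPS for model downloads
    ports:
    - port: 443
  - ports:  # DNS, required to resolve the download host
    - port: 53
      protocol: UDP
    - port: 53
      protocol: TCP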
---
# pod-security.yaml
apiVersion: v1
kind: Pod
metadata:
  name: secure-inference
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 1000
  containers:
  - name: vllm
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]
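
With readOnlyRootFilesystem enabled, the container can no longer write to its default temp and cache directories, which an inference server typically needs. One way around this is to mount emptyDir volumes at the writable paths. The fragment below is a sketch; the /cache path is an assumption, and HF_HOME is the Hugging Face environment variable that relocates the model cache.

```yaml
# Pod spec fragment: writable scratch space alongside readOnlyRootFilesystem
spec:
  containers:
  - name: vllm
    env:
    - name: HF_HOME
      value: /cache/huggingface   # Assumed cache location
    volumeMounts:
    - name: tmp
      mountPath: /tmp
    - name: cache
      mountPath: /cache
  volumes:
  - name: tmp
    emptyDir: {}
  - name: cache
    emptyDir: {}
```

The fsGroup: 1000 setting in the pod-level securityContext makes these volumes writable by the non-root user.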

Complete Production Setup

# Full deployment with all best practices
kubectl apply -f - <

Conclusion

Kubernetes provides the foundation for production ML deployments at scale. Key takeaways:

  • Use node selectors to target specific GPU types
  • Implement proper health checks for model warm-up time
  • Configure autoscaling based on custom metrics
  • Use canary deployments for safe model updates
  • Monitor GPU utilization and request latency

Whether you're running your own cluster or using managed infrastructure like GPUBrazil, these patterns will help you build reliable, scalable ML systems.