Kubernetes for ML: Deploy and Scale AI Models in Production

Why Kubernetes for ML?

Kubernetes has become the de facto standard for deploying ML workloads in production. Here's why:

GPU scheduling: Efficiently manage expensive GPU resources
Autoscaling: Scale inference replicas based on demand
Rolling updates: Zero-downtime model deployments
Resource isolation: Separate training from inference
Reproducibility: Containerized, versioned deployments

💡 This Guide Covers

GPU setup, model serving with Triton/vLLM, autoscaling, monitoring, and production best practices.

Setting Up GPU Nodes

NVIDIA Device Plugin

# Install NVIDIA device plugin
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml

# Verify GPU nodes
kubectl get nodes -l nvidia.com/gpu.present=true

# Check GPU allocations
kubectl describe node gpu-node-1 | grep -A 10 "Allocated resources"

Node Labels for GPU Types

# Label nodes by GPU type
kubectl label nodes gpu-node-1 gpu-type=a100
kubectl label nodes gpu-node-2 gpu-type=h100
kubectl label nodes gpu-node-3 gpu-type=rtx4090

# Label by VRAM
kubectl label nodes gpu-node-1 gpu-memory=80gb
kubectl label nodes gpu-node-2 gpu-memory=80gb

Basic ML Deployment

Simple Inference Server

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
  labels:
    app: llm-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
          - "--model"
          - "meta-llama/Llama-3.1-8B-Instruct"
          - "--tensor-parallel-size"
          - "1"
          - "--max-model-len"
          - "8192"
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            memory: "32Gi"
            cpu: "8"
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
        volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 180
          periodSeconds: 30
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache-pvc
      nodeSelector:
        gpu-type: a100
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule

Service and Ingress

# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: llm-inference-svc
spec:
  selector:
    app: llm-inference
  ports:
  - port: 80
    targetPort: 8000
  type: ClusterIP
---
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llm-ingress
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
spec:
  ingressClassName: nginx
  rules:
  - host: api.yourdomain.com
    http:
      paths:
      - path: /v1
        pathType: Prefix
        backend:
          service:
            name: llm-inference-svc
            port:
              number: 80

Multi-GPU Models with Tensor Parallelism

# large-model-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-70b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-70b
  template:
    metadata:
      labels:
        app: llama-70b
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
          - "--model"
          - "meta-llama/Llama-3.1-70B-Instruct"
          - "--tensor-parallel-size"
          - "4"  # Spread across 4 GPUs
          - "--max-model-len"
          - "8192"
          - "--gpu-memory-utilization"
          - "0.9"
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 4  # Request 4 GPUs
          requests:
            memory: "320Gi"
            cpu: "32"
      nodeSelector:
        gpu-memory: 80gb  # Need high VRAM GPUs

⚠️ Multi-GPU Scheduling

All requested GPUs must be on the same node for tensor parallelism. Ensure nodes have enough GPUs.

GPU-Aware Autoscaling

Horizontal Pod Autoscaler

# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Pods
        value: 2
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scale down
      policies:
      - type: Pods
        value: 1
        periodSeconds: 120

Custom Metrics with Prometheus

# prometheus-adapter-config.yaml
rules:
- seriesQuery: 'vllm:num_requests_running{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)$"
    as: "vllm_active_requests"
  metricsQuery: 'sum(vllm:num_requests_running{<<.LabelMatchers>>}) by (<<.GroupBy>>)'

- seriesQuery: 'vllm:gpu_cache_usage_perc{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)$"
    as: "vllm_gpu_cache_usage"
  metricsQuery: 'avg(vllm:gpu_cache_usage_perc{<<.LabelMatchers>>}) by (<<.GroupBy>>)'

NVIDIA Triton Inference Server

Model Repository Structure

# models/
# └── llama_8b/
#     ├── config.pbtxt
#     └── 1/
#         └── model.py

# config.pbtxt
name: "llama_8b"
backend: "vllm"
max_batch_size: 64

input [
  {
    name: "text_input"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]

output [
  {
    name: "text_output"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]

instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

parameters: {
  key: "model"
  value: {string_value: "meta-llama/Llama-3.1-8B-Instruct"}
}

parameters: {
  key: "max_tokens"
  value: {string_value: "2048"}
}

Triton Deployment

# triton-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: triton-server
  template:
    metadata:
      labels:
        app: triton-server
    spec:
      containers:
      - name: triton
        image: nvcr.io/nvidia/tritonserver:24.01-vllm-python-py3
        args:
          - tritonserver
          - --model-repository=/models
          - --strict-model-config=false
          - --log-verbose=1
        ports:
        - containerPort: 8000  # HTTP
        - containerPort: 8001  # gRPC
        - containerPort: 8002  # Metrics
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            memory: "32Gi"
            cpu: "8"
        volumeMounts:
        - name: models
          mountPath: /models
        livenessProbe:
          httpGet:
            path: /v2/health/live
            port: 8000
          initialDelaySeconds: 60
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: model-repository-pvc

Model Versioning with Argo Rollouts

# rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: llm-inference
spec:
  replicas: 4
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.3.0
        # ... container spec
  strategy:
    canary:
      steps:
      - setWeight: 10      # Send 10% traffic to new version
      - pause: {duration: 5m}
      - analysis:
          templates:
          - templateName: latency-check
      - setWeight: 50      # If analysis passes, 50%
      - pause: {duration: 10m}
      - analysis:
          templates:
          - templateName: latency-check
---
# analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: latency-check
spec:
  metrics:
  - name: p99-latency
    interval: 1m
    successCondition: result < 2.0  # P99 under 2 seconds
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          histogram_quantile(0.99, 
            sum(rate(vllm_request_latency_bucket{app="llm-inference"}[5m])) 
            by (le)
          )

Production-Ready GPU Infrastructure

Deploy your ML workloads on GPUBrazil's managed Kubernetes clusters with H100 and A100 GPUs.

Get Started →

Monitoring Stack

Prometheus ServiceMonitor

# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: llm-inference-monitor
spec:
  selector:
    matchLabels:
      app: llm-inference
  endpoints:
  - port: metrics
    interval: 15s
    path: /metrics

Key Metrics to Monitor

# Grafana Dashboard Queries

# GPU Utilization
nvidia_gpu_duty_cycle{pod=~"llm-.*"}

# GPU Memory Usage
nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes * 100

# Request Latency P99
histogram_quantile(0.99, sum(rate(vllm_request_latency_bucket[5m])) by (le))

# Tokens per Second
rate(vllm_generation_tokens_total[5m])

# Queue Depth
vllm_num_requests_waiting

# Active Requests
vllm_num_requests_running

Cost Optimization

Spot/Preemptible Nodes for Training

# training-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: fine-tuning-job
spec:
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-spot: "true"  # GKE Spot
        # OR
        # kubernetes.azure.com/scalesetpriority: spot  # AKS Spot
      tolerations:
      - key: cloud.google.com/gke-spot
        operator: Equal
        value: "true"
        effect: NoSchedule
      containers:
      - name: trainer
        image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel
        resources:
          limits:
            nvidia.com/gpu: 8
      restartPolicy: OnFailure
  backoffLimit: 3

GPU Sharing with MPS

# For smaller models, share GPUs
apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-pod
spec:
  containers:
  - name: model-a
    resources:
      limits:
        nvidia.com/gpu: 1  # With MPS, multiple pods can share
        nvidia.com/mps.percentage: "50"  # Request 50% of GPU

Security Best Practices

# network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: llm-inference-policy
spec:
  podSelector:
    matchLabels:
      app: llm-inference
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: ingress-nginx
    ports:
    - port: 8000
  egress:
  - to:
    - namespaceSelector: {}
    ports:
    - port: 443  # HTTPS for model downloads
---
# pod-security.yaml
apiVersion: v1
kind: Pod
metadata:
  name: secure-inference
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 1000
  containers:
  - name: vllm
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]

Complete Production Setup

# Full deployment with all best practices
kubectl apply -f - <


            
            Conclusion
            
            Kubernetes provides the foundation for production ML deployments at scale. Key takeaways:
            
            
                Use node selectors to target specific GPU types
                Implement proper health checks for model warm-up time
                Configure autoscaling based on custom metrics
                Use canary deployments for safe model updates
                Monitor GPU utilization and request latency
            
            
            Whether you're running your own cluster or using managed infrastructure like GPUBrazil, these patterns will help you build reliable, scalable ML systems.