Why Kubernetes for ML?
Kubernetes has become the de facto standard for deploying ML workloads in production. Here's why:
- GPU scheduling: Efficiently manage expensive GPU resources
- Autoscaling: Scale inference replicas based on demand
- Rolling updates: Zero-downtime model deployments
- Resource isolation: Separate training from inference
- Reproducibility: Containerized, versioned deployments
💡 This Guide Covers
GPU setup, model serving with Triton/vLLM, autoscaling, monitoring, and production best practices.
Setting Up GPU Nodes
NVIDIA Device Plugin
# Install NVIDIA device plugin
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
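# Verify GPU nodes (this label is applied by GPU Feature Discovery or the GPU Operator, not by the device plugin alone)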
kubectl get nodes -l nvidia.com/gpu.present=true
# Check GPU allocations
kubectl describe node gpu-node-1 | grep -A 10 "Allocated resources"
Node Labels for GPU Types
# Label nodes by GPU type
kubectl label nodes gpu-node-1 gpu-type=a100
kubectl label nodes gpu-node-2 gpu-type=h100
kubectl label nodes gpu-node-3 gpu-type=rtx4090
# Label by VRAM
kubectl label nodes gpu-node-1 gpu-memory=80gb
kubectl label nodes gpu-node-2 gpu-memory=80gb
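Once nodes carry these labels, a workload can accept more than one GPU type instead of pinning to a single value. The fragment below is a minimal sketch, assuming the `gpu-type` label values applied above. It uses standard node affinity and belongs under the Pod template's `spec`.

```yaml
# Pod template fragment (place under spec.template.spec of a Deployment).
# Assumes nodes were labeled with gpu-type as in the commands above.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: gpu-type
          operator: In
          values:
          - a100   # either GPU type is acceptable
          - h100
```

A plain `nodeSelector` is simpler when only one GPU type is acceptable. Affinity is worth the extra lines mainly when several types can serve the same model.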
Basic ML Deployment
Simple Inference Server
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
  labels:
    app: llm-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - "--model"
        - "meta-llama/Llama-3.1-8B-Instruct"
        - "--tensor-parallel-size"
        - "1"
        - "--max-model-len"
        - "8192"
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            memory: "32Gi"
            cpu: "8"
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
        volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 180
          periodSeconds: 30
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache-pvc
      nodeSelector:
        gpu-type: a100
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
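The Deployment above references a Secret named `hf-token` and a PersistentVolumeClaim named `model-cache-pvc` without defining them. A minimal sketch of both follows. The names come from the manifest, while the token placeholder, access mode, and storage size are assumptions to adapt to your cluster.

```yaml
# Secret read by secretKeyRef (name: hf-token, key: token).
apiVersion: v1
kind: Secret
metadata:
  name: hf-token
type: Opaque
stringData:
  token: "<your-hugging-face-token>"  # placeholder; never commit a real token
---
# PVC mounted as the model-cache volume.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-pvc
spec:
  accessModes:
  - ReadWriteMany   # assumption: replicas on different nodes share one cache
  resources:
    requests:
      storage: 100Gi   # assumption: size this to your model weights
```

`ReadWriteMany` requires a storage class that supports it. With a `ReadWriteOnce` volume, the two replicas would generally need to land on the same node.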
Service and Ingress
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: llm-inference-svc
spec:
  selector:
    app: llm-inference
  ports:
  - port: 80
    targetPort: 8000
  type: ClusterIP
---
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llm-ingress
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
spec:
  ingressClassName: nginx
  rules:
  - host: api.yourdomain.com
    http:
      paths:
      - path: /v1
        pathType: Prefix
        backend:
          service:
            name: llm-inference-svc
            port:
              number: 80
Multi-GPU Models with Tensor Parallelism
# large-model-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-70b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-70b
  template:
    metadata:
      labels:
        app: llama-70b
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - "--model"
        - "meta-llama/Llama-3.1-70B-Instruct"
        - "--tensor-parallel-size"
        - "4"  # Spread across 4 GPUs
        - "--max-model-len"
        - "8192"
        - "--gpu-memory-utilization"
        - "0.9"
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 4  # Request 4 GPUs
          requests:
            memory: "320Gi"
            cpu: "32"
      nodeSelector:
        gpu-memory: 80gb  # Need high VRAM GPUs
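⚠️ Multi-GPU Scheduling
A Pod always runs on a single node, so all the GPUs it requests for tensor parallelism must be free on that one node. If no node has 4 unallocated GPUs, the Pod stays Pending, so make sure your nodes have enough GPUs.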
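Tensor-parallel workers in one Pod typically exchange data through shared memory, and the default `/dev/shm` in a container is often only 64 MiB. A common remedy is a memory-backed `emptyDir` mounted at `/dev/shm`. The fragment below is a sketch to merge into the `llama-70b` Pod template; the volume name and size limit are assumptions.

```yaml
# Fragment to merge into spec.template.spec of the llama-70b Deployment.
containers:
- name: vllm
  volumeMounts:
  - name: shm              # hypothetical volume name
    mountPath: /dev/shm
volumes:
- name: shm
  emptyDir:
    medium: Memory         # backed by node RAM, counts toward Pod memory
    sizeLimit: 16Gi        # assumption: tune to your workload
```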
GPU-Aware Autoscaling
Horizontal Pod Autoscaler
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Pods
        value: 2
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scale down
      policies:
      - type: Pods
        value: 1
        periodSeconds: 120
Custom Metrics with Prometheus
# prometheus-adapter-config.yaml
rules:
- seriesQuery: 'vllm:num_requests_running{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)$"
    as: "vllm_active_requests"
  metricsQuery: 'sum(vllm:num_requests_running{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
- seriesQuery: 'vllm:gpu_cache_usage_perc{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)$"
    as: "vllm_gpu_cache_usage"
  metricsQuery: 'avg(vllm:gpu_cache_usage_perc{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
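Once the adapter exposes `vllm_active_requests`, an HPA can scale on it directly. The manifest below is a sketch, assuming the adapter rules above are loaded. The file name, HPA name, and target value are hypothetical and should be tuned against measured per-replica capacity.

```yaml
# hpa-custom.yaml (hypothetical file name)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa-custom   # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_active_requests   # metric name produced by the adapter rule
      target:
        type: AverageValue
        averageValue: "8"            # assumption: in-flight requests per Pod
```

Only one HPA should target a given Deployment, so this would replace the earlier `llm-inference-hpa` rather than run alongside it.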
NVIDIA Triton Inference Server
Model Repository Structure
# models/
# └── llama_8b/
#     ├── config.pbtxt
#     └── 1/
#         └── model.py
# config.pbtxt
name: "llama_8b"
backend: "vllm"
max_batch_size: 64
input [
  {
    name: "text_input"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
output [
  {
    name: "text_output"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
parameters: {
  key: "model"
  value: {string_value: "meta-llama/Llama-3.1-8B-Instruct"}
}
parameters: {
  key: "max_tokens"
  value: {string_value: "2048"}
}
Triton Deployment
# triton-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: triton-server
  template:
    metadata:
      labels:
        app: triton-server
    spec:
      containers:
      - name: triton
        image: nvcr.io/nvidia/tritonserver:24.01-vllm-python-py3
        args:
        - tritonserver
        - --model-repository=/models
        - --strict-model-config=false
        - --log-verbose=1
        ports:
        - containerPort: 8000  # HTTP
        - containerPort: 8001  # gRPC
        - containerPort: 8002  # Metrics
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            memory: "32Gi"
            cpu: "8"
        volumeMounts:
        - name: models
          mountPath: /models
        livenessProbe:
          httpGet:
            path: /v2/health/live
            port: 8000
          initialDelaySeconds: 60
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: model-repository-pvc
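The Deployment above defines only a liveness probe, so a Pod could receive traffic before its models finish loading. Triton also serves a readiness endpoint at `/v2/health/ready`. The fragment below is a sketch of a matching readiness probe for the `triton` container; the timing values are assumptions.

```yaml
# Fragment to add to the triton container, next to livenessProbe.
readinessProbe:
  httpGet:
    path: /v2/health/ready   # reports ready once the server and models are loaded
    port: 8000
  initialDelaySeconds: 60    # assumption: mirrors the liveness delay
  periodSeconds: 10          # assumption
```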
Model Versioning with Argo Rollouts
# rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: llm-inference
spec:
  replicas: 4
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.3.0
        # ... container spec
  strategy:
    canary:
      steps:
      - setWeight: 10  # Send 10% traffic to new version
      - pause: {duration: 5m}
      - analysis:
          templates:
          - templateName: latency-check
      - setWeight: 50  # If analysis passes, 50%
      - pause: {duration: 10m}
      - analysis:
          templates:
          - templateName: latency-check
---
# analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: latency-check
spec:
  metrics:
  - name: p99-latency
    interval: 1m
    successCondition: result[0] < 2.0  # P99 under 2 seconds; Prometheus results arrive as a list
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          histogram_quantile(0.99,
            sum(rate(vllm:e2e_request_latency_seconds_bucket{app="llm-inference"}[5m]))
            by (le)
          )
Monitoring Stack
Prometheus ServiceMonitor
# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: llm-inference-monitor
spec:
  selector:
    matchLabels:
      app: llm-inference
  endpoints:
  - port: metrics
    interval: 15s
    path: /metrics
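A ServiceMonitor selects Services by their labels and scrapes a Service port by name. The `llm-inference-svc` Service defined earlier has neither an `app: llm-inference` label nor a port named `metrics`, so it would not be picked up as written. One option is a small dedicated Service. The sketch below assumes vLLM serves `/metrics` on its API port 8000; the Service name is hypothetical.

```yaml
# metrics-service.yaml (hypothetical file name)
apiVersion: v1
kind: Service
metadata:
  name: llm-inference-metrics   # hypothetical name
  labels:
    app: llm-inference          # matched by the ServiceMonitor's selector
spec:
  selector:
    app: llm-inference          # selects the inference Pods
  ports:
  - name: metrics               # matched by the ServiceMonitor's endpoint port
    port: 8000
    targetPort: 8000
```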
Key Metrics to Monitor
# Grafana Dashboard Queries
# GPU Utilization
nvidia_gpu_duty_cycle{pod=~"llm-.*"}
# GPU Memory Usage
nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes * 100
# Request Latency P99
histogram_quantile(0.99, sum(rate(vllm:e2e_request_latency_seconds_bucket[5m])) by (le))
# Tokens per Second
rate(vllm:generation_tokens_total[5m])
# Queue Depth
vllm:num_requests_waiting
# Active Requests
vllm:num_requests_running
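Dashboard queries like these can also drive alerts. The sketch below assumes the Prometheus Operator is installed (it provides the `PrometheusRule` resource) and uses the colon-prefixed metric naming from the adapter config earlier. The rule name, threshold, and duration are hypothetical.

```yaml
# alerts.yaml (hypothetical file name)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: llm-inference-alerts   # hypothetical name
spec:
  groups:
  - name: llm-inference
    rules:
    - alert: LLMQueueBacklog
      expr: sum(vllm:num_requests_waiting) > 50   # assumption: tune the threshold
      for: 5m                                     # must hold for 5 minutes before firing
      labels:
        severity: warning
      annotations:
        summary: "vLLM requests are queuing faster than they are served"
```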
Cost Optimization
Spot/Preemptible Nodes for Training
# training-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: fine-tuning-job
spec:
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-spot: "true"  # GKE Spot
        # OR
        # kubernetes.azure.com/scalesetpriority: spot  # AKS Spot
      tolerations:
      - key: cloud.google.com/gke-spot
        operator: Equal
        value: "true"
        effect: NoSchedule
      containers:
      - name: trainer
        image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel
        resources:
          limits:
            nvidia.com/gpu: 8
      restartPolicy: OnFailure
  backoffLimit: 3
GPU Sharing with MPS
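# For smaller models, share GPUs
apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-pod
spec:
  containers:
  - name: model-a
    resources:
      limits:
        nvidia.com/gpu: 1  # With MPS, multiple pods can share
        # Only valid if your device plugin advertises this resource; the stock
        # NVIDIA plugin sets each Pod's share through its MPS replica count instead
        nvidia.com/mps.percentage: "50"  # Request 50% of GPU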
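With the NVIDIA device plugin, MPS sharing is switched on in the plugin's own configuration rather than in the Pod spec. The fragment below is a sketch based on the plugin's sharing config format. It assumes a plugin release with MPS support, which to my understanding means newer than the v0.14.0 installed earlier, so check the release notes for your version.

```yaml
# NVIDIA device plugin config fragment (supplied via the plugin's config file).
version: v1
sharing:
  mps:
    resources:
    - name: nvidia.com/gpu
      replicas: 2   # advertise each physical GPU as 2 schedulable units
```

With `replicas: 2`, each Pod requesting `nvidia.com/gpu: 1` would receive roughly half of a physical GPU.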
Security Best Practices
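# network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: llm-inference-policy
spec:
  podSelector:
    matchLabels:
      app: llm-inference
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: ingress-nginx  # label set automatically on every namespace
    ports:
    - port: 8000
  egress:
  - ports:
    - port: 443  # HTTPS for model downloads; no "to" because a namespaceSelector only matches in-cluster Pods
  - ports:
    - port: 53  # DNS, needed to resolve download hosts
      protocol: UDP
    - port: 53
      protocol: TCP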
---
# pod-security.yaml
apiVersion: v1
kind: Pod
metadata:
  name: secure-inference
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 1000
  containers:
  - name: vllm
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]
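With `readOnlyRootFilesystem: true`, the container can write only to mounted volumes, and an inference server usually needs at least a scratch directory and a model cache. The fragment below sketches one way to provide them with `emptyDir` volumes. The volume names and the cache path are hypothetical and depend on the image and the user it runs as.

```yaml
# Fragment to merge into the secure-inference Pod spec.
containers:
- name: vllm
  volumeMounts:
  - name: tmp
    mountPath: /tmp                # scratch space for temporary files
  - name: cache
    mountPath: /home/app/.cache    # hypothetical cache path for the non-root user
volumes:
- name: tmp
  emptyDir: {}
- name: cache
  emptyDir: {}
```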
Complete Production Setup
# Full deployment with all best practices
kubectl apply -f - <
Conclusion
Kubernetes provides the foundation for production ML deployments at scale. Key takeaways:
- Use node selectors to target specific GPU types
- Implement proper health checks for model warm-up time
- Configure autoscaling based on custom metrics
- Use canary deployments for safe model updates
- Monitor GPU utilization and request latency
Whether you're running your own cluster or using managed infrastructure like GPUBrazil, these patterns will help you build reliable, scalable ML systems.