The GPU Operator Stack

Modern Kubernetes GPU support isn't a single component — it's a layered stack, and understanding each layer matters when something breaks at 2 AM.

┌─────────────────────────────────────────────────────────┐
│  Your ML Workload (training/inference/batch)             │
├─────────────────────────────────────────────────────────┤
│  Kubernetes Scheduler (filters + scores GPU nodes)      │
├─────────────────────────────────────────────────────────┤
│  NVIDIA Device Plugin ( advertises GPU resources )       │
│  DCGM Exporter ( exposes GPU metrics to Prometheus )     │
│  GPU Feature Discovery ( labels nodes with GPU info )  │
├─────────────────────────────────────────────────────────┤
│  NVIDIA Driver ( kernel module on each node )           │
├─────────────────────────────────────────────────────────┤
│  Underlying GPU hardware ( A100, H100, L40S, etc. )     │
└─────────────────────────────────────────────────────────┘

The NVIDIA GPU Operator manages all the Kubernetes-layer components automatically. It provisions the Device Plugin, DCGM Exporter, Driver Manager, and GPU Feature Discovery as operator-managed pods — so you don't have to track version alignment across a cluster manually.

Installing the GPU Operator

# Add the NVIDIA Helm repository (registered here under the local alias "nvdp")
helm repo add nvdp https://helm.ngc.nvidia.com/nvidia
helm repo update

# Create a namespace for the operator
kubectl create namespace gpu-operator

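# Install the GPU Operator with Helm.
# Comments cannot follow a trailing backslash, so the flags are explained here:
#   driver.enabled=false   drivers are pre-installed on the node
#   toolkit.enabled=true   deploy the Container Toolkit for Docker/containerd
#   dcgmExporter.*         GPU metrics for Prometheus, plus a ServiceMonitor
helm install gpu-operator nvdp/gpu-operator \
  --namespace gpu-operator \
  --set driver.enabled=false \
  --set toolkit.enabled=true \
  --set dcgmExporter.enabled=true \
  --set dcgmExporter.serviceMonitor.enabled=true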

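When the NVIDIA driver is already installed on the nodes through OS packages or a GPU-ready machine image, set driver.enabled=false so the operator skips its containerized driver and manages only the Kubernetes-facing components. The Container Toolkit is critical: it configures the container runtime to inject the GPU device nodes and the driver's user-space libraries (libcuda, nvidia-smi) into every container that requests a GPU. Your images still need the CUDA toolkit libraries they link against (cuDNN, cuBLAS, and so on), which is why the examples below use CUDA-enabled base images, but they never need to bundle the driver itself.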

Verifying the Installation

# Check that the Device Plugin is running
kubectl get pods -n gpu-operator

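# Query available GPU resources and the labels set by GPU Feature Discovery
kubectl describe node <gpu-node-name> | grep "nvidia.com/gpu"

# Expected output (excerpt; values depend on your hardware):
#   nvidia.com/gpu.count=4                          (node label)
#   nvidia.com/gpu.memory=40960                     (node label, MiB per GPU)
#   nvidia.com/gpu.product=NVIDIA-A100-SXM4-40GB    (node label)
#   nvidia.com/gpu:  4                              (under Capacity and Allocatable)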

If nvidia.com/gpu shows 0 after installation, the Device Plugin isn't communicating with the GPUs. Common causes: missing kernel module (nvidia.ko), incorrect container runtime configuration, or the node was not rebooted after driver install.
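
The same check can be scripted across a whole cluster. The following is a minimal sketch, assuming you feed it the parsed output of `kubectl get nodes -o json`; the function name and the sample data are illustrative and not part of any NVIDIA tooling.

```python
from typing import Dict

GPU_RESOURCE = "nvidia.com/gpu"


def allocatable_gpus(node_list: dict) -> Dict[str, int]:
    """Map node name -> allocatable GPU count from a parsed NodeList object."""
    result = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # Nodes without the device plugin have no nvidia.com/gpu key at all.
        result[name] = int(allocatable.get(GPU_RESOURCE, "0"))
    return result


# Hypothetical sample shaped like `kubectl get nodes -o json` output.
sample = {
    "items": [
        {"metadata": {"name": "gpu-node-1"},
         "status": {"allocatable": {"cpu": "64", "nvidia.com/gpu": "4"}}},
        {"metadata": {"name": "cpu-node-1"},
         "status": {"allocatable": {"cpu": "16"}}},
    ]
}

for node_name, count in allocatable_gpus(sample).items():
    status = "OK" if count > 0 else "no GPUs advertised"
    print(f"{node_name}: {count} ({status})")
```

On a real cluster, replace `sample` with `json.loads(...)` applied to the kubectl output. A GPU node that reports 0 is the case described above.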

Requesting GPUs in Pod Specs

GPU scheduling in Kubernetes requires explicit resource requests. There's no magic discovery — you must tell the scheduler you need GPU hardware.

apiVersion: v1
kind: Pod
metadata:
  name: pytorch-training-job
spec:
  restartPolicy: OnFailure
  containers:
  - name: training
    image: pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime
    command: ["python", "/app/train.py"]
    resources:
      limits:
        nvidia.com/gpu: "2"      # Request 2 GPUs
        memory: "64Gi"
        cpu: "16"
      requests:
        memory: "32Gi"
        cpu: "8"
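    # No CUDA_VISIBLE_DEVICES needed: the device plugin and container runtime
    # expose only the allocated GPUs, numbered from 0 inside the container.

Critical rule: Do not hard-code CUDA_VISIBLE_DEVICES or NVIDIA_VISIBLE_DEVICES in the pod spec. The device plugin assigns specific GPUs (by UUID) to each container, and the NVIDIA container runtime makes only those devices visible, so a pod that requests 2 GPUs always sees them as devices 0 and 1. Overriding NVIDIA_VISIBLE_DEVICES by hand (for example, setting it to "all") bypasses that allocation, and you can end up with two containers fighting over the same GPU while a third GPU sits idle.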
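
If you generate pod specs programmatically, the GPU request rules can be enforced in code. This is a minimal sketch, assuming a hypothetical helper named `gpu_pod_manifest`; it only encodes two facts about extended resources: GPUs are requested under limits, and the count must be a whole number.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int) -> dict:
    """Build a Pod manifest that requests whole GPUs via resource limits."""
    if not isinstance(gpus, int) or gpus < 1:
        # Extended resources such as nvidia.com/gpu cannot be fractional.
        raise ValueError("GPU count must be a positive integer")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "OnFailure",
            "containers": [{
                "name": "training",
                "image": image,
                # For extended resources, setting the limit is sufficient:
                # the request defaults to the same value.
                "resources": {"limits": {"nvidia.com/gpu": str(gpus)}},
            }],
        },
    }


manifest = gpu_pod_manifest(
    "pytorch-training-job",
    "pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime",
    2,
)
print(json.dumps(manifest, indent=2))
```

The JSON output can be piped straight into `kubectl apply -f -`, since kubectl accepts JSON as well as YAML.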

Multi-Instance GPU (MIG) on A100/H100

MIG allows you to slice a single physical GPU into up to 7 independent instances, each with dedicated compute, memory, and cache, so one tenant cannot degrade another's performance. On an A100 40GB, the 1g.5gb profile yields 7 slices × 5 GB = 35 GB of usable memory; the remainder is not exposed to workloads.
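
The slice arithmetic generalizes to the other A100 40GB profiles. The sketch below is a worked example under that assumption; the table lists the maximum instance counts for a homogeneous layout on an A100 40GB, and other GPUs (such as the H100) use different profile names and sizes.

```python
# Profile name -> (max instances per GPU, memory per instance in GB),
# for an A100 40GB partitioned into identical instances.
A100_40GB_PROFILES = {
    "1g.5gb": (7, 5),
    "2g.10gb": (3, 10),
    "3g.20gb": (2, 20),
    "4g.20gb": (1, 20),
    "7g.40gb": (1, 40),
}


def usable_memory_gb(profile: str) -> int:
    """Total memory exposed to workloads when the GPU uses only this profile."""
    instances, memory_gb = A100_40GB_PROFILES[profile]
    return instances * memory_gb


for profile in A100_40GB_PROFILES:
    instances, _ = A100_40GB_PROFILES[profile]
    print(f"{profile}: {instances} instance(s), {usable_memory_gb(profile)} GB usable")
```

Note that smaller profiles are not always the best use of memory: three 2g.10gb instances expose only 30 GB of the 40 GB card.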

# Check MIG mode on the node
kubectl debug node/<node-name> -it --image=nvidia/cuda:12.1.0-base-ubuntu22.04 -- nvidia-smi -L

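# Set the MIG strategy in the GPU Operator Helm values.
# "single" exposes MIG devices as nvidia.com/gpu; "mixed" exposes them under
# profile-specific names such as nvidia.com/mig-1g.5gb.
helm upgrade gpu-operator nvdp/gpu-operator \
  --namespace gpu-operator \
  --set driver.enabled=false \
  --set toolkit.enabled=true \
  --set dcgmExporter.enabled=true \
  --set mig.strategy=mixed

# Ask the operator's MIG Manager to partition the node's GPUs into 1g.5gb slices
kubectl label nodes <node-name> nvidia.com/mig.config=all-1g.5gb --overwrite

# Pod requesting a MIG slice (1g.5gb = 1/7 of A100)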
apiVersion: v1
kind: Pod
metadata:
  name: inference-mig-slice
spec:
  containers:
  - name: inference
    image: nvcr.io/nvidia/tritonserver:24.04-py3
    resources:
      limits:
        nvidia.com/mig-1g.5gb: "1"   # MIG resource name

MIG is ideal for inference serving where you need predictable latency. For training jobs that saturate the GPU, use full GPUs: a MIG slice is limited to its fraction of compute and memory, and MIG instances cannot communicate with each other over NVLink, which rules out efficient multi-GPU training.

DCGM Exporter: GPU Metrics for Prometheus

DCGM (Data Center GPU Manager) exposes hardware-level metrics — GPU utilization, memory usage, temperature, power draw, NVLink throughput, and ECC errors. Without DCGM, you see pod-level resource usage but not what's actually happening inside the GPU.

The GPU Operator installs DCGM Exporter automatically, but you can also install it standalone:

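# The standalone chart lives in its own Helm repository
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace gpu-operator \
  --set serviceMonitor.enabled=true \
  --set serviceMonitor.interval=15s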

Key DCGM Metrics to Monitor

Metric                          Description                         Alert Threshold
DCGM_FI_DEV_GPU_UTIL            GPU compute utilization (%)         < 20% sustained = underutilized
DCGM_FI_DEV_FB_USED             Framebuffer (VRAM) used (MiB)       > 90% of total = OOM risk
DCGM_FI_DEV_POWER_USAGE         Current power draw (W)              > 90% of TDP = thermal risk
DCGM_FI_DEV_GPU_TEMP            GPU die temperature (°C)            > 83°C sustained = throttling
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL   Uncorrectable ECC error count       Any uncorrectable = replace GPU
DCGM_FI_PROF_NVLINK_RX_BYTES    NVLink receive throughput           Low = communication bottleneck
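
To make the thresholds concrete, here is a minimal sketch of how they might be evaluated against one scraped sample per GPU. The `GpuSample` fields and alert names are hypothetical; in production these rules would normally live in Prometheus alerting rules rather than application code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GpuSample:
    gpu_util_pct: float      # DCGM GPU utilization, percent
    fb_used_mib: float       # framebuffer used, MiB
    fb_total_mib: float      # framebuffer total, MiB
    power_w: float           # current power draw, watts
    tdp_w: float             # board power limit, watts
    temp_c: float            # die temperature, Celsius
    uncorrectable_ecc: int   # uncorrectable ECC error count


def evaluate(sample: GpuSample) -> List[str]:
    """Return the alert names from the table that this sample triggers."""
    alerts = []
    if sample.gpu_util_pct < 20:
        alerts.append("underutilized")
    if sample.fb_used_mib > 0.9 * sample.fb_total_mib:
        alerts.append("oom_risk")
    if sample.power_w > 0.9 * sample.tdp_w:
        alerts.append("power_high")
    if sample.temp_c > 83:
        alerts.append("throttling_risk")
    if sample.uncorrectable_ecc > 0:
        alerts.append("ecc_uncorrectable")
    return alerts


idle_but_full = GpuSample(10, 39000, 40960, 200, 400, 60, 0)
print(evaluate(idle_but_full))
```

A single sample cannot express "sustained"; a real rule would require the condition to hold over a time window before firing.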

Grafana Dashboard for GPU Nodes

A practical panel configuration for production GPU monitoring, packaged as a ConfigMap:

# grafana-gpu-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-dashboard
  namespace: monitoring
data:
  gpu-dashboard.json: |
    {
      "panels": [
        {
          "title": "GPU Utilization %",
          "type": "stat",
          "gridPos": {"h": 6, "w": 8, "x": 0, "y": 0},
          "targets": [
            {"expr": "DCGM_FI_DEV_GPU_UTIL", "legendFormat": "GPU {{gpu}}"}
          ],
          "fieldConfig": {
            "defaults": {
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {"color": "green", "value": null},
                  {"color": "yellow", "value": 50},
                  {"color": "red", "value": 90}
                ]
              }
            }
          }
        },
        {
          "title": "VRAM Usage (MB)",
          "type": "timeseries",
          "gridPos": {"h": 6, "w": 8, "x": 8, "y": 0},
          "targets": [
            {"expr": "DCGM_FI_DEV_FB_USED", "legendFormat": "GPU {{gpu}}"},
            {"expr": "DCGM_FI_DEV_FB_FREE", "legendFormat": "GPU {{gpu}} FREE"}
          ]
        },
        {
          "title": "Power Draw (W)",
          "type": "timeseries",
          "gridPos": {"h": 6, "w": 8, "x": 16, "y": 0},
          "targets": [
            {"expr": "DCGM_FI_DEV_POWER_USAGE", "legendFormat": "GPU {{gpu}}"}
          ]
        }
      ]
    }

The most important metric for FinOps: GPU utilization. If your A100s are running at under 30% sustained utilization, you're wasting $15,000+/year per GPU on idle capacity. The response isn't always "buy fewer GPUs" — batching, data loading pipeline optimization, and mixed-precision training can often raise utilization significantly without buying new hardware.
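
The cost of idle capacity is simple arithmetic once you have an hourly rate and a utilization figure. The sketch below assumes a hypothetical rate of $2.00 per GPU-hour purely for illustration; substitute your own contract or on-demand price.

```python
HOURS_PER_YEAR = 24 * 365  # 8760


def idle_cost_per_year(hourly_rate_usd: float, utilization: float) -> float:
    """Annual spend attributable to the unused fraction of one GPU."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return hourly_rate_usd * HOURS_PER_YEAR * (1.0 - utilization)


# Hypothetical inputs: $2.00/hour, 30% sustained utilization.
wasted = idle_cost_per_year(2.00, 0.30)
print(f"Idle cost per GPU per year: ${wasted:,.0f}")
```

Treating utilization as a linear proxy for value is a simplification: a GPU at 30% utilization may still be the bottleneck for a latency-sensitive service, so use the figure to prioritize investigation rather than as a verdict.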

GPU Scheduling Beyond Default Behavior

The Kubernetes scheduler places GPU pods on nodes with available resources, but the default behavior doesn't account for GPU memory fragmentation, NUMA topology, or multi-instance GPU slices. Understanding these nuances matters at scale.

GPU Resource Filtering

By default, Kubernetes treats GPUs as opaque resources — it knows a node has 4 GPUs but not which GPUs are partially used or whether they're on the same NUMA node as available CPU cores. For latency-sensitive inference workloads, NUMA alignment is critical: GPU-to-CPU data transfers across NUMA boundaries add 2-5μs latency, which compounds at high request rates.

# NUMA alignment happens inside a node, so node affinity cannot express it.
# It is enforced by the kubelet Topology Manager on each GPU node:
#   cpuManagerPolicy: static
#   topologyManagerPolicy: single-numa-node
# The pod must also be in the Guaranteed QoS class: requests equal to limits
# and a whole number of CPUs (the CPU and memory values here are examples).
apiVersion: v1
kind: Pod
metadata:
  name: latency-sensitive-inference
spec:
  containers:
  - name: inference
    image: nvcr.io/nvidia/tritonserver:24.04-py3
    resources:
      requests:
        cpu: "8"
        memory: "32Gi"
        nvidia.com/gpu: "1"
      limits:
        cpu: "8"
        memory: "32Gi"
        nvidia.com/gpu: "1"
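
To see which NUMA node a GPU is attached to, you can read it from sysfs on the host. This is a minimal sketch, assuming a Linux node and a PCI bus ID obtained from `nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader`; the helper names are illustrative.

```python
from pathlib import Path
from typing import Optional


def normalize_pci_address(bus_id: str) -> str:
    """Convert an nvidia-smi bus ID (8-digit domain, upper case) to sysfs form."""
    # sysfs uses a 4-digit domain: keep the last 12 characters, lower-cased.
    return bus_id.strip().lower()[-12:]


def pci_numa_node(pci_address: str,
                  sysfs_root: str = "/sys/bus/pci/devices") -> Optional[int]:
    """Return the NUMA node of a PCI device, or None if unknown."""
    path = Path(sysfs_root) / pci_address / "numa_node"
    try:
        value = int(path.read_text().strip())
    except (FileNotFoundError, ValueError):
        return None
    # The kernel reports -1 when the platform exposes no NUMA information.
    return value if value >= 0 else None


address = normalize_pci_address("00000000:3B:00.0")
print(address, pci_numa_node(address))
```

Comparing this value with the NUMA node of the CPUs assigned to the container (for example from `lscpu`) shows whether a pod is actually aligned.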

Time-Slicing GPUs for Multiple Small Workloads

For development and staging environments, you can time-slice a single GPU across multiple small pods. This is NOT recommended for production inference: time-slicing provides no memory or fault isolation between the pods sharing a GPU, so QoS suffers. It is useful, though, for maximizing utilization of development GPUs.

# Configure time-slicing via Device Plugin config
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4    # 4 pods can share 1 GPU (round-robin)
EOF

# Point the Device Plugin at this config
helm upgrade gpu-operator nvdp/gpu-operator \
  --namespace gpu-operator \
  --set driver.enabled=false \
  --set toolkit.enabled=true \
  --set dcgmExporter.enabled=true \
  --set devicePlugin.config.name=nvidia-device-plugin-config \
  --set devicePlugin.config.default=config.yaml
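
Once time-slicing is active, each node advertises more nvidia.com/gpu resources than it has physical GPUs. The following sketch shows the arithmetic only; the function name is hypothetical.

```python
def advertised_gpus(physical_gpus: int, replicas: int) -> int:
    """Number of nvidia.com/gpu resources a node advertises with time-slicing."""
    if physical_gpus < 0 or replicas < 1:
        raise ValueError("need physical_gpus >= 0 and replicas >= 1")
    return physical_gpus * replicas


# A 4-GPU node with replicas: 4 advertises 16 schedulable "GPUs".
print(advertised_gpus(4, 4))
```

Those 16 resources still share 4 devices and their memory, so a pod requesting one replica has no guaranteed share of compute or VRAM.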

Production Patterns: Training vs. Inference

GPU workloads fall into two fundamentally different operational categories, and your scheduling strategy should reflect that.

Training Jobs: Throughput-Optimized

Training jobs are batch-compute workloads. They saturate GPU memory with large batch sizes, run for hours to days, and the primary metric is samples-per-second throughput. Multi-GPU training uses NCCL for gradient synchronization across nodes.

# Distributed PyTorch training job (1 master + 3 workers, 8 GPUs each)
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: distributed-training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
          - name: pytorch
            image: registry.example.com/pytorch-training:2.3
            resources:
              limits:
                nvidia.com/gpu: "8"   # 8-GPU master node
            # The training operator injects MASTER_ADDR, MASTER_PORT, WORLD_SIZE,
            # and RANK into every replica, so they are not set by hand here.
          nodeSelector:
            nvidia.com/gpu.count: "8"   # Label set by GPU Feature Discovery
    Worker:
      replicas: 3
      template:
        spec:
          containers:
          - name: pytorch
            image: registry.example.com/pytorch-training:2.3
            resources:
              limits:
                nvidia.com/gpu: "8"

Key scheduling consideration for training: use pod affinity (podAffinity, not podAntiAffinity, which spreads pods apart) with a zone- or rack-level topologyKey to keep all training workers in the same node group, minimizing NCCL traffic over slow links. NCCL over bandwidth-constrained inter-node links is a common training throughput bottleneck.
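
The job above runs 4 pods with 8 GPUs each. Assuming the common layout of one training process per GPU (for example when launching with torchrun), the total process count and the rank of each process work out as in this sketch; the function names are illustrative.

```python
def world_size(masters: int, workers: int, gpus_per_pod: int) -> int:
    """Total number of training processes with one process per GPU."""
    return (masters + workers) * gpus_per_pod


def global_rank(node_rank: int, local_rank: int, gpus_per_pod: int) -> int:
    """Rank of a process given its pod index and its GPU index in the pod."""
    return node_rank * gpus_per_pod + local_rank


total = world_size(masters=1, workers=3, gpus_per_pod=8)
print(total)                                   # 32 processes in total
print(global_rank(0, 0, 8))                    # first GPU on the master pod
print(global_rank(3, 7, 8))                    # last GPU on the last worker
```

Every gradient synchronization step involves all 32 ranks, which is why the placement of the 4 pods relative to each other matters so much.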

Inference Serving: Latency-Optimized

Inference workloads serve requests in real-time. They need low latency, not maximum throughput, and they benefit from different optimizations: continuous batching, GPU memory pre-loading for model shards, and priority scheduling.

# vLLM inference deployment with GPU memory optimization
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
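spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-inference
  template:
    metadata:
      labels:
        app: vllm-inference   # Must match the Deployment and Service selectors
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest   # Pin a specific tag in production
        args:
        - "--model=meta-llama/Meta-Llama-3-70B-Instruct"
        - "--tensor-parallel-size=2"        # Split model across 2 GPUs
        - "--gpu-memory-utilization=0.90"  # vLLM may use 90% of each GPU (weights + KV cache)
        - "--max-num-batched-tokens=8192"  # Token budget per continuous-batching step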
        resources:
          limits:
            nvidia.com/gpu: "2"
        ports:
        - containerPort: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-inference-svc
spec:
  type: ClusterIP
  ports:
  - port: 80
    targetPort: 8000
  selector:
    app: vllm-inference
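
Before deploying a model of this size, it is worth checking that the weights fit with room left for the KV cache. The sketch below is a rough estimate under stated assumptions: FP16 weights (2 bytes per parameter), weights split evenly across tensor-parallel ranks, and a hypothetical 80 GB GPU. It ignores activations and CUDA context overhead, so the real headroom is smaller.

```python
def kv_cache_headroom_gb(params_billion: float,
                         bytes_per_param: float,
                         tensor_parallel: int,
                         gpu_mem_gb: float,
                         gpu_memory_utilization: float) -> float:
    """Approximate per-GPU memory left for KV cache after loading weights."""
    weights_per_gpu_gb = params_billion * bytes_per_param / tensor_parallel
    budget_gb = gpu_mem_gb * gpu_memory_utilization
    return budget_gb - weights_per_gpu_gb


# 70B parameters in FP16 on 80 GB GPUs at 0.90 utilization.
tp2 = kv_cache_headroom_gb(70, 2, 2, 80, 0.90)
tp4 = kv_cache_headroom_gb(70, 2, 4, 80, 0.90)
print(f"TP=2: {tp2:.1f} GB per GPU for KV cache")
print(f"TP=4: {tp4:.1f} GB per GPU for KV cache")
```

A negative or near-zero result means the configuration will fail to start or will serve very few concurrent requests; increasing the tensor-parallel size or using a quantized model are the usual remedies.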

Node Pools and GPU Resource Management

In multi-tenant or multi-workload clusters, GPU node pools prevent expensive GPU nodes from being consumed by non-GPU pods.

# node-pool-gpu-training.yaml
# Illustrative pseudo-config: core Kubernetes has no NodePool kind. Define the
# pool with your provider's mechanism (for example a GKE node pool, an EKS
# managed node group, or a Karpenter NodePool) and map these fields onto it.
name: gpu-training-pool
minNodes: 2
maxNodes: 10
scaleTarget:
  cpu: 80%
labels:
  node.kubernetes.io/gpu-pool: training
  gpu-type: A100-40GB
  gpu-count: "4"
taints:
- key: nvidia.com/gpu
  value: "true"
  effect: NoSchedule   # Only pods that tolerate this taint schedule here
# Tolerations are not part of the pool definition; they belong in the pod spec.

Then add the toleration to your training job pod spec:

spec:
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
  nodeSelector:
    node.kubernetes.io/gpu-pool: training

This setup prevents a runaway data preprocessing job from consuming GPU node hours when it doesn't actually need a GPU.
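
The interaction between taints and tolerations can be sketched in a few lines. This is a simplified model of the scheduler's matching rules, written for illustration; it covers the Exists and Equal operators and ignores tolerationSeconds.

```python
from typing import List


def tolerates(toleration: dict, taint: dict) -> bool:
    """True if a single toleration matches a single taint."""
    effect = toleration.get("effect", "")
    if effect and effect != taint.get("effect"):
        return False  # An empty effect on the toleration matches any effect.
    key = toleration.get("key", "")
    if toleration.get("operator", "Equal") == "Exists":
        # An empty key with Exists tolerates every taint.
        return key == "" or key == taint.get("key")
    return (key == taint.get("key")
            and toleration.get("value", "") == taint.get("value", ""))


def can_schedule(tolerations: List[dict], taints: List[dict]) -> bool:
    """True if every NoSchedule/NoExecute taint is tolerated."""
    blocking = [t for t in taints if t.get("effect") in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in tolerations)
               for taint in blocking)


gpu_taint = {"key": "nvidia.com/gpu", "value": "true", "effect": "NoSchedule"}
training_pod = [{"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}]
preprocessing_pod = []  # no tolerations

print(can_schedule(training_pod, [gpu_taint]))
print(can_schedule(preprocessing_pod, [gpu_taint]))
```

The taint keeps untolerated pods off the pool, while the nodeSelector steers training pods onto it; you need both, because a toleration alone only permits scheduling there and does not require it.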

FinOps: Cutting GPU Waste

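GPU compute is the most expensive compute in cloud infrastructure. An idle A100 costs on the order of $1.56/hour on demand (prices vary widely by provider, region, and instance type), which is roughly $13,700 per year for a GPU that sits idle around the clock. Here's how production teams cut that waste: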

Spot Instances for Fault-Tolerant Training

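Training jobs that checkpoint regularly can resume after an interruption, which makes them a good fit for spot capacity. Spot instances typically run 60-70% below on-demand prices, though the discount varies by instance type and region.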

# Karpenter provisioner for GPU spot instances
# (v1alpha5 API; Karpenter v1 replaces Provisioner with NodePool and EC2NodeClass)
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-training-spot
spec:
  requirements:
  - key: node.kubernetes.io/instance-type
    operator: In
    values:
    - g5.12xlarge    # 4x A10G GPUs per instance (g5.48xlarge has 8)
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["spot"]
  limits:
    resources:
      cpu: "64"      # Cap on total vCPUs this provisioner may launch
  ttlSecondsAfterEmpty: 300
  providerRef:
    name: default    # Assumed name of an existing AWSNodeTemplate

Right-Sizing with DCGM Utilization Data

After 2 weeks of DCGM metrics, many clusters show average GPU utilization under 50%. The response isn't always "fewer GPUs"; it's often fixing the bottleneck:

  • Low GPU util + high CPU util → data loading bottleneck, increase num_workers in DataLoader
  • Low GPU util + low CPU util → model is too small for the GPU, batch size too small
  • Low GPU util + high memory bandwidth → memory-bound kernels, try attention optimization (FlashAttention) to cut memory traffic
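
The triage above can be expressed as a small decision function. This is a sketch only: the 30% and 70% cut-offs are assumed values chosen for illustration, and the inputs are averages you would compute from DCGM and node metrics over the observation window.

```python
def diagnose(gpu_util: float, cpu_util: float, mem_bw_util: float,
             low: float = 30.0, high: float = 70.0) -> str:
    """Suggest a likely bottleneck from average utilization percentages."""
    if gpu_util >= low:
        return "GPU is reasonably busy; no obvious starvation"
    if cpu_util >= high:
        return "data loading bottleneck: increase DataLoader num_workers"
    if mem_bw_util >= high:
        return "memory-bound kernels: try FlashAttention or fused ops"
    return "GPU underfed: increase batch size or use a smaller GPU slice"


print(diagnose(gpu_util=15, cpu_util=95, mem_bw_util=20))
print(diagnose(gpu_util=15, cpu_util=20, mem_bw_util=85))
print(diagnose(gpu_util=15, cpu_util=20, mem_bw_util=20))
```

The checks run in priority order, so a job that is both CPU-starved and memory-bound is reported as a data loading problem first, since fixing the input pipeline usually changes the rest of the profile.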

Troubleshooting Common GPU Operator Issues

Issue: Pods stuck in Pending with GPU resource requests

# Diagnose: check if any node advertises nvidia.com/gpu
kubectl describe nodes | grep nvidia.com/gpu

# If 0 on all nodes, the Device Plugin isn't working
kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset
# If nvidia-smi on the node returns "No devices were found", the driver is not
# loaded: reboot the node after the driver install or reload the kernel module

Issue: DCGM Exporter not exposing metrics

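# Check DCGM Exporter pod logs (names below are those used by the GPU Operator;
# a standalone chart install uses "dcgm-exporter" instead)
kubectl logs -n gpu-operator -l app=nvidia-dcgm-exporter

# Verify metrics endpoint (the exporter runs as a DaemonSet, not a Deployment)
kubectl port-forward -n gpu-operator svc/nvidia-dcgm-exporter 9400:9400 &
curl -s localhost:9400/metrics | head

# Check Prometheus scrape target
kubectl get endpoints nvidia-dcgm-exporter -n gpu-operator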

Issue: GPU pods OOM but VRAM shows available

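This is usually a CUDA OOM vs. system OOM confusion. A CUDA out-of-memory error means the process exhausted GPU memory (often through fragmentation or another process sharing the device), while an OOMKilled pod means the container exceeded its host memory limit, regardless of how much VRAM is free. Kubernetes cannot limit VRAM per container: nvidia.com/gpu.memory is a node label, not a requestable resource, so use MIG when you need hard GPU memory partitions. To see which pod is using the memory, monitor DCGM_FI_DEV_FB_USED by the pod, namespace, and container labels that DCGM Exporter attaches to each metric:

# Confirm pod-level attribution is enabled on the exporter
kubectl get ds -n gpu-operator -o yaml | grep -A 1 DCGM_EXPORTER_KUBERNETES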

Summary

GPU operators on Kubernetes are mature enough for production, but the operational complexity is non-trivial. The key takeaways:

  1. GPU Operator manages the stack automatically — use it instead of installing components manually
  2. DCGM Exporter is non-negotiable for production — you can't FinOps what you can't measure
  3. MIG is for inference serving latency guarantees — use full GPUs for training throughput
  4. Spot instances + Karpenter cut GPU costs 60-70% for fault-tolerant training workloads
  5. NUMA alignment matters for latency-sensitive inference — don't ignore it at scale
  6. Node pools with taints prevent GPU waste on non-GPU workloads

The gap between "GPU cluster that runs" and "GPU cluster that's production-ready" is DCGM monitoring, proper node pool isolation, and right-sized resource requests. Get those three right and you'll be ahead of most teams running AI on Kubernetes.