Kubernetes GPU Scheduling for ML Workloads in 2026: A Practical Guide
Schedule GPUs in Kubernetes for ML training and inference. Covers time-slicing, node pooling, gang scheduling, device plugins, and the k8s-gpu-operator setup for production ML workloads.
Introduction
Running ML workloads on Kubernetes sounds simple until you need to schedule a multi-GPU training job across nodes, or guarantee latency for an inference endpoint while batch jobs run in the same cluster. By default, Kubernetes treats GPUs as opaque, indivisible resources. Getting real performance out of a GPU cluster requires explicit configuration for device selection, memory management, and workload isolation.
This guide covers the practical setup: the NVIDIA device plugin, time-slicing for overload scenarios, node affinity for GPU topology, gang scheduling for distributed training, and the common failure modes that eat your GPU budget.
The NVIDIA Device Plugin
Kubernetes does not natively understand GPUs. The NVIDIA Device Plugin advertises GPU resources to the scheduler:
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml
After installation, nodes advertise nvidia.com/gpu resources:
kubectl describe node | grep nvidia.com/gpu
Allocatable:
nvidia.com/gpu: 4
Request GPUs in pod specs like any other resource:
resources:
limits:
nvidia.com/gpu: 2
GPU Time-Slicing for Oversubscription
When you have more workloads than GPUs, time-slicing lets multiple pods share a GPU by interleaving their compute. This is common in development clusters or for batch inference workloads.
Configure time-slicing in the device plugin config:
version: v1
sharing:
timeSlicing:
resources:
- name: nvidia.com/gpu
replicas: 4
With replicas: 4, up to 4 pods can share each physical GPU. The NVIDIA driver handles context switching. Time-slicing works well for inference workloads with low memory footprints. For training jobs that need full GPU memory, do not oversubscribe.
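To take effect, the time-slicing settings have to reach the device plugin as a named config. A minimal sketch, assuming the plugin was installed via its Helm chart; the ConfigMap name time-slicing-config and the key name any are arbitrary choices for this example:

```yaml
# Hypothetical ConfigMap wrapping the time-slicing config shown above.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: kube-system
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
```

Point the Helm release at it (release and repo names assumed): helm upgrade nvdp nvdp/nvidia-device-plugin -n kube-system --set config.name=time-slicing-config. After the plugin restarts, each node should advertise 4x its physical GPU count.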
Check actual GPU utilization to validate sharing:
nvidia-smi
Tue Apr 11 02:30:00 2026
+-----------------------------------------------------------------------------+
| GPU 0  NVIDIA A100-SXM4-80GB     | Bus-Id 00000000:00:1B.0                  |
| GPU-Util 0%                      | Memory 8192MiB / 81920MiB                |
+-----------------------------------------------------------------------------+
Node Affinity for GPU Topology
Multi-GPU training performs best when tensors stay local to the node. Use nodeSelector or nodeAffinity to pin jobs to nodes with sufficient GPUs:
nodeSelector:
node.kubernetes.io/instance-type: g4dn.4xlarge
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: nvidia.com/gpu.count
operator: Gt
values:
- "3"
Node affinity supports Gt and Lt but no Gte, so Gt with "3" matches nodes labeled with 4 or more GPUs. For NCCL-based distributed training, avoid cross-node GPU communication when possible. Place all workers on the same node for 8-GPU training. For larger jobs that span nodes, use a topology-aware placement controller or explicit topologyKey constraints.
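The same-node constraint can also be expressed as pod affinity on the hostname topology key. A sketch, assuming the worker pods carry a job: distributed-training label (a label name chosen for this example):

```yaml
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          job: distributed-training  # example label; set it on your worker pods
      topologyKey: kubernetes.io/hostname  # all matching pods land on one node
```

This co-locates workers without hard-coding an instance type, but the first pod to schedule picks the node, so combine it with a GPU-count affinity rule like the one above.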
Gang Scheduling for Distributed Training
Distributed ML training jobs require all workers to start simultaneously. Kubernetes default scheduling can deadlock: Job A gets 3 of 4 required GPUs and waits for the fourth, while Job B holds that fourth GPU waiting for a third. Neither progresses. This is the gang scheduling problem.
Use the Coscheduling plugin (from the kubernetes-sigs scheduler-plugins project, built on the standard scheduling framework) or the Volcano scheduler to coordinate multi-pod job placement:
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
name: distributed-training
spec:
minAvailable: 8
schedulerName: volcano
tasks:
- replicas: 8
name: worker
template:
spec:
containers:
- name: train
image: pytorch/pytorch:2.2.0
resources:
limits:
nvidia.com/gpu: 1
nodeSelector:
accelerator: nvidia-tesla-a100
With gang scheduling, the entire job is not scheduled until all 8 GPUs are available. This trades immediate scheduling for guaranteed co-location.
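If you use the Coscheduling plugin instead of Volcano, the equivalent gang constraint is a PodGroup. A sketch based on the scheduler-plugins API (the group name is illustrative; worker pods opt in via the scheduling.x-k8s.io/pod-group label):

```yaml
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: distributed-training
spec:
  minMember: 8  # schedule no worker until all 8 can be placed together
```

Each worker pod then sets the pod-group label to distributed-training and schedulerName to the scheduler running the Coscheduling plugin.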
GPU Memory Management
Each GPU has fixed memory. Oversubscribing memory causes OOM kills that crash your pod and waste compute. Set memory limits explicitly:
resources:
limits:
nvidia.com/gpu: 1
memory: "60Gi"
requests:
memory: "40Gi" For transformer models, allocate based on model size: a 7B parameter model in FP16 needs roughly 14GB for weights, plus 4-8GB for activations and KV cache. Leave headroom. A 80GB A100 can comfortably run a 70B model in FP16; a 7B model fits easily on a 16GB T4.
Monitor actual memory usage per pod:
nvidia-smi --query-compute-apps=pid,used_memory --format=csv
Migrating from kube-gpu to the Device Plugin
The legacy kube-gpu scheduler was deprecated. If you have existing configurations, migrate to the NVIDIA Device Plugin + Coscheduler pattern. The new stack has better community support and works with the standard Kubernetes scheduling framework.
Migration steps:
- Deploy the NVIDIA Device Plugin DaemonSet on all GPU nodes
- Verify nvidia.com/gpu resources appear in node capacity
- Update pod specs from alpha.kubernetes.io/nvidia-gpu to nvidia.com/gpu
- Install the Coscheduler or Volcano for gang scheduling
- Test with a single GPU pod before rolling out to training jobs
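The final verification step can be a throwaway pod that requests one GPU and prints the nvidia-smi table. A sketch; the image tag is one example of a CUDA base image, not a requirement:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda-check
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]          # prints the GPU table and exits
    resources:
      limits:
        nvidia.com/gpu: 1            # scheduled only on a node with a free GPU
```

If kubectl logs gpu-smoke-test shows the GPU table, the device plugin and driver stack are working; delete the pod afterwards to release the GPU.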
Cost Optimization on GPU Nodes
GPU nodes are expensive. Right-size by workload type:
| Instance | GPU | Use case |
|---|---|---|
| g4dn.xlarge | 1x T4 (16GB) | Inference, small models, dev/test |
| g4dn.4xlarge | 1x T4 (16GB) | Batch inference, moderate throughput |
| p4d.24xlarge | 8x A100 (40GB) | Training, 7-13B models |
| p5.48xlarge | 8x H100 (80GB) | Large model training, RLHF |
Use Spot instances for batch training jobs that can checkpoint and resume. Configure pod disruption budgets so training jobs can restart on preemption without losing too much progress.
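A disruption budget for the 8-worker job above might look like this (the label selector is illustrative). Note that a PDB guards against voluntary evictions such as node drains, not Spot reclamation itself, so it complements checkpointing rather than replacing it:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: training-pdb
spec:
  minAvailable: 6                 # tolerate losing at most 2 of 8 workers at once
  selector:
    matchLabels:
      job: distributed-training   # example label on the worker pods
```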
Monitoring GPU Workloads
DCGM (Data Center GPU Manager) exporter gives you Prometheus metrics for GPU utilization, memory, temperature, and power:
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
--set serviceMonitor.enabled=true \
--namespace gpu-operator
Key metrics to track:
DCGM_FI_DEV_GPU_UTIL: GPU compute utilization percentage
DCGM_FI_DEV_FB_USED: Frame buffer memory used
DCGM_FI_DEV_GPU_TEMP: GPU temperature in Celsius
DCGM_FI_DEV_POWER_USAGE: Current power draw in watts
Alert when GPU utilization drops below 30% for sustained periods — it means your workload is memory-bound or I/O-bound, not compute-bound, and you are wasting expensive hardware.
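The 30% alert can be expressed as a PrometheusRule, assuming the Prometheus Operator is installed and scraping the DCGM exporter metrics above. Thresholds and windows are starting points, not tuned values:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-utilization-alerts
spec:
  groups:
  - name: gpu
    rules:
    - alert: GPUUnderutilized
      expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 30  # sustained low compute
      for: 30m
      labels:
        severity: warning
      annotations:
        summary: "GPU {{ $labels.gpu }} under 30% utilization for 30m"
```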
Conclusion
Kubernetes GPU scheduling requires explicit configuration to get real performance. The core stack is the NVIDIA Device Plugin for resource advertising, time-slicing for inference oversubscription, gang scheduling for distributed training, and DCGM for observability. Node affinity keeps multi-GPU jobs local to the node for NCCL performance. Cost optimization comes from matching instance types to workload shapes: T4 for inference, A100 for training, H100 for large-scale RLHF.