Kubernetes GPU Scheduling for ML Workloads in 2026: A Practical Guide
Schedule GPUs in Kubernetes for ML training and inference. Covers time-slicing, node pooling, gang scheduling, device plugins, and the k8s-gpu-operator setup for production ML workloads.
Introduction
Running ML workloads on Kubernetes sounds simple until you need to schedule a multi-GPU training job across nodes, or guarantee latency for an inference endpoint while batch jobs run in the same cluster. By default, Kubernetes treats GPUs as opaque, indivisible resources. Getting real performance out of a GPU cluster requires explicit configuration for device selection, memory management, and workload isolation.
This guide covers the practical setup: the NVIDIA device plugin, time-slicing for overload scenarios, node affinity for GPU topology, gang scheduling for distributed training, and the common failure modes that eat your GPU budget.
The NVIDIA Device Plugin
Kubernetes does not natively understand GPUs. The NVIDIA Device Plugin advertises GPU resources to the scheduler:
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml
After installation, nodes advertise nvidia.com/gpu resources:
kubectl describe node | grep nvidia.com/gpu
Allocatable:
nvidia.com/gpu: 4
Request GPUs in pod specs like any other resource:
resources:
limits:
nvidia.com/gpu: 2
GPU Time-Slicing for Oversubscription
When you have more workloads than GPUs, time-slicing lets multiple pods share a GPU by interleaving their compute. This is common in development clusters or for batch inference workloads.
Configure time-slicing in the device plugin config:
version: v1
sharing:
timeSlicing:
resources:
- name: nvidia.com/gpu
replicas: 4
With replicas: 4, up to 4 pods can share each physical GPU. The NVIDIA driver handles context switching. Time-slicing works well for inference workloads with low memory footprints. For training jobs that need full GPU memory, do not oversubscribe.
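To take effect, the time-slicing settings have to reach the device plugin as a named config. A minimal sketch, assuming the plugin was installed via its Helm chart; the ConfigMap name time-slicing-config and the key name any are arbitrary choices for this example:

```yaml
# Hypothetical ConfigMap wrapping the time-slicing config shown above.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: kube-system
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
```

Point the Helm release at it (release and repo names assumed): helm upgrade nvdp nvdp/nvidia-device-plugin -n kube-system --set config.name=time-slicing-config. After the plugin restarts, each node should advertise 4x its physical GPU count.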
Check actual GPU utilization to validate sharing:
nvidia-smi
Tue Apr 11 02:30:00 2026
+-----------------------------------------------------------------------------+
| GPU 0  NVIDIA A100-SXM4-80GB     | Bus-Id 00000000:00:1B.0                  |
| GPU-Util 0%                      | Memory 8192MiB / 81920MiB                |
+-----------------------------------------------------------------------------+
Node Affinity for GPU Topology
Multi-GPU training performs best when tensors stay local to the node. Use nodeSelector or nodeAffinity to pin jobs to nodes with sufficient GPUs:
nodeSelector:
node.kubernetes.io/instance-type: g4dn.4xlarge
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: nvidia.com/gpu.count
operator: Gt
values:
- "3"
Node affinity supports Gt and Lt but no Gte, so Gt with "3" matches nodes labeled with 4 or more GPUs. For NCCL-based distributed training, avoid cross-node GPU communication when possible. Place all workers on the same node for 8-GPU training. For larger jobs that span nodes, use a topology-aware placement controller or explicit topologyKey constraints.
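The same-node constraint can also be expressed as pod affinity on the hostname topology key. A sketch, assuming the worker pods carry a job: distributed-training label (a label name chosen for this example):

```yaml
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          job: distributed-training  # example label; set it on your worker pods
      topologyKey: kubernetes.io/hostname  # all matching pods land on one node
```

This co-locates workers without hard-coding an instance type, but the first pod to schedule picks the node, so combine it with a GPU-count affinity rule like the one above.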
Gang Scheduling for Distributed Training
Distributed ML training jobs require all workers to start simultaneously. Kubernetes default scheduling can deadlock: Job A gets 3 of 4 required GPUs and waits for the fourth, while Job B holds that fourth GPU waiting for a third. Neither progresses. This is the gang scheduling problem.
Use the Coscheduling plugin (from the kubernetes-sigs scheduler-plugins project, built on the standard scheduling framework) or the Volcano scheduler to coordinate multi-pod job placement:
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
name: distributed-training
spec:
minAvailable: 8
schedulerName: volcano
tasks:
- replicas: 8
name: worker
template:
spec:
containers:
- name: train
image: pytorch/pytorch:2.2.0
resources:
limits:
nvidia.com/gpu: 1
nodeSelector:
accelerator: nvidia-tesla-a100
With gang scheduling, the entire job is not scheduled until all 8 GPUs are available. This trades immediate scheduling for guaranteed co-location.
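If you use the Coscheduling plugin instead of Volcano, the equivalent gang constraint is a PodGroup. A sketch based on the scheduler-plugins API (the group name is illustrative; worker pods opt in via the scheduling.x-k8s.io/pod-group label):

```yaml
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: distributed-training
spec:
  minMember: 8  # schedule no worker until all 8 can be placed together
```

Each worker pod then sets the pod-group label to distributed-training and schedulerName to the scheduler running the Coscheduling plugin.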
GPU Memory Management
Each GPU has fixed memory. Oversubscribing memory causes OOM kills that crash your pod and waste compute. Set memory limits explicitly:
resources:
limits:
nvidia.com/gpu: 1
memory: "60Gi"
requests:
memory: "40Gi" For transformer models, allocate based on model size: a 7B parameter model in FP16 needs roughly 14GB for weights, plus 4-8GB for activations and KV cache. Leave headroom. A 80GB A100 can comfortably run a 70B model in FP16; a 7B model fits easily on a 16GB T4.
Monitor actual memory usage per pod:
nvidia-smi --query-compute-apps=pid,used_memory --format=csv
Migrating from kube-gpu to the Device Plugin
The legacy kube-gpu scheduler was deprecated. If you have existing configurations, migrate to the NVIDIA Device Plugin + Coscheduler pattern. The new stack has better community support and works with the standard Kubernetes scheduling framework.
Migration steps:
- Deploy the NVIDIA Device Plugin DaemonSet on all GPU nodes
- Verify nvidia.com/gpu resources appear in node capacity
- Update pod specs from alpha.kubernetes.io/nvidia-gpu to nvidia.com/gpu
- Install the Coscheduler or Volcano for gang scheduling
- Test with a single GPU pod before rolling out to training jobs
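The final verification step can be a throwaway pod that requests one GPU and prints the nvidia-smi table. A sketch; the image tag is one example of a CUDA base image, not a requirement:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda-check
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]          # prints the GPU table and exits
    resources:
      limits:
        nvidia.com/gpu: 1            # scheduled only on a node with a free GPU
```

If kubectl logs gpu-smoke-test shows the GPU table, the device plugin and driver stack are working; delete the pod afterwards to release the GPU.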
Cost Optimization on GPU Nodes
GPU nodes are expensive. Right-size by workload type:
| Instance | GPU | Use case |
|---|---|---|
| g4dn.xlarge | 1x T4 (16GB) | Inference, small models, dev/test |
| g4dn.4xlarge | 1x T4 (16GB) | Batch inference, moderate throughput |
| p4d.24xlarge | 8x A100 (40GB) | Training, 7-13B models |
| p5.48xlarge | 8x H100 (80GB) | Large model training, RLHF |
Use Spot instances for batch training jobs that can checkpoint and resume. Configure pod disruption budgets so training jobs can restart on preemption without losing too much progress.
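A disruption budget for the 8-worker job above might look like this (the label selector is illustrative). Note that a PDB guards against voluntary evictions such as node drains, not Spot reclamation itself, so it complements checkpointing rather than replacing it:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: training-pdb
spec:
  minAvailable: 6                 # tolerate losing at most 2 of 8 workers at once
  selector:
    matchLabels:
      job: distributed-training   # example label on the worker pods
```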
Monitoring GPU Workloads
DCGM (Data Center GPU Manager) exporter gives you Prometheus metrics for GPU utilization, memory, temperature, and power:
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
--set serviceMonitor.enabled=true \
--namespace gpu-operator
Key metrics to track:
DCGM_FI_DEV_GPU_UTIL: GPU compute utilization percentage
DCGM_FI_DEV_FB_USED: Frame buffer memory used
DCGM_FI_DEV_GPU_TEMP: GPU temperature in Celsius
DCGM_FI_DEV_POWER_USAGE: Current power draw in watts
Alert when GPU utilization drops below 30% for sustained periods — it means your workload is memory-bound or I/O-bound, not compute-bound, and you are wasting expensive hardware.
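The 30% alert can be expressed as a PrometheusRule, assuming the Prometheus Operator is installed and scraping the DCGM exporter metrics above. Thresholds and windows are starting points, not tuned values:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-utilization-alerts
spec:
  groups:
  - name: gpu
    rules:
    - alert: GPUUnderutilized
      expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 30  # sustained low compute
      for: 30m
      labels:
        severity: warning
      annotations:
        summary: "GPU {{ $labels.gpu }} under 30% utilization for 30m"
```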
Conclusion
Kubernetes GPU scheduling requires explicit configuration to get real performance. The core stack is the NVIDIA Device Plugin for resource advertising, time-slicing for inference oversubscription, gang scheduling for distributed training, and DCGM for observability. Node affinity keeps multi-GPU jobs local to the node for NCCL performance. Cost optimization comes from matching instance types to workload shapes: T4 for inference, A100 for training, H100 for large-scale RLHF.