
Kubernetes GPU Scheduling for ML Workloads in 2026: A Practical Guide

Schedule GPUs in Kubernetes for ML training and inference. Covers the NVIDIA device plugin, time-slicing for oversubscription, node affinity for GPU topology, gang scheduling for distributed training, and the memory, cost, and monitoring practices that keep production ML workloads efficient.

Introduction

Running ML workloads on Kubernetes sounds simple until you need to schedule a multi-GPU training job across nodes, or guarantee latency for an inference endpoint while batch jobs run in the same cluster. The default Kubernetes scheduler treats GPUs as opaque, indivisible resources. Getting real performance out of a GPU cluster requires explicit configuration for device selection, memory management, and workload isolation.

This guide covers the practical setup: the NVIDIA device plugin, time-slicing for overload scenarios, node affinity for GPU topology, gang scheduling for distributed training, and the common failure modes that eat your GPU budget.

The NVIDIA Device Plugin

Kubernetes does not natively understand GPUs. The NVIDIA Device Plugin, deployed as a DaemonSet, advertises GPU resources to the scheduler (the NVIDIA driver and container toolkit must already be installed on each GPU node):

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml

After installation, nodes advertise nvidia.com/gpu resources:

kubectl describe nodes | grep -A 8 "Allocatable:"
Allocatable:
  ...
  nvidia.com/gpu: 4

Request GPUs in pod specs like any other resource:

resources:
  limits:
    nvidia.com/gpu: 2
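
For context, here is what a complete manifest looks like; the pod name and image are placeholders chosen for illustration, and any CUDA-enabled image works the same way:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test                # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]         # prints the GPUs visible inside the container
      resources:
        limits:
          nvidia.com/gpu: 1           # extended resources are set as limits; requests default to match

If the pod stays Pending with an "Insufficient nvidia.com/gpu" event, the device plugin is not advertising GPUs on any schedulable node.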

GPU Time-Slicing for Oversubscription

When you have more workloads than GPUs, time-slicing lets multiple pods share a GPU by interleaving their compute. This is common in development clusters or for batch inference workloads.

Configure time-slicing in the device plugin config:

version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4

With replicas: 4, up to 4 pods can share each physical GPU; the NVIDIA driver handles context switching, but there is no memory or fault isolation between the sharing pods. Time-slicing works well for inference workloads with low memory footprints. For training jobs that need full GPU memory, do not oversubscribe.
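
The config itself is shipped to the plugin as a ConfigMap. A minimal sketch, assuming the plugin runs in kube-system and was installed from its Helm chart, which can be pointed at the ConfigMap through its config.name value (the ConfigMap and key names here are arbitrary):

apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config   # hypothetical name, referenced from the chart's config.name
  namespace: kube-system
data:
  time-slicing: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4

Once the plugin picks up the config, a 4-GPU node advertises 16 shareable nvidia.com/gpu slots instead of 4.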

Check actual GPU utilization to validate sharing:

nvidia-smi
Tue Apr 11 02:30:00 2026
+-----------------------------------------------------------------------------+
| GPU  Name                       Bus-Id          |         Memory-Usage      |
|   0  NVIDIA A100-SXM4-80GB      00000000:00:1B.0 |   8192MiB / 81920MiB     |
+-----------------------------------------------------------------------------+

Node Affinity for GPU Topology

Multi-GPU training performs best when GPU-to-GPU traffic stays inside a single node, over NVLink or PCIe, instead of crossing the network. Use nodeSelector or nodeAffinity to pin jobs to nodes with enough GPUs:

nodeSelector:
  node.kubernetes.io/instance-type: g4dn.12xlarge   # 4x T4 per node
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: nvidia.com/gpu.count
              operator: Gt          # Kubernetes has no Gte operator
              values:
                - "3"               # Gt "3" matches nodes with 4 or more GPUs

For NCCL-based distributed training, avoid cross-node GPU communication when possible. Place all workers on the same node for 8-GPU training. For larger jobs that span nodes, use a topology-aware placement controller or explicit topologyKey constraints.
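
For jobs defined as plain pod templates, one way to express that co-location requirement is pod affinity keyed on hostname. A sketch, assuming every worker in the job carries a shared label (the label name and value are made up for illustration):

affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            training-job: resnet-ddp         # hypothetical label on all workers of this job
        topologyKey: kubernetes.io/hostname  # every subsequent worker must land on the same node

The first worker lands wherever the scheduler picks; the affinity then pulls the rest onto that node, which only succeeds if it has enough free GPUs, so pair this with the gpu.count affinity above.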

Gang Scheduling for Distributed Training

Distributed ML training jobs require all workers to start simultaneously. Kubernetes default scheduling can deadlock: Job A gets 3 of 4 required GPUs and waits for the fourth, while Job B holds that fourth GPU waiting for a third. Neither progresses. This is the gang scheduling problem.

Use the coscheduling plugin from the kubernetes-sigs scheduler-plugins project (built on the Kubernetes scheduling framework) or the Volcano scheduler to coordinate multi-pod job placement:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: distributed-training
spec:
  minAvailable: 8
  schedulerName: volcano
  tasks:
    - replicas: 8
      name: worker
      template:
        spec:
          containers:
            - name: train
              image: pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime
              resources:
                limits:
                  nvidia.com/gpu: 1
          nodeSelector:
            accelerator: nvidia-tesla-a100

With gang scheduling, the entire job is not scheduled until all 8 GPUs are available. This trades immediate scheduling for guaranteed co-location.
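
If you prefer the coscheduling plugin over Volcano, the same all-or-nothing guarantee is expressed with a PodGroup. A minimal sketch, assuming the scheduler-plugins deployment named its scheduler scheduler-plugins-scheduler (the name depends on how you install it):

apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: distributed-training
spec:
  minMember: 8                # no worker is bound until all 8 can be placed

Each worker pod then carries the label scheduling.x-k8s.io/pod-group: distributed-training and sets spec.schedulerName to the coscheduling-enabled scheduler.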

GPU Memory Management

Each GPU has a fixed amount of memory, and Kubernetes does not schedule against it: the memory requests and limits in a pod spec govern host RAM, while GPU memory is allocated by the application and overcommitting it surfaces as CUDA out-of-memory errors. Either kind of exhaustion crashes the pod and wastes compute, so set host memory limits explicitly and size GPU memory per workload:

resources:
  limits:
    nvidia.com/gpu: 1
    memory: "60Gi"
  requests:
    memory: "40Gi"

For transformer models, allocate based on model size: a 7B parameter model in FP16 needs roughly 14GB for weights, plus 4-8GB for activations and KV cache, which is already a tight fit on a 16GB T4 without quantization. Leave headroom. An 80GB A100 runs a 7B or 13B model in FP16 comfortably, while a 70B model in FP16 (about 140GB of weights alone) needs multiple GPUs or quantization.

Monitor actual memory usage per pod:

nvidia-smi --query-compute-apps=pid,used_memory --format=csv

Migrating from kube-gpu to the Device Plugin

The legacy kube-gpu scheduler was deprecated. If you have existing configurations, migrate to the NVIDIA Device Plugin + Coscheduler pattern. The new stack has better community support and works with the standard Kubernetes scheduling framework.

Migration steps:

  1. Deploy the NVIDIA Device Plugin DaemonSet on all GPU nodes
  2. Verify nvidia.com/gpu resources appear in node capacity
  3. Update pod specs from alpha.kubernetes.io/nvidia-gpu to nvidia.com/gpu (see the before-and-after snippet below this list)
  4. Install the Coscheduler or Volcano for gang scheduling
  5. Test with a single GPU pod before rolling out to training jobs
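
The resource rename in step 3 is mechanical; a before-and-after sketch of the only part of the container spec that changes:

# Before: legacy alpha resource, removed from Kubernetes in 1.11
resources:
  limits:
    alpha.kubernetes.io/nvidia-gpu: 1

# After: device plugin resource
resources:
  limits:
    nvidia.com/gpu: 1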

Cost Optimization on GPU Nodes

GPU nodes are expensive. Right-size by workload type:

Instance        GPU                Use case
g4dn.xlarge     1x T4 (16GB)       Inference, small models, dev/test
g4dn.4xlarge    1x T4 (16GB)       Batch inference, moderate throughput
p4d.24xlarge    8x A100 (40GB)     Training, 7-13B models
p5.48xlarge     8x H100 (80GB)     Large model training, RLHF

Use Spot instances for batch training jobs that can checkpoint and resume. Add tolerations for your Spot node taints and checkpoint frequently so a preempted job restarts without losing much progress; pod disruption budgets protect against voluntary evictions such as node drains, but they cannot stop a Spot reclaim.
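
A sketch of steering such a job onto Spot capacity; the capacity-type label follows Karpenter's convention and the taint key is a made-up example, so match both to whatever your GPU node pools actually use:

nodeSelector:
  karpenter.sh/capacity-type: spot   # assumption: Karpenter-provisioned Spot nodes
tolerations:
  - key: "spot-gpu"                  # hypothetical taint on the Spot GPU node pool
    operator: "Exists"
    effect: "NoSchedule"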

Monitoring GPU Workloads

DCGM (Data Center GPU Manager) exporter gives you Prometheus metrics for GPU utilization, memory, temperature, and power:

helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --set serviceMonitor.enabled=true \
  --namespace gpu-operator

Key metrics to track:

  • DCGM_FI_DEV_GPU_UTIL: GPU compute utilization percentage
  • DCGM_FI_DEV_FB_USED: Frame buffer memory used
  • DCGM_FI_DEV_GPU_TEMP: GPU temperature in Celsius
  • DCGM_FI_DEV_POWER_USAGE: Current power draw in watts

Alert when GPU utilization drops below 30% for sustained periods; it usually means the workload is memory-bound or I/O-bound rather than compute-bound, and you are paying for idle hardware.
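
As a starting point, the alert can be a plain Prometheus rule over the DCGM metric; the rule name, window, and label names in the summary are illustrative and depend on your dcgm-exporter version:

groups:
  - name: gpu-utilization
    rules:
      - alert: GPUUnderutilized                              # hypothetical rule name
        expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 30  # sustained sub-30% utilization
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} under 30% for 30 minutes"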

Conclusion

Kubernetes GPU scheduling requires explicit configuration to get real performance. The core stack is the NVIDIA Device Plugin for resource advertising, time-slicing for inference oversubscription, gang scheduling for distributed training, and DCGM for observability. Node affinity keeps multi-GPU jobs local to the node for NCCL performance. Cost optimization comes from matching instance types to workload shapes: T4 for inference, A100 for training, H100 for large-scale RLHF.