Why GPU Monitoring Is Non-Negotiable for AI Inference
When you are running AI inference at scale, the GPU is your most expensive asset — and your biggest source of waste. A single H100 GPU costs $30,000-$40,000 to purchase or $2.50-$3.00/hour on-demand in the cloud. If that GPU sits at 15% utilization, you are burning money at $2.10/hour while delivering a fraction of the throughput you paid for.
The numbers are stark. Based on production data from StackPulse readers running inference workloads:
- Average GPU utilization across production AI inference deployments: 23-35% — meaning 65-77% of paid GPU time is wasted
- Median GPU memory utilization: 45-60% — memory capacity is often over-provisioned relative to actual VRAM needs
- Batch sizes routinely set too low — most inference servers default to conservative batch sizes that underfill GPU memory
The root cause is simple: most teams instrument model-level metrics (latency, throughput, error rates) but never instrument the GPU itself. Without GPU metrics, you cannot identify whether your inference server is GPU-bound, memory-bound, or CPU-bound. You are flying blind on your most expensive resource.
This guide covers the full GPU monitoring stack — from raw hardware metrics to Kubernetes-native GPU scheduling to cost optimization. By the end, you will have a monitoring setup that lets you answer: "Are we using our GPUs efficiently, and if not, why not?"
The GPU Metrics That Actually Matter
Not all GPU metrics are equally important. Here is what to track, ordered by diagnostic value.
Utilization Metrics
GPU Compute Utilization — percentage of time the GPU is actively executing compute operations. This is your primary signal for compute-bound workloads (matrix multiplications, transformer layers). A value below 50% on a consistently busy inference server usually indicates:
- CPU bottleneck (data loading, preprocessing)
- Memory bandwidth bottleneck (access patterns)
- Poor batching (not enough parallelism per SM)
GPU Memory Utilization — percentage of VRAM currently allocated. Critical for memory-bound workloads. If this sits at 95%+ while compute utilization is low, you have a memory-capacity bottleneck — consider quantizing the model (FP8 or INT8 instead of FP16/BF16), quantizing or paging the KV cache, or reducing batch size.
SM (Streaming Multiprocessor) Activity — percentage of SMs that are active. On some workloads (e.g., small batch sizes), not all SMs are engaged. Low SM activity with high memory utilization suggests the model is not parallelism-optimized for the GPU architecture.
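The metric combinations above can be folded into a simple triage heuristic. Here is a sketch — the thresholds are illustrative starting points chosen for this example, not NVIDIA guidance, and real diagnosis should also look at profiler traces:

```python
def classify_bottleneck(gpu_util: float, mem_bw_util: float,
                        vram_used_frac: float) -> str:
    """Rough triage from three utilization signals, all in [0, 1].

    Thresholds (0.95, 0.8, 0.5) are illustrative assumptions.
    """
    if vram_used_frac > 0.95 and gpu_util < 0.5:
        return "memory-capacity-bound: shrink batch or quantize"
    if mem_bw_util > 0.8 and gpu_util < 0.5:
        return "memory-bandwidth-bound: typical for transformer decode"
    if gpu_util < 0.5:
        return "likely CPU/input-bound or under-batched"
    return "compute-bound: GPU is doing real work"
```

Feed it `DCGM_FI_DEV_GPU_UTIL / 100`, `DCGM_FI_DEV_MEM_COPY_UTIL / 100`, and `FB_USED / (FB_USED + FB_FREE)` from the DCGM metrics covered later in this guide.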
Memory Metrics
VRAM Used / VRAM Total — absolute memory consumption. Track this alongside model size and batch size to understand memory headroom. If you have a 70B parameter model in FP16 (140GB), you need at minimum 2x A100 80GB cards with tensor parallelism.
Memory Bandwidth Utilization — percentage of theoretical memory bandwidth being used. On transformers, this is often the bottleneck, not compute. H100 has 3.35 TB/s memory bandwidth; if you are only using 1 TB/s, you are likely memory-bandwidth bound.
PCIe Bandwidth — for multi-GPU setups, monitor PCIe traffic to detect cross-GPU communication bottlenecks. NVLink eliminates much of this, but monitoring helps validate topology.
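The 70B-in-FP16 arithmetic above generalizes to a quick sizing helper. A minimal sketch — the `overhead` multiplier for KV cache and activations is an assumption for illustration, and real headroom depends on context length and batch size:

```python
import math

def weight_vram_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """VRAM needed for the weights alone (FP16/BF16 = 2 bytes per param)."""
    return params_billions * bytes_per_param  # 1e9 params x bytes -> GB

def min_gpus(params_billions: float, gpu_vram_gb: float = 80,
             bytes_per_param: int = 2, overhead: float = 1.2) -> int:
    """Minimum GPU count for tensor parallelism.

    `overhead` is a rough, assumed multiplier for KV cache and
    activations; with overhead=1.0 this reproduces the weights-only
    floor (70B FP16 -> 140GB -> 2x 80GB cards).
    """
    needed = weight_vram_gb(params_billions, bytes_per_param) * overhead
    return math.ceil(needed / gpu_vram_gb)
```

With the default 20% overhead, the same 70B model needs three 80GB cards rather than the weights-only minimum of two — which is why tracking actual VRAM headroom beats static sizing.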
Hardware Health Metrics
GPU Temperature — degrees Celsius. GPUs max out at 83°C (A100) or 87°C (H100) before thermal throttling kicks in, reducing clock speeds by 10-20% and hurting throughput.
Power Draw — watts consumed. Useful for cost attribution and capacity planning. H100 draws up to 700W under full load. Power draw versus utilization tells you efficiency: a GPU drawing 400W at only 30% utilization is paying well over half of peak power for a fraction of peak throughput.
ECC Errors (Correctable and Uncorrectable) — GPU memory errors. Uncorrectable ECC errors cause crashes. Monitor these especially on aging hardware.
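The power-versus-utilization relationship can be reduced to a single ratio. A sketch of one possible metric (the definition here is an assumption for illustration, not a standard NVIDIA figure):

```python
def power_efficiency(gpu_util_pct: float, power_w: float,
                     tdp_w: float = 700) -> float:
    """Utilization delivered per fraction of TDP drawn.

    ~1.0 means power tracks work done; well below 1.0 means you are
    paying the power cost of busy silicon for mostly idle compute.
    Default TDP of 700W assumes an H100 SXM.
    """
    return (gpu_util_pct / 100) / (power_w / tdp_w)
```

The example from above — 30% utilization at 400W on a 700W part — scores about 0.53, a signal to investigate batching before buying more GPUs.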
Encoder/Decoder Utilization
For multimodal models (vision encoders, audio codecs), track NVENC (NVIDIA Encoder) and NVDEC (NVIDIA Decoder) utilization if your inference pipeline includes video or image encoding/decoding.
NVIDIA GPU Architecture for ML Engineers
Understanding GPU architecture helps you interpret metrics correctly. The key concepts for 2026 inference deployments:
A100 vs H100 vs H200 vs B200
| GPU | VRAM | Memory BW | FP16 Compute | TDP | Best For |
|---|---|---|---|---|---|
| A100 80GB | 80GB HBM2e | 2TB/s | 312 TFLOPS | 400W | General inference, older deployments |
| H100 SXM5 | 80GB HBM3 | 3.35TB/s | 989 TFLOPS | 700W | Current-gen LLM inference |
| H200 | 141GB HBM3e | 4.8TB/s | 989 TFLOPS | 700W | Large model inference (70B+), RAG |
| B200 | 192GB HBM3e | 8TB/s | ~2,250 TFLOPS | 1,000W | Largest models, Blackwell architecture |
For most AI inference workloads in 2026, H100 SXM5 or H200 are the standard. The jump from A100 to H100 is primarily a memory bandwidth increase (2TB/s to 3.35TB/s), which matters enormously for transformer models that are memory-bandwidth bound.
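The compute-versus-bandwidth tradeoff in the table can be made concrete with a roofline "ridge point" calculation — the arithmetic intensity (FLOPs per byte moved) below which a workload is bandwidth-bound. A sketch using the table's dense FP16 numbers:

```python
def ridge_point_flops_per_byte(peak_tflops: float, mem_bw_tbs: float) -> float:
    """Roofline ridge point: minimum FLOPs per byte of memory traffic
    needed before a kernel becomes compute-bound rather than
    memory-bandwidth-bound.  TFLOPS / (TB/s) = FLOPs/byte."""
    return peak_tflops / mem_bw_tbs

# H100: 989 / 3.35  -> ~295 FLOPs/byte
# A100: 312 / 2.0   -> ~156 FLOPs/byte
# Batch-1 LLM decode streams every weight once per token (~2 FLOPs/byte),
# far below either ridge -> firmly memory-bandwidth bound.
```

This is why the A100-to-H100 bandwidth jump matters more than the FLOPS jump for decode-heavy inference: both GPUs leave most of their compute idle at low arithmetic intensity.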
MIG (Multi-Instance GPU)
MIG allows you to partition a single physical GPU into up to 7 independent instances, each with dedicated VRAM and compute resources. This is critical for cost optimization — instead of running one small model on an expensive GPU and wasting 70% of resources, you partition the GPU and run 3-4 models simultaneously.
```bash
# List available GPU instance profiles
nvidia-smi mig -lgip

# Partition an A100 80GB into four 1g.10gb instances (profile ID 19)
nvidia-smi mig -cgi 19,19,19,19 -C
```

Each MIG instance appears as a separate GPU to CUDA — your inference server just sees 4 GPUs. Monitor each MIG instance independently via its device index.
MIG monitoring caveat: Not all metrics are visible at the MIG partition level. Compute utilization and power are per-instance, but memory bandwidth is shared. Use MIG carefully and validate that your monitoring stack supports MIG-level granularity (DCGM does; some Prometheus exporters do not).
vGPU (Virtual GPU)
vGPU is NVIDIA's virtualization technology for sharing a physical GPU across multiple VMs, used for virtual desktops and fractional-GPU cloud offerings. Note that the large dedicated GPU instances from AWS (p4d.24xlarge, p5.48xlarge), Google Cloud (A2-HighGPU), and Azure (ND A100 v4) generally expose whole GPUs to the VM via PCIe passthrough; vGPU is what providers use when a single physical GPU must back multiple smaller instances.
vGPU monitoring works through the NVIDIA vGPU Manager — physical GPU metrics are aggregated across all vGPU instances running on that physical card. Monitoring vGPU performance requires DCGM with vGPU support and per-vGPU metrics via the vGPU Manager API.
Monitoring Stack: The Open-Source Toolkit
DCGM (Data Center GPU Manager)
DCGM is NVIDIA's official monitoring tool for data center GPUs. It provides GPU health monitoring, performance metrics, and telemetry export to Prometheus — with no instrumentation required in your application code.
The fastest way to get DCGM metrics into Prometheus:
```bash
# Deploy the DCGM exporter as a DaemonSet
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring \
  --create-namespace

# Verify metrics are flowing
kubectl exec -n monitoring ds/dcgm-exporter -- \
  curl localhost:9400/metrics | grep DCGM_FI
```

Key DCGM metrics:

- `DCGM_FI_DEV_GPU_UTIL` — GPU utilization %
- `DCGM_FI_DEV_MEM_COPY_UTIL` — memory bandwidth utilization %
- `DCGM_FI_DEV_FB_USED` — frame buffer (VRAM) used in MB
- `DCGM_FI_DEV_FB_FREE` — frame buffer free in MB
- `DCGM_FI_DEV_POWER_USAGE` — current power draw in watts
- `DCGM_FI_DEV_GPU_TEMP` — temperature in Celsius
- `DCGM_FI_DEV_XID_ERRORS` — Xid error count (GPU hardware errors)
nvidia-smi (NVIDIA System Management Interface)
nvidia-smi is the CLI tool available on every NVIDIA GPU node. For one-off diagnostics and quick checks:
```bash
# Live monitoring
watch -n 1 nvidia-smi

# Query specific metrics
nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.used,memory.total,temperature.gpu,power.draw \
  --format=csv

# Check for Xid errors (GPU hardware errors)
nvidia-smi -q -x | grep -i xid
```

node-exporter + textfile collector
For Kubernetes nodes, the standard Prometheus node-exporter runs as a DaemonSet for system-level metrics. For GPU-specific monitoring, use DCGM exporter as the primary method — it handles all GPU metrics natively without requiring custom shell scripts or textfile collectors. The Helm deployment above (dcgm-exporter) is the recommended approach for Kubernetes clusters.
Kubernetes GPU Scheduling and Monitoring
NVIDIA Device Plugin
The NVIDIA device plugin for Kubernetes exposes GPU resources to pods. Install via Helm:
```bash
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm install nvidia-device-plugin nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace
```

Once installed, request GPUs in pod specs:
```yaml
resources:
  limits:
    nvidia.com/gpu: 2  # request 2 GPUs
```

Monitoring GPUs in Kubernetes
For Kubernetes-native GPU monitoring, use DCGM exporter as a DaemonSet with the Prometheus Operator. The DCGM exporter connects to the NVIDIA device plugin and exposes GPU metrics on port 9400 for Prometheus scraping.
Critical Kubernetes labels for GPU nodes (discovered by GPU Feature Discovery):
- `nvidia.com/gpu.product` — GPU model (e.g., A100-SXM4-80GB)
- `nvidia.com/gpu.memory` — total VRAM
- `topology.kubernetes.io/region` — for cross-region capacity planning
Topology-Aware Scheduling
On multi-GPU nodes (8-GPU servers are common), GPU-to-GPU communication topology matters enormously for distributed inference. On an 8-GPU A100 node:
- NVLink: intra-node GPU-to-GPU bandwidth is ~600 GB/s (vs 32 GB/s PCIe)
- NCCL: handles multi-GPU collective communication
For production multi-GPU inference, validate that your orchestrator is placing pods on GPUs with NVLink connectivity. Tools like the Kubernetes topology-manager and custom schedulers (Volcano, YuniKorn) can enforce topology-aware placement.
```bash
# Check NVLink status
nvidia-smi topo -m
```

Multi-GPU and Distributed Inference Monitoring
Running distributed inference across multiple GPUs (tensor parallelism, pipeline parallelism) introduces monitoring complexity that single-GPU setups do not have.
NCCL Topology Monitoring
NCCL is used by all major inference servers (vLLM, TensorRT-LLM, SGLang) for multi-GPU communication. NCCL performance depends heavily on topology:
- Intra-node NVLink: ~600 GB/s bidirectional bandwidth per link
- Inter-node InfiniBand NDR: ~400 Gbps (~50 GB/s) — roughly 12x slower than NVLink
Monitor for NCCL communication time vs compute time. If NCCL AllReduce is taking 30%+ of your iteration time, you have a communication bottleneck.
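A back-of-envelope model helps set expectations before you profile. The bandwidth term of a ring AllReduce has each GPU send and receive 2(N-1)/N times the payload over its slowest link; this sketch ignores the latency term and assumes the bandwidth figures quoted above:

```python
def ring_allreduce_seconds(payload_bytes: float, n_gpus: int,
                           link_gbs: float) -> float:
    """Bandwidth-only estimate of a ring AllReduce.

    Each of the N ranks moves 2*(N-1)/N times the payload over its
    link; latency and protocol overheads are ignored (assumption).
    """
    traffic = 2 * (n_gpus - 1) / n_gpus * payload_bytes
    return traffic / (link_gbs * 1e9)

# 100 MB of activations across 8 GPUs:
nvlink_ms = ring_allreduce_seconds(100e6, 8, 600) * 1e3  # intra-node NVLink
ib_ms     = ring_allreduce_seconds(100e6, 8, 50) * 1e3   # ~400 Gbps InfiniBand
```

The same transfer is roughly 12x slower over InfiniBand than NVLink, which is why crossing node boundaries with tensor parallelism usually shows up immediately as NCCL time in your profile.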
Tensor Parallelism Health
In tensor parallelism, model weights are sharded across GPUs, requiring constant AllReduce operations. Monitor:
- Per-GPU memory utilization (should be roughly equal if sharding is balanced)
- AllReduce latency per layer (indicates communication bottlenecks)
- GPU-to-GPU bandwidth utilization via NVLink
If GPU 0 has significantly higher memory utilization than GPUs 1-7 in a tensor-parallel setup, you have an unbalanced shard — performance will be bottlenecked by the straggler GPU.
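The balance check above is easy to automate from per-GPU VRAM readings. A minimal sketch — the 10% threshold is an assumed starting point, not a hard rule:

```python
def shard_imbalance(per_gpu_used_bytes: list[float]) -> float:
    """Relative spread of VRAM use across tensor-parallel ranks:
    (max - min) / mean.  Near 0 means balanced sharding; values
    above ~0.1 (assumed threshold) are worth investigating."""
    mean = sum(per_gpu_used_bytes) / len(per_gpu_used_bytes)
    return (max(per_gpu_used_bytes) - min(per_gpu_used_bytes)) / mean
```

Run it against `DCGM_FI_DEV_FB_USED` per device (or pynvml's `nvmlDeviceGetMemoryInfo` across device indices) and alert when the ratio drifts.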
Cost Optimization: Maximize GPU Utilization
GPU monitoring only delivers value if you act on the data. The most impactful optimizations:
1. Dynamic Batching Based on GPU Metrics
Most inference servers use static batch sizes. A smarter approach uses GPU memory utilization as the signal for batch sizing:
```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def get_optimal_batch_size(current_batch: int,
                           target_memory_util: float = 0.80) -> int:
    """Adjust batch size based on GPU memory utilization."""
    mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    memory_util = mem_info.used / mem_info.total
    if memory_util < target_memory_util - 0.10:
        return min(current_batch + 2, 64)   # headroom available: scale up
    elif memory_util > target_memory_util + 0.05:
        return max(current_batch - 1, 1)    # near the limit: scale down
    return current_batch
```

This is a simplified version — production implementations (like vLLM's automatic batch scheduling) do more sophisticated memory management. The key principle: use GPU memory headroom as the signal, not just queue depth.
2. MIG Partitioning for Cost Efficiency
If you are running multiple small models, MIG partitioning can cut GPU costs by 3-4x by sharing a single physical GPU across multiple inference workloads:
```yaml
# Kubernetes: request a MIG instance (mixed MIG strategy)
resources:
  limits:
    nvidia.com/mig-1g.10gb: 1  # one 1g.10gb slice of an A100 80GB
```

Monitor each MIG slice independently to ensure they are not oversubscribed.
3. GPU-Utilization-Based Autoscaling
Scale inference replicas based on GPU utilization, not just request queue length:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-server-gpu-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
    # HPA Resource metrics only support cpu/memory, so GPU utilization
    # must be exposed as a custom Pods metric (e.g., DCGM_FI_DEV_GPU_UTIL
    # surfaced via Prometheus Adapter)
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL
        target:
          type: AverageValue
          averageValue: "75"  # scale out when average GPU util > 75%
```

Alerting: What to Watch For
Set up these alerts for GPU health and efficiency:
Critical (PagerDuty-level)
- GPU Xid Error: `DCGM_FI_DEV_XID_ERRORS > 0` — indicates a GPU hardware error that will cause crashes
- GPU Temperature > 83C (A100) / 87C (H100) — thermal throttling active, throughput degraded
- Uncorrectable ECC Error — GPU memory failing, leads to crashes
- GPU Power Draw > 95% TDP sustained — compute-saturated, monitor for throttling
Warning (Slack-level)
- GPU Utilization < 25% sustained — GPU is idle, investigate batching or scale workloads
- GPU Memory Utilization > 90% — risk of OOM, consider reducing batch size
- GPU Memory Utilization < 30% — memory over-provisioned, can fit larger batch or smaller GPU
- Power Efficiency < 30%: the `gpu_utilization / power_draw` ratio is poor — batch size too low
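The critical alerts above translate directly into Prometheus alerting rules against the DCGM exporter metrics. A sketch — the group name, severity labels, and `for` durations are placeholders to adapt to your routing:

```yaml
groups:
  - name: gpu-health  # placeholder group name
    rules:
      - alert: GPUXidError
        expr: DCGM_FI_DEV_XID_ERRORS > 0
        labels: {severity: critical}
        annotations:
          summary: "GPU hardware (Xid) error detected"
      - alert: GPUThermalThrottleA100
        expr: DCGM_FI_DEV_GPU_TEMP > 83
        for: 5m
        labels: {severity: critical}
      - alert: GPUUnderutilized
        expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 25
        for: 30m
        labels: {severity: warning}
```

Keep the temperature threshold per GPU model (83C for A100, 87C for H100) rather than one global rule.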
Prometheus Queries for GPU Dashboards
```promql
# GPU Utilization (%) — DCGM_FI_DEV_GPU_UTIL is a gauge, so smooth
# with avg_over_time, not rate() (which is for counters)
avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m]))

# GPU Memory Bandwidth Utilization (%)
avg(avg_over_time(DCGM_FI_DEV_MEM_COPY_UTIL[5m]))

# VRAM Used vs Total
DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_FREE + DCGM_FI_DEV_FB_USED)

# Average GPU Temperature
avg(DCGM_FI_DEV_GPU_TEMP)

# GPU Power Draw (watts)
DCGM_FI_DEV_POWER_USAGE

# Per-pod GPU utilization
avg by (exported_pod) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m]))

# Alert: GPU thermal throttling (A100 threshold)
DCGM_FI_DEV_GPU_TEMP > 83
```

Grafana Dashboard Templates
For a ready-made GPU dashboard, import the "NVIDIA DCGM Exporter" dashboard from Grafana Labs (Dashboard ID: 12239). It covers GPU utilization over time, memory utilization, temperature and power, ECC errors and Xid errors, and GPU clock speeds.
For Kubernetes-native GPU monitoring, use the "NVIDIA GPU Dashboard for Kubernetes" (Grafana Labs, Dashboard ID: 13770) which correlates GPU metrics with Kubernetes pod assignments.
If you prefer to build your own, start with these panels:
- GPU Utilization (time series, per-GPU): `DCGM_FI_DEV_GPU_UTIL`
- VRAM Pressure (gauge, %): `DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_FREE + DCGM_FI_DEV_FB_USED) * 100`
- Temperature (time series, with alert threshold line): `DCGM_FI_DEV_GPU_TEMP`
- Power Draw (time series, watts): `DCGM_FI_DEV_POWER_USAGE`
- Xid Errors (stat panel, count): `DCGM_FI_DEV_XID_ERRORS`
The Bottom Line
GPU monitoring is the foundation of inference efficiency. Without it, you are guessing whether your GPU budget is well-spent. The key metrics to track continuously:
- GPU Utilization — is the GPU doing work or sitting idle?
- VRAM Utilization — are you fitting the right batch size?
- Temperature — thermal throttling silently kills performance
- Power Efficiency — watts consumed per token generated
- NCCL Communication (multi-GPU) — is cross-GPU communication bottlenecking throughput?
The monitoring stack is straightforward: DCGM + Prometheus + Grafana gives you full observability with open-source tooling. The ROI is concrete — every 10% improvement in GPU utilization on an H100 paying $3/hour in cloud costs saves $0.30/hour per GPU, or $2,628/GPU/year.
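The ROI arithmetic generalizes to any rate and utilization gain. A trivial sketch (on-demand rate and gain are inputs you supply):

```python
def annual_savings_per_gpu(util_gain: float, hourly_rate: float) -> float:
    """Dollars of effective capacity recovered per GPU per year.

    util_gain is the utilization improvement as a fraction
    (0.10 = 10 percentage points); hourly_rate is on-demand $/hr.
    """
    return util_gain * hourly_rate * 24 * 365

annual_savings_per_gpu(0.10, 3.00)  # ~= 2628, matching the figure above
```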