Why GPU Monitoring Is Non-Negotiable for AI Inference
When you are running AI inference at scale, the GPU is your most expensive asset — and your biggest source of waste. A single H100 GPU costs $30,000-$40,000 to purchase or $2.50-$3.00/hour on-demand in the cloud. If that GPU sits at 15% utilization, you are burning money at $2.10/hour while delivering a fraction of the throughput you paid for.
The numbers are stark. Based on production data from StackPulse readers running inference workloads:
- Average GPU utilization across production AI inference deployments: 23-35% — meaning 65-77% of paid GPU time is wasted
- Median GPU memory utilization: 45-60% — memory capacity is often over-provisioned relative to actual VRAM needs
- Batch sizes routinely set too low — most inference servers default to conservative batch sizes that underfill GPU memory
The root cause is simple: most teams instrument model-level metrics (latency, throughput, error rates) but never instrument the GPU itself. Without GPU metrics, you cannot identify whether your inference server is GPU-bound, memory-bound, or CPU-bound. You are flying blind on your most expensive resource.
This guide covers the full GPU monitoring stack — from raw hardware metrics to Kubernetes-native GPU scheduling to cost optimization. By the end, you will have a monitoring setup that lets you answer: "Are we using our GPUs efficiently, and if not, why not?"
The GPU Metrics That Actually Matter
Not all GPU metrics are equally important. Here is what to track, ordered by diagnostic value.
Utilization Metrics
GPU Compute Utilization — percentage of time the GPU is actively executing compute operations. This is your primary signal for compute-bound workloads (matrix multiplications, transformer layers). A value below 50% on a consistently busy inference server usually indicates:
- CPU bottleneck (data loading, preprocessing)
- Memory bandwidth bottleneck (access patterns)
- Poor batching (not enough parallelism per SM)
GPU Memory Utilization — percentage of VRAM currently allocated. Critical for memory-bound workloads. If this sits at 95%+ while compute utilization is low, you have a memory-capacity bottleneck — consider quantizing the model (FP8 or INT8 instead of FP16/BF16), quantizing or paging the KV cache, or reducing batch size.
SM (Streaming Multiprocessor) Activity — percentage of SMs that are active. On some workloads (e.g., small batch sizes), not all SMs are engaged. Low SM activity with high memory utilization suggests the model is not parallelism-optimized for the GPU architecture.
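The metric combinations above can be folded into a simple triage heuristic. Here is a sketch — the thresholds are illustrative starting points chosen for this example, not NVIDIA guidance, and real diagnosis should also look at profiler traces:

```python
def classify_bottleneck(gpu_util: float, mem_bw_util: float,
                        vram_used_frac: float) -> str:
    """Rough triage from three utilization signals, all in [0, 1].

    Thresholds (0.95, 0.8, 0.5) are illustrative assumptions.
    """
    if vram_used_frac > 0.95 and gpu_util < 0.5:
        return "memory-capacity-bound: shrink batch or quantize"
    if mem_bw_util > 0.8 and gpu_util < 0.5:
        return "memory-bandwidth-bound: typical for transformer decode"
    if gpu_util < 0.5:
        return "likely CPU/input-bound or under-batched"
    return "compute-bound: GPU is doing real work"
```

Feed it `DCGM_FI_DEV_GPU_UTIL / 100`, `DCGM_FI_DEV_MEM_COPY_UTIL / 100`, and `FB_USED / (FB_USED + FB_FREE)` from the DCGM metrics covered later in this guide.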
Memory Metrics
VRAM Used / VRAM Total — absolute memory consumption. Track this alongside model size and batch size to understand memory headroom. If you have a 70B parameter model in FP16 (140GB), you need at minimum 2x A100 80GB cards with tensor parallelism.
Memory Bandwidth Utilization — percentage of theoretical memory bandwidth being used. On transformers, this is often the bottleneck, not compute. H100 has 3.35 TB/s memory bandwidth; if you are only using 1 TB/s, you are likely memory-bandwidth bound.
PCIe Bandwidth — for multi-GPU setups, monitor PCIe traffic to detect cross-GPU communication bottlenecks. NVLink eliminates much of this, but monitoring helps validate topology.
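The 70B-in-FP16 arithmetic above generalizes to a quick sizing helper. A minimal sketch — the `overhead` multiplier for KV cache and activations is an assumption for illustration, and real headroom depends on context length and batch size:

```python
import math

def weight_vram_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """VRAM needed for the weights alone (FP16/BF16 = 2 bytes per param)."""
    return params_billions * bytes_per_param  # 1e9 params x bytes -> GB

def min_gpus(params_billions: float, gpu_vram_gb: float = 80,
             bytes_per_param: int = 2, overhead: float = 1.2) -> int:
    """Minimum GPU count for tensor parallelism.

    `overhead` is a rough, assumed multiplier for KV cache and
    activations; with overhead=1.0 this reproduces the weights-only
    floor (70B FP16 -> 140GB -> 2x 80GB cards).
    """
    needed = weight_vram_gb(params_billions, bytes_per_param) * overhead
    return math.ceil(needed / gpu_vram_gb)
```

With the default 20% overhead, the same 70B model needs three 80GB cards rather than the weights-only minimum of two — which is why tracking actual VRAM headroom beats static sizing.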
Hardware Health Metrics
GPU Temperature — degrees Celsius. GPUs max out at 83°C (A100) or 87°C (H100) before thermal throttling kicks in, reducing clock speeds by 10-20% and hurting throughput.
Power Draw — watts consumed. Useful for cost attribution and capacity planning. H100 draws up to 700W under full load. Power draw versus utilization tells you efficiency: a GPU drawing 400W at only 30% utilization is paying well over half of peak power for a fraction of peak throughput.
ECC Errors (Correctable and Uncorrectable) — GPU memory errors. Uncorrectable ECC errors cause crashes. Monitor these especially on aging hardware.
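The power-versus-utilization relationship can be reduced to a single ratio. A sketch of one possible metric (the definition here is an assumption for illustration, not a standard NVIDIA figure):

```python
def power_efficiency(gpu_util_pct: float, power_w: float,
                     tdp_w: float = 700) -> float:
    """Utilization delivered per fraction of TDP drawn.

    ~1.0 means power tracks work done; well below 1.0 means you are
    paying the power cost of busy silicon for mostly idle compute.
    Default TDP of 700W assumes an H100 SXM.
    """
    return (gpu_util_pct / 100) / (power_w / tdp_w)
```

The example from above — 30% utilization at 400W on a 700W part — scores about 0.53, a signal to investigate batching before buying more GPUs.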
Encoder/Decoder Utilization
For multimodal models (vision encoders, audio codecs), track NVENC (NVIDIA Encoder) and NVDEC (NVIDIA Decoder) utilization if your inference pipeline includes video or image encoding/decoding.
NVIDIA GPU Architecture for ML Engineers
Understanding GPU architecture helps you interpret metrics correctly. The key concepts for 2026 inference deployments:
A100 vs H100 vs H200 vs B200
| GPU | VRAM | Memory BW | FP16 Compute | TDP | Best For |
|---|---|---|---|---|---|
| A100 80GB | 80GB HBM2e | 2TB/s | 312 TFLOPS | 400W | General inference, older deployments |
| H100 SXM5 | 80GB HBM3 | 3.35TB/s | 989 TFLOPS | 700W | Current-gen LLM inference |
| H200 | 141GB HBM3e | 4.8TB/s | 989 TFLOPS | 700W | Large model inference (70B+), RAG |
| B200 | 192GB HBM3e | 8TB/s | ~2,250 TFLOPS | 1,000W | Largest models, Blackwell architecture |
For most AI inference workloads in 2026, H100 SXM5 or H200 are the standard. The jump from A100 to H100 is primarily a memory bandwidth increase (2TB/s to 3.35TB/s), which matters enormously for transformer models that are memory-bandwidth bound.
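The compute-versus-bandwidth tradeoff in the table can be made concrete with a roofline "ridge point" calculation — the arithmetic intensity (FLOPs per byte moved) below which a workload is bandwidth-bound. A sketch using the table's dense FP16 numbers:

```python
def ridge_point_flops_per_byte(peak_tflops: float, mem_bw_tbs: float) -> float:
    """Roofline ridge point: minimum FLOPs per byte of memory traffic
    needed before a kernel becomes compute-bound rather than
    memory-bandwidth-bound.  TFLOPS / (TB/s) = FLOPs/byte."""
    return peak_tflops / mem_bw_tbs

# H100: 989 / 3.35  -> ~295 FLOPs/byte
# A100: 312 / 2.0   -> ~156 FLOPs/byte
# Batch-1 LLM decode streams every weight once per token (~2 FLOPs/byte),
# far below either ridge -> firmly memory-bandwidth bound.
```

This is why the A100-to-H100 bandwidth jump matters more than the FLOPS jump for decode-heavy inference: both GPUs leave most of their compute idle at low arithmetic intensity.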
MIG (Multi-Instance GPU)
MIG allows you to partition a single physical GPU into up to 7 independent instances, each with dedicated VRAM and compute resources. This is critical for cost optimization — instead of running one small model on an expensive GPU and wasting 70% of resources, you partition the GPU and run 3-4 models simultaneously.
```bash
# List available GPU instance profiles
nvidia-smi mig -lgip

# Partition an A100 80GB into four 1g.10gb instances (profile ID 19)
nvidia-smi mig -cgi 19,19,19,19 -C
```

Each MIG instance appears as a separate GPU to CUDA — your inference server just sees 4 GPUs. Monitor each MIG instance independently via its device index.
MIG monitoring caveat: Not all metrics are visible at the MIG partition level. Compute utilization and power are per-instance, but memory bandwidth is shared. Use MIG carefully and validate that your monitoring stack supports MIG-level granularity (DCGM does; some Prometheus exporters do not).
vGPU (Virtual GPU)
vGPU is NVIDIA's virtualization technology for sharing a physical GPU across multiple VMs, used for virtual desktops and fractional-GPU cloud offerings. Note that the large dedicated GPU instances from AWS (p4d.24xlarge, p5.48xlarge), Google Cloud (A2-HighGPU), and Azure (ND A100 v4) generally expose whole GPUs to the VM via PCIe passthrough; vGPU is what providers use when a single physical GPU must back multiple smaller instances.
vGPU monitoring works through the NVIDIA vGPU Manager — physical GPU metrics are aggregated across all vGPU instances running on that physical card. Monitoring vGPU performance requires DCGM with vGPU support and per-vGPU metrics via the vGPU Manager API.
Monitoring Stack: The Open-Source Toolkit
DCGM (Data Center GPU Manager)
DCGM is NVIDIA's official monitoring tool for data center GPUs. It provides GPU health monitoring, performance metrics, and telemetry export to Prometheus — with no instrumentation required in your application code.
The fastest way to get DCGM metrics into Prometheus:
```bash
# Deploy the DCGM exporter as a DaemonSet
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring \
  --create-namespace

# Verify metrics are flowing
kubectl exec -n monitoring ds/dcgm-exporter -- \
  curl localhost:9400/metrics | grep DCGM_FI
```

Key DCGM metrics:

- `DCGM_FI_DEV_GPU_UTIL` — GPU utilization %
- `DCGM_FI_DEV_MEM_COPY_UTIL` — memory bandwidth utilization %
- `DCGM_FI_DEV_FB_USED` — frame buffer (VRAM) used in MB
- `DCGM_FI_DEV_FB_FREE` — frame buffer free in MB
- `DCGM_FI_DEV_POWER_USAGE` — current power draw in watts
- `DCGM_FI_DEV_GPU_TEMP` — temperature in Celsius
- `DCGM_FI_DEV_XID_ERRORS` — Xid error count (GPU hardware errors)
nvidia-smi (NVIDIA System Management Interface)
nvidia-smi is the CLI tool available on every NVIDIA GPU node. For one-off diagnostics and quick checks:
```bash
# Live monitoring
watch -n 1 nvidia-smi

# Query specific metrics
nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.used,memory.total,temperature.gpu,power.draw \
  --format=csv

# Check for Xid errors (GPU hardware errors)
nvidia-smi -q -x | grep -i xid
```

node-exporter + textfile collector
For Kubernetes nodes, the standard Prometheus node-exporter runs as a DaemonSet for system-level metrics. For GPU-specific monitoring, use DCGM exporter as the primary method — it handles all GPU metrics natively without requiring custom shell scripts or textfile collectors. The Helm deployment above (dcgm-exporter) is the recommended approach for Kubernetes clusters.
Kubernetes GPU Scheduling and Monitoring
NVIDIA Device Plugin
The NVIDIA device plugin for Kubernetes exposes GPU resources to pods. Install via Helm:
```bash
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm install nvidia-device-plugin nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace
```

Once installed, request GPUs in pod specs:
```yaml
resources:
  limits:
    nvidia.com/gpu: 2  # request 2 GPUs
```

Monitoring GPUs in Kubernetes
For Kubernetes-native GPU monitoring, use DCGM exporter as a DaemonSet with the Prometheus Operator. The DCGM exporter connects to the NVIDIA device plugin and exposes GPU metrics on port 9400 for Prometheus scraping.
Critical Kubernetes labels for GPU nodes (discovered by GPU Feature Discovery):
- `nvidia.com/gpu.product` — GPU model (e.g., A100-SXM4-80GB)
- `nvidia.com/gpu.memory` — total VRAM
- `topology.kubernetes.io/region` — for cross-region capacity planning
Topology-Aware Scheduling
On multi-GPU nodes (8-GPU servers are common), GPU-to-GPU communication topology matters enormously for distributed inference. On an 8-GPU A100 node:
- NVLink: intra-node GPU-to-GPU bandwidth is ~600 GB/s (vs 32 GB/s PCIe)
- NCCL: handles multi-GPU collective communication
For production multi-GPU inference, validate that your orchestrator is placing pods on GPUs with NVLink connectivity. Tools like the Kubernetes topology-manager and custom schedulers (Volcano, YuniKorn) can enforce topology-aware placement.
```bash
# Check NVLink status
nvidia-smi topo -m
```

Multi-GPU and Distributed Inference Monitoring
Running distributed inference across multiple GPUs (tensor parallelism, pipeline parallelism) introduces monitoring complexity that single-GPU setups do not have.
NCCL Topology Monitoring
NCCL is used by all major inference servers (vLLM, TensorRT-LLM, SGLang) for multi-GPU communication. NCCL performance depends heavily on topology:
- Intra-node NVLink: ~600 GB/s bidirectional bandwidth per link
- Inter-node InfiniBand NDR: ~400 Gbps (~50 GB/s) — roughly 12x slower than NVLink
Monitor for NCCL communication time vs compute time. If NCCL AllReduce is taking 30%+ of your iteration time, you have a communication bottleneck.
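A back-of-envelope model helps set expectations before you profile. The bandwidth term of a ring AllReduce has each GPU send and receive 2(N-1)/N times the payload over its slowest link; this sketch ignores the latency term and assumes the bandwidth figures quoted above:

```python
def ring_allreduce_seconds(payload_bytes: float, n_gpus: int,
                           link_gbs: float) -> float:
    """Bandwidth-only estimate of a ring AllReduce.

    Each of the N ranks moves 2*(N-1)/N times the payload over its
    link; latency and protocol overheads are ignored (assumption).
    """
    traffic = 2 * (n_gpus - 1) / n_gpus * payload_bytes
    return traffic / (link_gbs * 1e9)

# 100 MB of activations across 8 GPUs:
nvlink_ms = ring_allreduce_seconds(100e6, 8, 600) * 1e3  # intra-node NVLink
ib_ms     = ring_allreduce_seconds(100e6, 8, 50) * 1e3   # ~400 Gbps InfiniBand
```

The same transfer is roughly 12x slower over InfiniBand than NVLink, which is why crossing node boundaries with tensor parallelism usually shows up immediately as NCCL time in your profile.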
Tensor Parallelism Health
In tensor parallelism, model weights are sharded across GPUs, requiring constant AllReduce operations. Monitor:
- Per-GPU memory utilization (should be roughly equal if sharding is balanced)
- AllReduce latency per layer (indicates communication bottlenecks)
- GPU-to-GPU bandwidth utilization via NVLink
If GPU 0 has significantly higher memory utilization than GPUs 1-7 in a tensor-parallel setup, you have an unbalanced shard — performance will be bottlenecked by the straggler GPU.
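The balance check above is easy to automate from per-GPU VRAM readings. A minimal sketch — the 10% threshold is an assumed starting point, not a hard rule:

```python
def shard_imbalance(per_gpu_used_bytes: list[float]) -> float:
    """Relative spread of VRAM use across tensor-parallel ranks:
    (max - min) / mean.  Near 0 means balanced sharding; values
    above ~0.1 (assumed threshold) are worth investigating."""
    mean = sum(per_gpu_used_bytes) / len(per_gpu_used_bytes)
    return (max(per_gpu_used_bytes) - min(per_gpu_used_bytes)) / mean
```

Run it against `DCGM_FI_DEV_FB_USED` per device (or pynvml's `nvmlDeviceGetMemoryInfo` across device indices) and alert when the ratio drifts.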
Cost Optimization: Maximize GPU Utilization
GPU monitoring only delivers value if you act on the data. The most impactful optimizations:
1. Dynamic Batching Based on GPU Metrics
Most inference servers use static batch sizes. A smarter approach uses GPU memory utilization as the signal for batch sizing:
```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def get_optimal_batch_size(current_batch: int,
                           target_memory_util: float = 0.80) -> int:
    """Adjust batch size based on GPU memory utilization."""
    mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    memory_util = mem_info.used / mem_info.total
    if memory_util < target_memory_util - 0.10:
        return min(current_batch + 2, 64)   # headroom available: scale up
    elif memory_util > target_memory_util + 0.05:
        return max(current_batch - 1, 1)    # near the limit: scale down
    return current_batch
```

This is a simplified version — production implementations (like vLLM's automatic batch scheduling) do more sophisticated memory management. The key principle: use GPU memory headroom as the signal, not just queue depth.
2. MIG Partitioning for Cost Efficiency
If you are running multiple small models, MIG partitioning can cut GPU costs by 3-4x by sharing a single physical GPU across multiple inference workloads:
```yaml
# Kubernetes: request a MIG instance (mixed MIG strategy)
resources:
  limits:
    nvidia.com/mig-1g.10gb: 1  # one 1g.10gb slice of an A100 80GB
```

Monitor each MIG slice independently to ensure they are not oversubscribed.
3. GPU-Utilization-Based Autoscaling
Scale inference replicas based on GPU utilization, not just request queue length:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-server-gpu-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
    # HPA Resource metrics only support cpu/memory, so GPU utilization
    # must be exposed as a custom Pods metric (e.g., DCGM_FI_DEV_GPU_UTIL
    # surfaced via Prometheus Adapter)
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL
        target:
          type: AverageValue
          averageValue: "75"  # scale out when average GPU util > 75%
```

Alerting: What to Watch For
Set up these alerts for GPU health and efficiency:
Critical (PagerDuty-level)
- GPU Xid Error: `DCGM_FI_DEV_XID_ERRORS > 0` — indicates a GPU hardware error that will cause crashes
- GPU Temperature > 83C (A100) / 87C (H100) — thermal throttling active, throughput degraded
- Uncorrectable ECC Error — GPU memory failing, leads to crashes
- GPU Power Draw > 95% TDP sustained — compute-saturated, monitor for throttling
Warning (Slack-level)
- GPU Utilization < 25% sustained — GPU is idle, investigate batching or scale workloads
- GPU Memory Utilization > 90% — risk of OOM, consider reducing batch size
- GPU Memory Utilization < 30% — memory over-provisioned, can fit larger batch or smaller GPU
- Power Efficiency < 30%: the `gpu_utilization / power_draw` ratio is poor — batch size too low
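The critical alerts above translate directly into Prometheus alerting rules against the DCGM exporter metrics. A sketch — the group name, severity labels, and `for` durations are placeholders to adapt to your routing:

```yaml
groups:
  - name: gpu-health  # placeholder group name
    rules:
      - alert: GPUXidError
        expr: DCGM_FI_DEV_XID_ERRORS > 0
        labels: {severity: critical}
        annotations:
          summary: "GPU hardware (Xid) error detected"
      - alert: GPUThermalThrottleA100
        expr: DCGM_FI_DEV_GPU_TEMP > 83
        for: 5m
        labels: {severity: critical}
      - alert: GPUUnderutilized
        expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 25
        for: 30m
        labels: {severity: warning}
```

Keep the temperature threshold per GPU model (83C for A100, 87C for H100) rather than one global rule.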
Prometheus Queries for GPU Dashboards
```promql
# GPU Utilization (%) — DCGM_FI_DEV_GPU_UTIL is a gauge, so smooth
# with avg_over_time, not rate() (which is for counters)
avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m]))

# GPU Memory Bandwidth Utilization (%)
avg(avg_over_time(DCGM_FI_DEV_MEM_COPY_UTIL[5m]))

# VRAM Used vs Total
DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_FREE + DCGM_FI_DEV_FB_USED)

# Average GPU Temperature
avg(DCGM_FI_DEV_GPU_TEMP)

# GPU Power Draw (watts)
DCGM_FI_DEV_POWER_USAGE

# Per-pod GPU utilization
avg by (exported_pod) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m]))

# Alert: GPU thermal throttling (A100 threshold)
DCGM_FI_DEV_GPU_TEMP > 83
```

Grafana Dashboard Templates
For a ready-made GPU dashboard, import the "NVIDIA DCGM Exporter" dashboard from Grafana Labs (Dashboard ID: 12239). It covers GPU utilization over time, memory utilization, temperature and power, ECC errors and Xid errors, and GPU clock speeds.
For Kubernetes-native GPU monitoring, use the "NVIDIA GPU Dashboard for Kubernetes" (Grafana Labs, Dashboard ID: 13770) which correlates GPU metrics with Kubernetes pod assignments.
If you prefer to build your own, start with these panels:
- GPU Utilization (time series, per-GPU): `DCGM_FI_DEV_GPU_UTIL`
- VRAM Pressure (gauge, %): `DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_FREE + DCGM_FI_DEV_FB_USED) * 100`
- Temperature (time series, with alert threshold line): `DCGM_FI_DEV_GPU_TEMP`
- Power Draw (time series, watts): `DCGM_FI_DEV_POWER_USAGE`
- Xid Errors (stat panel, count): `DCGM_FI_DEV_XID_ERRORS`
The Bottom Line
GPU monitoring is the foundation of inference efficiency. Without it, you are guessing whether your GPU budget is well-spent. The key metrics to track continuously:
- GPU Utilization — is the GPU doing work or sitting idle?
- VRAM Utilization — are you fitting the right batch size?
- Temperature — thermal throttling silently kills performance
- Power Efficiency — watts consumed per token generated
- NCCL Communication (multi-GPU) — is cross-GPU communication bottlenecking throughput?
The monitoring stack is straightforward: DCGM + Prometheus + Grafana gives you full observability with open-source tooling. The ROI is concrete — every 10% improvement in GPU utilization on an H100 paying $3/hour in cloud costs saves $0.30/hour per GPU, or $2,628/GPU/year.
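The ROI arithmetic generalizes to any rate and utilization gain. A trivial sketch (on-demand rate and gain are inputs you supply):

```python
def annual_savings_per_gpu(util_gain: float, hourly_rate: float) -> float:
    """Dollars of effective capacity recovered per GPU per year.

    util_gain is the utilization improvement as a fraction
    (0.10 = 10 percentage points); hourly_rate is on-demand $/hr.
    """
    return util_gain * hourly_rate * 24 * 365

annual_savings_per_gpu(0.10, 3.00)  # ~= 2628, matching the figure above
```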