Kubernetes has won. It is the de facto orchestration layer for production workloads, from stateless microservices to GPU-accelerated inference servers. But Kubernetes visibility is not solved — most teams have monitoring stacks that give them data, not signal.

This guide builds a complete production-ready monitoring stack: kube-prometheus-stack for metrics collection, Grafana dashboards for visualization, kube-state-metrics for cluster state, and Cilium eBPF for network observability without sidecars.

The Core Stack: kube-prometheus-stack

The fastest path to a working Kubernetes monitoring stack is the kube-prometheus-stack Helm chart. It packages Prometheus, Alertmanager, node-exporter, kube-state-metrics, and pre-built Grafana dashboards into a single deployment with sensible defaults.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=15d \
  --set prometheus.prometheusSpec.replicas=2 \
  --set alertmanager.alertmanagerSpec.storage.volumeClaimTemplate.spec.storageClassName=standard \
  --set alertmanager.alertmanagerSpec.storage.volumeClaimTemplate.spec.resources.requests.storage=10Gi

This gives you a 15-day retention window, two Prometheus replicas for HA, and persistent storage for Alertmanager so you do not lose silences on restart. After installing, port-forward to access Grafana:

kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80

Default credentials are admin / prom-operator — change these immediately in production via the --set grafana.adminPassword= flag on install.
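
The same settings can live in a Helm values file instead of --set flags. A minimal sketch, assuming the kube-prometheus-stack chart layout; the adminPassword value is a placeholder, and in production you would point grafana.admin.existingSecret at a pre-created Secret rather than put the password in a file:

```yaml
# values.yaml, applied with:
#   helm upgrade --install kube-prometheus-stack \
#     prometheus-community/kube-prometheus-stack -n monitoring -f values.yaml
grafana:
  adminPassword: "change-me-now"   # placeholder; rotate immediately
prometheus:
  prometheusSpec:
    retention: 15d
    replicas: 2
```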

Kube-State-Metrics: What It Actually Emits

Kube-state-metrics listens to the Kubernetes API and emits metrics about the desired vs. actual state of your cluster objects. The key metrics and what they tell you:

  • kube_deployment_status_replicas_available — are your replicas actually running? Compare against desired to detect rollout stalls.
  • kube_pod_container_status_running — which containers are actually running? A 0 means the container is waiting or terminated, often because its pod is pending or failed.
  • kube_pod_container_resource_limits vs kube_pod_container_resource_requests — are you requesting the right resources? Containers that exceed their memory limit get OOMKilled, and limits set far above requests oversubscribe the node.
  • kube_persistentvolume_status_phase — is your storage actually bound and healthy? Pending volumes block pod scheduling.
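
Two of the checks above expressed as queries. A sketch: the 2x headroom factor is an illustrative choice, not a kube-state-metrics recommendation.

```promql
# Containers whose memory limit is less than 2x their request (tight headroom)
kube_pod_container_resource_limits{resource="memory"}
  < 2 * kube_pod_container_resource_requests{resource="memory"}

# PersistentVolumes stuck in Pending
kube_persistentvolume_status_phase{phase="Pending"} == 1
```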

The critical query for deployment health:

# Deployment availability rate
kube_deployment_status_replicas_available
  / kube_deployment_spec_replicas

# Alert if below 100% for more than 5 minutes
- alert: DeploymentReplicasUnavailable
  expr: kube_deployment_status_replicas_available / kube_deployment_spec_replicas < 1
  for: 5m
  labels:
    severity: warning

Node-Exporter: Host-Level Metrics

The node-exporter DaemonSet runs on every node and exposes hardware and OS-level metrics. Once deployed via kube-prometheus-stack, query node resource utilization across your fleet:

# Node CPU utilization (1 - idle), averaged across cores per node
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Node memory utilization
1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

# Node disk I/O utilization (% of time the device was busy)
rate(node_disk_io_time_seconds_total[5m]) * 100

# Filesystem utilization (0-1)
1 - node_filesystem_avail_bytes / node_filesystem_size_bytes

For a practical Grafana panel showing node CPU across the cluster, set the unit to percentunit (0-1 range) and configure thresholds: green below 70%, yellow 70-85%, red above 85%.

Prometheus at Scale: Thanos or Mimir

The operational challenge with Prometheus at scale is storage. Prometheus is not infinitely scalable: a single instance handles up to roughly 1M active time series before you hit performance limits. For larger clusters, migrate to Thanos for long-term storage and global query federation, or to Mimir, Grafana's horizontally scalable Prometheus-compatible backend with a built-in Alertmanager component.

The Thanos pattern: sidecar containers in each Prometheus pod upload TSDB blocks to object storage (S3 or GCS) every 2 hours. The store gateway then serves historical data from object storage. The querier component provides a unified PromQL interface across all Prometheus instances, so your Grafana dashboards work across your entire fleet.

A minimal Thanos sidecar configuration in your Helm values (note the kube-prometheus-stack chart also exposes a native prometheus.prometheusSpec.thanos block that can inject the sidecar for you):

prometheus:
  prometheusSpec:
    containers:
      - name: thanos-sidecar
        image: quay.io/thanos/thanos:v0.36.0
        args:
          - sidecar
          - --prometheus.url=http://localhost:9090
          - --tsdb.path=/prometheus
          - --objstore.config-file=/etc/thanos/object-store.yaml
        volumeMounts:
          - name: prometheus-data
            mountPath: /prometheus
          - name: thanos-objstore
            mountPath: /etc/thanos
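
The sidecar's --objstore.config-file flag points at a Thanos object storage config. A minimal S3 sketch, where the bucket name, endpoint, and region are placeholders for your own storage:

```yaml
# object-store.yaml, mounted at /etc/thanos/object-store.yaml
type: S3
config:
  bucket: "metrics-long-term"              # placeholder bucket
  endpoint: "s3.us-east-1.amazonaws.com"   # placeholder endpoint
  region: "us-east-1"
```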

SLO-Based Alerting for Kubernetes Services

The biggest operational failure in Kubernetes observability is alert fatigue. SRE teams that receive hundreds of alerts per week start ignoring all of them. Define SLOs for your Kubernetes services: latency SLO (p95 response time under Xms), availability SLO (error rate below Y%), and throughput SLO (requests per second above Z). Then alert on SLO burn rate — not individual metric thresholds.

The multi-window burn rate alerting pattern from the SRE workbook: alert on 2% of the budget burned in 1 hour (fast burn, page immediately) and 5% burned in 6 hours (slow burn, send to Slack). This ties alerting to user impact rather than infrastructure noise.

# Kubernetes SLO alert: API server availability
# 99.9% availability means a 0.1% error budget per month
# Fast burn: 2% of budget in 1h (14.4x burn rate) = page immediately
# Slow burn: 5% of budget in 6h (6x burn rate) = Slack alert

- alert: K8sAPIServerFastBurn
  expr: |
    (sum(rate(apiserver_request_total{code=~"5.."}[1h]))
    / sum(rate(apiserver_request_total[1h]))) > (14.4 * 0.001)
  for: 5m
  labels:
    severity: critical

- alert: K8sAPIServerSlowBurn
  expr: |
    (sum(rate(apiserver_request_total{code=~"5.."}[6h]))
    / sum(rate(apiserver_request_total[6h]))) > (6 * 0.001)
  for: 30m
  labels:
    severity: warning

eBPF for Network Observability

Cilium has become the CNI of choice for teams that need network observability without application changes. Cilium's eBPF data path gives you per-connection metrics — TCP retransmits, connection latency, HTTP error rates — without any sidecar proxies or application instrumentation.

For AI workloads running in Kubernetes, Cilium's network observability is particularly valuable because GPU-intensive workloads tend to have distinctive network patterns: large burst transfers for model weights, persistent connections for streaming inference, and sensitive latency requirements for real-time serving.

The Cilium agent exposes Prometheus metrics once enabled (on port 9962 in recent releases, 9090 in older ones). Key metrics to track:

# Cilium network errors
cilium_drop_count_total
cilium_forward_count_total

# TCP flow stats (via Hubble metrics, if enabled)
hubble_tcp_flags_total

# L7 HTTP traffic visibility (requires Hubble with L7 visibility enabled)
hubble_http_requests_total
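
A packet drop ratio built from the drop and forward counters above; a sketch you would tune to your own label set:

```promql
# Fraction of packets dropped per node over 5m
sum by (instance) (rate(cilium_drop_count_total[5m]))
  / (sum by (instance) (rate(cilium_forward_count_total[5m]))
     + sum by (instance) (rate(cilium_drop_count_total[5m])))
```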

Key Grafana Dashboards to Enable

The kube-prometheus-stack ships with pre-built dashboards that cover most of what you need. After installing, go to Dashboards → Manage in Grafana and enable these key ones:

  • Kubernetes / Compute Resources / Cluster — node CPU and memory across your entire cluster with utilization percentages
  • Kubernetes / Compute Resources / Namespace (Pods) — per-namespace pod-level CPU and memory with the ability to drill down into specific workloads
  • Kubernetes / Networking / Pod — per-pod network I/O if Cilium is deployed as your CNI
  • Kubernetes / Persistent Volumes — PVC status, capacity, and utilization tracking across all mounted volumes

If you are running GPU workloads, add the DCGM exporter dashboard (available on grafana.com as "NVIDIA DCGM Exporter Dashboard"). It shows GPU utilization, memory utilization, temperature, and power draw for each node.

GPU Monitoring in Kubernetes

For AI inference workloads, GPU monitoring is non-negotiable. NVIDIA's DCGM (Data Center GPU Manager) Exporter runs as a DaemonSet and exposes GPU metrics via Prometheus. Install via the DCGM exporter Helm chart:

helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring

Key GPU metrics to track:

# GPU utilization (percent)
DCGM_FI_DEV_GPU_UTIL

# GPU memory bandwidth (copy engine) utilization, percent
DCGM_FI_DEV_MEM_COPY_UTIL

# GPU framebuffer memory used, MiB
DCGM_FI_DEV_FB_USED

# GPU temperature in Celsius (throttling risk above 83C)
DCGM_FI_DEV_GPU_TEMP

# GPU power draw in watts
DCGM_FI_DEV_POWER_USAGE
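
A temperature alert built on these metrics, using the 83C throttling figure noted above; the threshold and duration are starting points to tune, not universal values:

```yaml
- alert: GPUTemperatureHigh
  expr: DCGM_FI_DEV_GPU_TEMP > 83
  for: 5m
  labels:
    severity: warning
```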

For vLLM specifically, the vllm:gpu_cache_usage_perc gauge (KV-cache utilization, 0-1) is the primary OOM predictor. Track it with a warning at 85% and a critical alert at 95%:

- alert: VLLMGPUCacheHigh
  expr: vllm:gpu_cache_usage_perc > 0.85
  for: 2m
  labels:
    severity: warning

- alert: VLLMGPUCacheCritical
  expr: vllm:gpu_cache_usage_perc > 0.95
  for: 1m
  labels:
    severity: critical

Multi-Cluster Monitoring: Three Patterns at Different Scales

When you run multiple Kubernetes clusters — staging, production, and maybe a dedicated GPU cluster for inference — you need unified visibility. Three patterns work at different scales:

Pattern 1: Thanos Receive (better for 2–5 clusters)
Thanos Receive ingests remote Prometheus metrics via the remote-write protocol. Each cluster ships its metrics to a central Thanos Receive gateway, which then stores them in object storage. Your Grafana querier talks to the Thanos Query layer, which federates across all clusters transparently.
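
In each source cluster, this is a remote_write entry in the Prometheus config (or prometheus.prometheusSpec.remoteWrite in the chart); the Receive endpoint here is a placeholder, and the relabel rule is an optional example:

```yaml
remote_write:
  - url: "https://thanos-receive.example.com/api/v1/receive"  # placeholder endpoint
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "go_.*"          # example: drop noisy runtime series before shipping
        action: drop
```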

Pattern 2: Grafana Federation (simpler, limited scale)
Grafana's built-in data source federation lets you add each Prometheus as a data source and query across them. This works for 3–4 clusters but does not solve global alerting or long-term storage.

Pattern 3: Grafana Cloud Kubernetes Monitoring (managed, priced per-series)
Grafana Cloud's Kubernetes integration automatically discovers clusters, scrapes metrics, and gives you pre-built dashboards without any Prometheus operational overhead. The trade-off is cost — Grafana Cloud charges by active series and data ingestion volume. A production cluster with 50,000 active series runs about $90/month on Grafana Cloud's pay-as-you-go plan.

Prometheus Operator pattern (for GitOps teams)
If you manage Kubernetes declaratively with ArgoCD or Flux, the Prometheus Operator CRDs let you define Prometheus, Alertmanager, and ServiceMonitor objects as Kubernetes resources. Your monitoring configuration is then version-controlled and deployed like any other workload, with the intent expressed in your Helm values or kustomize overlays.
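
For illustration, a minimal ServiceMonitor; the app label and port name are placeholders for your own Service:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app                  # placeholder
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: my-app               # placeholder label on the target Service
  endpoints:
    - port: http-metrics        # placeholder named port on the Service
      interval: 30s
```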

Kubernetes Monitoring Tools Comparison

For teams evaluating monitoring approaches, here is how the major options compare for Kubernetes:

Tool                    Best For                              Deployment            Cost                  Complexity
kube-prometheus-stack   Self-hosted, full control             Helm on your K8s      Free (infra only)     Medium
Datadog                 Enterprise, unified APM + infra       DaemonSet + Operator  $15+/host/month       Low
Grafana Cloud           Teams wanting managed Prometheus      SaaS                  $90+/cluster/month    Very Low
New Relic               Teams already in New Relic ecosystem  Agent-based           $0.10/GB ingested     Low
Thanos + Mimir          Multi-cluster, long retention         Self-hosted           Infrastructure only   High
Cilium + Hubble         Network observability focus           CNI replacement       Free (open source)    High

When Datadog makes sense: If your team is under 10 people and spending more than 20 hours/month on Prometheus operations, Datadog's $15/host pricing pays for itself. The Kubernetes dashboard auto-discovers services, pods, and deployments without any ServiceMonitor configuration. The downside: Datadog's agent runs as a DaemonSet and consumes roughly 2–3% CPU on each node.

When Grafana Cloud makes sense: Small teams (1–5 engineers) who want Prometheus-grade visibility without Prometheus operational expertise. The Grafana Kubernetes Dashboard is genuinely good — better than anything you will build yourself in the first 6 months.

When self-hosted kube-prometheus-stack makes sense: Teams with dedicated platform or SRE engineers, cost-sensitive organizations, or teams that need deep customization of their alerting rules and metric cardinality.

Sample Grafana Dashboard: Cluster Overview

A practical Cluster Overview dashboard you can import directly into Grafana. Go to Dashboards → Import and paste this JSON:

{
  "title": "StackPulsar / Kubernetes Cluster Overview",
  "panels": [
    {
      "title": "CPU Utilization",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
      "targets": [{"expr": "1 - avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m]))", "legendFormat": "{{instance}}"}],
      "fieldConfig": {"defaults": {"unit": "percentunit", "thresholds": {"mode": "absolute", "steps": [{"value": 0, "color": "green"}, {"value": 0.7, "color": "yellow"}, {"value": 0.85, "color": "red"}]}}}
    },
    {
      "title": "Memory Utilization",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
      "targets": [{"expr": "1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)", "legendFormat": "{{instance}}"}]
    },
    {
      "title": "Deployment Replica Availability",
      "type": "gauge",
      "gridPos": {"h": 8, "w": 8, "x": 8, "y": 8},
      "targets": [{"expr": "sum(kube_deployment_status_replicas_available) / sum(kube_deployment_spec_replicas)"}],
      "fieldConfig": {"defaults": {"unit": "percentunit", "min": 0, "max": 1, "thresholds": {"mode": "absolute", "steps": [{"value": 0, "color": "red"}, {"value": 0.9, "color": "yellow"}, {"value": 1, "color": "green"}]}}}
    }
  ]
}

Change the namespace and instance label filters to match your cluster topology.

Expanded AI/ML Workload Monitoring

Beyond GPU metrics, AI inference workloads in Kubernetes have unique monitoring requirements that standard Kubernetes dashboards do not capture:

Inference latency SLOs: For real-time inference (chat completions, embeddings), your SLO is typically p95 latency under 500ms for embeddings and under 2s for chat completions. Monitor these with:

# vLLM end-to-end request latency, p95
histogram_quantile(0.95, rate(vllm:e2e_request_latency_seconds_bucket[5m]))

# Request throughput
rate(vllm:request_success_total[5m])

# TensorRT-LLM time to first token (TTFT)
histogram_quantile(0.95, rate(tensorrt_llm_request_forward_duration_seconds_bucket[5m]))

Batch inference monitoring: For batch inference jobs — bulk processing, fine-tuning data preparation — the key metric is batch completion time versus deadline. Track with:

# Jobs marked complete (kube-state-metrics)
kube_job_complete{condition="true"}

# Pod restart count (high restarts = OOMKilled or preemption); the metric
# carries namespace/pod/container labels, not a service label
rate(kube_pod_container_status_restarts_total{namespace="batch-inference"}[1h])

Multi-tenant GPU cluster scheduling: If you run shared GPU clusters for multiple teams, monitor GPU share allocation versus utilization. The nvidia.com/gpu resource requests should track against actual compute to identify underutilized allocations:

# GPU utilization vs requested GPUs, joined on pod
# (assumes DCGM exporter is configured to attach Kubernetes pod labels;
#  kube-state-metrics sanitizes the resource name to nvidia_com_gpu)
avg by (pod) (DCGM_FI_DEV_GPU_UTIL)
  / on(pod) group_left()
    sum by (pod) (kube_pod_container_resource_requests{resource="nvidia_com_gpu"})

A high ratio (utilization far above the requested share) means teams are under-requesting and getting throttled. A low ratio (requests far above actual utilization) means wasted budget; those GPUs could be reallocated to serve more inference requests.

Recommended Tool Datadog

Full-stack monitoring for Kubernetes — auto-discovers pods, deployments, and services. Includes APM, log management, and network monitoring in one agent. 14-day free trial.

What to Deploy Today

Start with kube-prometheus-stack — it gets you 80% of production-grade Kubernetes monitoring in a single Helm install. Then layer in:

  • Cilium for network observability (replaces kube-proxy with eBPF)
  • DCGM exporter if you are running GPUs
  • Thanos sidecar for long-term metric retention
  • The Grafana dashboard above for instant cluster visibility

The ROI on monitoring investment is highest before you have your first major incident. Once you have experienced a 3am page on a production Kubernetes outage with no visibility into why your pods are OOMKilled, you will understand exactly what you should have instrumented.

Recommended Tool DigitalOcean

Spin up a managed Kubernetes cluster in minutes. DigitalOcean Kubernetes includes built-in monitoring, auto-scaling, and a simple UI — starting at $6/mo.

Recommended Tool Grafana Cloud

Managed Prometheus, Grafana, and alerting — no operational overhead. Grafana Cloud Kubernetes Monitoring auto-discovers your clusters and ships with pre-built dashboards. Free tier available.

Recommended Tool Weights & Biases

ML experiment tracking and model monitoring that integrates with Kubernetes. Track training runs, log inference metrics, and visualize model drift — free for individuals, teams from $20/month.