Observability has been a buzzword since 2019, but in 2026 it means something fundamentally different from what it meant five years ago. The shift from reactive alerting to predictive intelligence, from monoliths to distributed AI pipelines, and from human triage to autonomous incident response has changed what it means to run observable infrastructure.
From Metrics to Semantic Observability
Traditional observability - metrics, logs, traces - remains the foundation. But it is no longer sufficient on its own. The breakthrough in 2025-2026 has been semantic observability: systems that understand what the metrics mean in context, not just that they crossed a threshold.
Consider a Prometheus metric like request_latency_seconds p99: 2.3. A traditional alert fires when this crosses 2.0 seconds. A semantically observable system asks: what was the upstream service doing at that time? Did a model deployment cause increased memory pressure? Did a config change alter the tokenization pipeline? The metric alone cannot answer these questions. You need correlation across layers - infrastructure, application, and AI-specific telemetry.
OpenTelemetry has become the standard for this correlation layer. The OTLP protocol lets you pipe metrics, traces, and logs from Kubernetes, your inference servers, your vector database, and your LLM API calls into a unified store. The critical AI-specific signals that teams are now instrumenting include: prompt token count, completion token count, context window utilization, model version, temperature settings, and time-to-first-token for streaming responses.
The Three Pillars of AI Pipeline Observability
If you are running production AI systems, three distinct layers need instrumentation:
Data Pipeline Layer: Embedding generation latencies, vectorization batch sizes, RAG retrieval precision at k, chunk overlap ratios, and corpus freshness timestamps. If your retrieval is slow or returning stale context, your LLM will hallucinate regardless of how well you tune the model.
Inference Layer: GPU memory utilization, KV cache hit rates, prefill vs. decode throughput ratios, batch queue depths, and per-request latency percentiles. For vLLM specifically, the vllm:gpu_cache_usage metric is the primary OOM predictor. Track it with a warning at 85% and a critical alert at 95%.
Application Layer: End-to-end request latency, error rates by error type, prompt injection detection events, and user-reported quality scores. The gap between internal quality metrics and user perception is where most AI product teams are flying blind.
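The retrieval precision at k mentioned for the data pipeline layer can be computed from a labeled evaluation set. The following is a minimal sketch, assuming you have chunk IDs for each query's retrieved results and a hand-labeled set of relevant chunk IDs; the function name and the sample IDs are illustrative, not taken from any particular library.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved chunk IDs that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant)
    # Divide by k (not len(top_k)) so that returning too few results is penalized.
    return hits / k


# Hypothetical evaluation sample: ranked retrieval output and labeled ground truth.
retrieved = ["c1", "c7", "c3", "c9", "c4"]
relevant = {"c1", "c3", "c5"}

print(precision_at_k(retrieved, relevant, k=5))  # 2 of the top 5 are relevant
print(precision_at_k(retrieved, relevant, k=1))  # the top result is relevant
```

Emitting this value as a gauge per evaluation run gives you a time series you can alert on when a corpus update or embedding change degrades retrieval.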
SLO-Based Alerting for AI Services
The SRE approach to alerting - alerting on error budget burn rate rather than individual metric thresholds - applies directly to AI services, with one critical adaptation. For a standard API service, an SLO might be: 99% of requests complete in under 500ms. For an AI inference service, you need separate SLOs for:
- Time to first token (streaming UX): p99 under 1.5 seconds for models up to 13B parameters
- End-to-end request latency: p99 under 30 seconds for completion tasks
- Context window utilization: average utilization under 80% to prevent OOM on long prompts
- RAG retrieval precision: top-1 cosine similarity above 0.75 on held-out evaluation set
Alert on 1% of error budget consumed in 10 minutes for all of these. This ties alerting to user impact rather than arbitrary thresholds.
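To see what "1% of error budget in 10 minutes" implies in practice, it helps to convert it into a burn rate and an error-rate threshold. This is a sketch of the arithmetic only; the 30-day SLO window is an assumption, since the text does not state one.

```python
def burn_rate(budget_fraction, alert_window_min, slo_window_days=30):
    """Burn rate that consumes `budget_fraction` of the budget in the alert window.

    A burn rate of 1.0 spends exactly the whole budget over the SLO window.
    """
    slo_window_min = slo_window_days * 24 * 60
    return budget_fraction * slo_window_min / alert_window_min


def error_rate_threshold(slo_target, rate):
    """Observed error ratio at which the given burn rate is reached."""
    return rate * (1.0 - slo_target)


rate = burn_rate(budget_fraction=0.01, alert_window_min=10)
threshold = error_rate_threshold(slo_target=0.99, rate=rate)

print(f"burn rate: {rate:.1f}x")
print(f"error-rate threshold for a 99% SLO: {threshold:.3f}")
```

Under these assumptions the alert fires at a burn rate of about 43x, which for a 99% SLO means roughly 43% of requests missing the target over the 10-minute window. That catches fast, severe burns; a second, slower window is usually needed to catch gradual degradation.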
eBPF: Zero-Instrumentation Network Visibility
For Kubernetes-based AI infrastructure, eBPF-based monitoring (Cilium, Pixie, Parca) has become essential. The value for AI workloads is particularly high because GPU-enabled nodes are expensive and often oversubscribed. eBPF can give you per-pod network I/O, TCP retransmit rates, and connection pooling efficiency without any application-level instrumentation.
The practical pattern: deploy Cilium as your CNI to get L7 network visibility (HTTP and gRPC request tracing) without modifying your application code. Combined with Prometheus scraping of the Cilium metrics endpoint, you get a complete picture of inter-service communication patterns - critical for understanding why your retrieval service is slow during peak inference load.
Autonomous Incident Response
The frontier of AI operations is autonomous incident response. When a model quality regression is detected - either through automated evaluation or user reports - the system should automatically: identify which recent change caused the regression (model version, prompt template, retrieval corpus update), roll back if possible, and page the responsible team with a root cause summary.
This is not science fiction. Teams running production LLM systems in 2026 are building these pipelines using the same observability data described above, with decision trees encoded in runbooks and executed by infrastructure-as-code tooling. The key enabler is evaluation infrastructure - you cannot detect regressions if you are not continuously evaluating model outputs against a golden dataset.
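The detection and attribution steps can be sketched in a few lines. This is a deliberately simplified illustration, not a production design: the `Change` record, the score tolerance, and the "most recent change is the first suspect" heuristic are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Change:
    kind: str          # e.g. "model_version", "prompt_template", "retrieval_corpus"
    identifier: str
    deployed_at: datetime


def is_regression(baseline_score, current_score, tolerance=0.02):
    """True when the golden-dataset score dropped by more than the tolerance."""
    return (baseline_score - current_score) > tolerance


def first_suspect(changes, detected_at):
    """Most recent change deployed at or before detection, or None if there is none."""
    candidates = [c for c in changes if c.deployed_at <= detected_at]
    return max(candidates, key=lambda c: c.deployed_at, default=None)


changes = [
    Change("model_version", "v12", datetime(2026, 1, 10, 9, 0)),
    Change("prompt_template", "support-v4", datetime(2026, 1, 12, 14, 30)),
    Change("retrieval_corpus", "snapshot-0113", datetime(2026, 1, 13, 8, 0)),
]

if is_regression(baseline_score=0.91, current_score=0.84):
    suspect = first_suspect(changes, detected_at=datetime(2026, 1, 12, 18, 0))
    print(f"roll back candidate: {suspect.kind} {suspect.identifier}")
```

A real pipeline would confirm the suspect by re-running the evaluation against the previous version before rolling back, since the most recent change is not always the cause.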
GPU Monitoring for AI Inference: Beyond CPU Metrics
Standard Kubernetes CPU and memory metrics miss what matters most for AI inference workloads: GPU utilization, memory bandwidth, and KV cache pressure. If you're running vLLM, TensorRT-LLM, or any GPU-accelerated inference, you need dedicated GPU observability.
NVIDIA's Data Center GPU Manager (DCGM) Exporter is the standard for this. It runs as a DaemonSet and exposes GPU metrics via Prometheus. Key metrics that matter for AI inference:
- DCGM_FI_DEV_GPU_UTIL: GPU compute utilization (%). If this is below 60% on a busy inference node, you likely have a batching or scheduling problem.
- DCGM_FI_DEV_MEM_COPY_UTIL: GPU memory bandwidth utilization (%). High memory copy utilization alongside low compute utilization indicates memory-bound operations, common during large batch preprocessing.
- DCGM_FI_DEV_GPU_TEMP: GPU temperature in Celsius. Thermal throttling kicks in at around 83°C on many GPUs (the exact threshold depends on the model) and lowers clock speeds, which silently degrades throughput and latency. Alert at 80°C.
- DCGM_FI_DEV_POWER_USAGE: Power draw in watts. Sudden drops under steady load can indicate thermal throttling; sustained highs indicate heavy load.
- DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_FB_FREE: Framebuffer (VRAM) used and free, in MiB. Derive VRAM utilization as used / (used + free). If you're consistently above 90%, you risk OOM on long context windows.
For vLLM specifically, the vllm:gpu_cache_usage metric is your primary OOM predictor. The KV cache lives in GPU memory. When it fills up, vLLM preempts running requests and later recomputes or swaps their cache blocks, so latency and throughput degrade. Set a warning at 85% and a critical alert at 95%:

    # The exact metric name varies by vLLM release (for example
    # vllm:gpu_cache_usage_perc); confirm it against your server's /metrics output.
    - alert: VLLMGPUCacheHigh
      expr: vllm:gpu_cache_usage > 0.85
      for: 2m
      labels:
        severity: warning
    - alert: VLLMGPUCacheCritical
      expr: vllm:gpu_cache_usage > 0.95
      for: 1m
      labels:
        severity: critical

The gap between allocated VRAM and actual usage is your inefficiency signal. If you're allocating 80% of VRAM for KV cache but only using 50%, you could serve larger batches or longer context windows with the same hardware.
Reserved Instances and Savings Plans: FinOps for AI Infrastructure
For production AI inference at scale, compute costs dominate. GPU nodes on AWS, GCP, or Azure can run $2-10 per hour. Reserved Instance (RI) coverage and Savings Plans are the single most impactful FinOps decision for AI infrastructure teams.
The math: on-demand A100 instances on AWS run about $4.60/hr each. A 1-year RI brings that to roughly $2.75/hr, a 40% savings. At continuous use, that difference is about $16,000 per node per year, or roughly $162,000 per year for a team running 10 GPU nodes.
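The arithmetic is worth checking explicitly, since the per-node and per-fleet figures differ by an order of magnitude. The sketch below uses the hourly rates quoted above and assumes the nodes run 24 hours a day for a 365-day year.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, assuming continuous use


def annual_savings(on_demand_rate, reserved_rate, nodes, hours=HOURS_PER_YEAR):
    """Yearly savings in dollars from paying the reserved rate instead of on-demand."""
    return (on_demand_rate - reserved_rate) * nodes * hours


def discount(on_demand_rate, reserved_rate):
    """Fractional discount of the reserved rate relative to on-demand."""
    return 1.0 - reserved_rate / on_demand_rate


print(f"discount:         {discount(4.60, 2.75):.1%}")
print(f"savings, 1 node:  ${annual_savings(4.60, 2.75, nodes=1):,.0f}")
print(f"savings, 10 nodes: ${annual_savings(4.60, 2.75, nodes=10):,.0f}")
```

At these rates the saving is $1.85 per node-hour, which works out to roughly $16,200 per node per year and about $162,000 per year across ten nodes.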
Three patterns for AI infrastructure RI strategy:
Pattern 1: Baseline RI for Predictable Batch Workload
If you run batch inference on a fixed schedule (overnight fine-tuning data processing, morning report generation), reserve enough instances to cover your baseline plus 20%. Handle burst demand with on-demand spike capacity. This typically covers 70-80% of your total GPU-hours at 40% lower cost.
Pattern 2: Savings Plans for Inference Endpoints
For inference servers with variable traffic, AWS Savings Plans or GCP Committed Use Discounts work better than RIs because they apply to usage that fluctuates. The commitment gives you a discounted rate on all usage up to the committed amount, with on-demand pricing above it — no hard cutoff like RI coverage gaps.
Pattern 3: Spot Instances for Fault-Tolerant Batch AI
Training jobs and batch inference are interruption-tolerant. Use spot/preemptible instances at 60-70% off for fault-tolerant workloads. Set up checkpointing so a 2-hour training job can resume from the last checkpoint after a preemption. For vLLM batch inference, use chunked processing with intermediate result persistence.
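Chunked processing with intermediate result persistence can be sketched as follows. This is a minimal illustration using local JSON files; the `run_batch` function, the file naming scheme, and the `infer` callable are assumptions for the example, and a real job on spot instances would write to durable storage rather than local disk.

```python
import json
import tempfile
from pathlib import Path


def run_batch(prompts, out_dir, infer, chunk_size=2):
    """Process prompts in chunks, skipping chunks that were already persisted."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    results = []
    for start in range(0, len(prompts), chunk_size):
        chunk_path = out_dir / f"chunk_{start:06d}.json"
        if chunk_path.exists():
            # Completed before a previous interruption: reuse the saved output.
            chunk_results = json.loads(chunk_path.read_text())
        else:
            chunk_results = [infer(p) for p in prompts[start:start + chunk_size]]
            # Write to a temporary file and rename, so a preemption mid-write
            # never leaves a partial chunk that looks finished.
            tmp_path = chunk_path.with_suffix(".tmp")
            tmp_path.write_text(json.dumps(chunk_results))
            tmp_path.replace(chunk_path)
        results.extend(chunk_results)
    return results


# Usage with a stand-in for the real inference call.
with tempfile.TemporaryDirectory() as workdir:
    outputs = run_batch(["alpha", "beta", "gamma"], workdir, infer=str.upper)
    print(outputs)
```

Rerunning the same call after a preemption recomputes only the chunks without a persisted file, so the cost of an interruption is bounded by one chunk of work.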
The monitoring metric for RI coverage efficiency: sum(gpu_hours_reserved) / sum(gpu_hours_total). If your coverage ratio drops below 60%, you're paying on-demand premiums unnecessarily. If it exceeds 95%, you have probably over-committed: any dip in demand leaves reserved capacity idle while you keep paying for it.
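The coverage ratio and its two thresholds translate directly into a small check. The function names and the sample GPU-hour figures below are illustrative; in practice the inputs would come from your billing export or cost-allocation tags.

```python
def ri_coverage(reserved_gpu_hours, total_gpu_hours):
    """Share of consumed GPU-hours that were billed at a reserved rate."""
    if total_gpu_hours <= 0:
        raise ValueError("total_gpu_hours must be positive")
    return reserved_gpu_hours / total_gpu_hours


def coverage_status(ratio, low=0.60, high=0.95):
    """Classify a coverage ratio against the 60% and 95% thresholds."""
    if ratio < low:
        return "under-covered"
    if ratio > high:
        return "over-committed"
    return "ok"


# Hypothetical month: 5,400 reserved GPU-hours out of 7,200 consumed.
ratio = ri_coverage(5400, 7200)
print(f"{ratio:.0%} -> {coverage_status(ratio)}")
```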
Tag every GPU node with team, workload type, and environment (prod/staging/dev). This lets you attribute costs precisely — AI infrastructure costs tend to hide in shared cluster resources without proper tagging.
What to Instrument Today
If you are building this for the first time in 2026, start with the minimum viable observability stack: Prometheus for metrics, Loki for logs, Tempo for traces, Grafana for visualization. Then add the AI-specific exporters for your inference layer (vLLM has a native Prometheus endpoint, OpenAI-compatible APIs can use the openai-observer library). The ROI on observability investment is highest before you have your first major incident - once you have experienced a 3am page on a production AI outage, you will understand exactly what you should have instrumented.
OpenTelemetry: The Universal Correlating Layer
OpenTelemetry has emerged as the observability layer that connects infrastructure metrics, application traces, and AI-specific telemetry into a unified pipeline. For Kubernetes-based AI infrastructure, OTel is now the de facto standard because it provides correlation IDs that let you trace a user request from the ingress controller, through your inference gateway, into your LLM provider, through your vector retrieval, and back to the response — all with a single trace ID.
The practical setup for an AI inference service with OTel:
    # OTel Collector config for AI inference workloads
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      prometheus:
        config:
          scrape_configs:
            - job_name: 'vllm'
              metrics_path: /metrics
              static_configs:
                - targets: ['vllm-service:8000']
    processors:
      # memory_limiter runs first so it can apply backpressure before batching
      memory_limiter:
        check_interval: 1s
        limit_mib: 512
      batch:
        timeout: 1s
    exporters:
      prometheusremotewrite:
        endpoint: https://your-prometheus-endpoint/api/v1/write
      # Recent Collector releases no longer ship a dedicated jaeger exporter;
      # Jaeger accepts OTLP directly, so traces go out over OTLP gRPC instead.
      otlp/jaeger:
        endpoint: your-jaeger-endpoint:4317
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp/jaeger]
        metrics:
          receivers: [prometheus]
          processors: [memory_limiter, batch]
          exporters: [prometheusremotewrite]

The critical AI-specific span attributes you should instrument on every LLM call:
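    from opentelemetry import trace

    tracer = trace.get_tracer(__name__)

    # The span name and the values are illustrative; read the real values
    # from the request and response objects of your LLM client.
    with tracer.start_as_current_span("llm.completion") as span:
        span.set_attribute("ai.request.model", "gpt-4o")
        span.set_attribute("ai.request.max_tokens", 4096)
        span.set_attribute("ai.request.temperature", 0.7)
        span.set_attribute("ai.input_tokens", 2340)
        span.set_attribute("ai.output_tokens", 892)
        span.set_attribute("ai.latency.first_token_ms", 1200)
        span.set_attribute("ai.latency.total_ms", 3400)

These attributes let you slice LLM performance by model version, prompt length, temperature setting, and token count, the four variables that most affect inference cost and quality. OpenTelemetry's GenAI semantic conventions define comparable gen_ai.* attribute names (for example gen_ai.request.model and gen_ai.usage.input_tokens), which are worth adopting if you want off-the-shelf tooling to recognize them.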
Observability Patterns for Agentic AI Systems
Agentic AI systems — where LLMs take multi-step actions, call tools, and maintain state across interactions — require a fundamentally different observability approach than simple inference. The challenge is that a single "request" can generate dozens of LLM calls, tool executions, and state mutations, all of which need to be traceable as a single conversation unit.
For agentic systems, you need to instrument at three additional layers:
- Tool call tracing: Every tool invocation (search, code execution, database queries) should be a child span with the tool name, input, output size, and execution time as attributes. This lets you identify which tools are slow or frequently failing.
- State mutation tracking: If your agent updates a memory store, writes to a database, or modifies shared state, log the mutation with a correlation ID so you can replay the conversation.
- Reasoning trace export: Some LLM tracing platforms (LangSmith, Weights & Biases, Arize Phoenix) can capture the agent's intermediate steps, such as prompts, model outputs, and tool calls, as structured trace data, along with token log probabilities where the model API returns them. This is invaluable for debugging why an agent made a bad decision.
    # Example: tool call span in an agentic system
    from opentelemetry import trace

    tracer = trace.get_tracer(__name__)

    # Started inside the request's active span, this becomes a child span.
    with tracer.start_as_current_span("tool:vector_search") as span:
        span.set_attribute("tool.name", "pinecone_query")
        span.set_attribute("tool.input_tokens", 128)
        # ... tool execution ...
        # Record outcome attributes once the tool call has returned.
        span.set_attribute("tool.results_count", 5)
        span.set_attribute("tool.execution_ms", 45)
        span.set_attribute("tool.output_tokens", 890)

The highest-value metric for agentic systems is tool call success rate by tool type. If your vector search is failing 15% of the time and your agent is configured to retry up to 3x, a retrieval step that exhausts its retries at, for example, 150 ms per attempt adds 450 ms of latency, and that cost stays hidden inside the overall request time. Instrument the tool call, not just the overall request.
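The cost of retries can be estimated from the failure rate and the per-attempt latency. The sketch below assumes each attempt fails independently with the same probability and ignores backoff delays, both simplifications; the 45 ms per-attempt figure is taken from the span example above.

```python
def expected_attempts(failure_prob, max_retries):
    """Expected number of attempts when each attempt fails independently.

    Attempt i (0-based) only happens if the previous i attempts all failed,
    which has probability failure_prob ** i.
    """
    if not 0.0 <= failure_prob < 1.0:
        raise ValueError("failure_prob must be in [0, 1)")
    return sum(failure_prob ** i for i in range(max_retries + 1))


def retry_latency_ms(per_attempt_ms, failure_prob, max_retries):
    """Return (expected latency, worst-case latency, probability all attempts fail)."""
    expected = per_attempt_ms * expected_attempts(failure_prob, max_retries)
    worst = per_attempt_ms * (max_retries + 1)
    all_fail = failure_prob ** (max_retries + 1)
    return expected, worst, all_fail


expected, worst, all_fail = retry_latency_ms(45, failure_prob=0.15, max_retries=3)
print(f"expected: {expected:.1f} ms, worst case: {worst} ms, "
      f"all attempts fail: {all_fail:.4%}")
```

Under these assumptions, a 15% failure rate raises the average step latency by about 18%, while the worst case is four times the single-attempt latency. The tail, not the mean, is what users notice, which is why per-tool spans matter.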
Reference Architecture: Production AI Observability Stack
Here is the reference architecture that combines all the patterns described above into a production-ready stack you can deploy in a day with open-source tooling:
- Data collection: Prometheus for metrics (kube-prometheus-stack), Loki for logs, Tempo for traces, OTel Collector as the aggregation layer
- AI inference telemetry: vLLM Prometheus endpoint for GPU and inference metrics, openai-observer for API-level tracing, Helicone for LLM cost and quality tracking
- Storage: Prometheus for 15-day hot storage, Thanos sidecar for long-term S3/GCS storage and federated querying
- Visualization: Grafana with pre-built dashboards for Kubernetes, GPU monitoring (DCGM), and AI inference SLOs
- Alerting: Alertmanager with routing to PagerDuty (critical), Slack (warnings), and email (informational)
For teams running multiple AI models (a mix of OpenAI, Anthropic, self-hosted vLLM), the critical capability is a unified cost and quality dashboard that lets you compare per-token cost versus per-request latency across all providers. Build this with Grafana's multi-tenant data source federation and a shared metrics namespace.
The Observability Maturity Model
Not every team needs the full reference architecture on day one. Here is the maturity model that maps investment level to operational capability:
- Level 0 — Metrics only: CPU, memory, disk. You know when a server is down, not why. Cost: free. Engineering effort: 1 hour.
- Level 1 — APM integration: Datadog or New Relic APM agent deployed. You see request traces and errors. Cost: $15/host/month. Engineering effort: 2 hours.
- Level 2 — Kubernetes-native with AI telemetry: kube-prometheus-stack + DCGM exporter + vLLM metrics. You see cluster, pod, and GPU metrics. Cost: infrastructure only. Engineering effort: 1 day.
- Level 3 — Full OTel pipeline: OTel Collector with traces, metrics, and logs correlated. You can trace a request end-to-end. Cost: infrastructure + engineering. Engineering effort: 1 week.
- Level 4 — Autonomous incident response: Evaluation pipeline, regression detection, automated rollback. Cost: significant. Engineering effort: 1 quarter.
Most teams should target Level 2 as a starting point and Level 3 as the 6-month goal. Level 4 is where the frontier is — only teams with dedicated MLOps engineers and an existing evaluation culture should attempt it.
Grafana Cloud gives you the full observability stack — Prometheus metrics, Loki logs, Tempo traces, and a unified dashboard — without managing infrastructure. The free tier includes 10K active metrics and 50GB logs. For AI inference workloads, Grafana's GPU monitoring dashboards integrate directly with vLLM and TensorRT-LLM Prometheus endpoints.