Commercial LLM observability platforms like Helicone, Portkey, and LangSmith have solved the "getting started" problem. Point your OpenAI API at their proxy, and you get tracing, metrics, and dashboards in minutes. But that convenience comes with a price: per-token fees that scale with your usage, vendor lock-in that makes migration painful, and data processing agreements that your legal team will dread.
For teams running AI infrastructure at scale - particularly those self-hosting models on-premises or in a VPC - the open source path is not just viable, it is often preferable. You own your data, you own your infrastructure, and your observability costs stay flat regardless of how many tokens you process.
This guide builds a complete open source LLM monitoring stack from scratch: five open source tools that together give you the same visibility as a commercial LLM observability platform, for the cost of running the infrastructure alone.
Why Open Source for LLM Monitoring?
Before the tools, the question: when does going open source make sense?
Data privacy is non-negotiable. If you are processing prompts that contain proprietary code, customer data, or legally sensitive information, sending that data to a third-party observability service may be a compliance blocker. Healthcare companies, financial services, and defense contractors often fall into this category. An on-premises observability stack keeps all data within your environment, no data processing agreements required.
Cost at scale is different. Commercial platforms typically charge per million tokens processed. At 10 billion tokens per month - a realistic scale for a mid-sized AI product - those per-token fees compound fast. The equivalent open source infrastructure costs what your cloud bills already cost, plus a modest overhead for the observability layer.
Customization and depth. Commercial platforms surface what their product team decided to instrument. When you run open source, you instrument exactly what you care about. For teams that have built proprietary evaluation frameworks or have domain-specific quality metrics, open source gives you the flexibility to capture and query exactly those signals.
The tradeoff is real: open source requires more setup time (expect 4-8 hours for a production-ready deployment) and ongoing maintenance. For early-stage teams or those just starting with LLM operations, the commercial platforms are probably the right call. But if any of the above conditions apply to you, read on.
OpenTelemetry - Instrumenting Your LLM Calls
OpenTelemetry (OTel) is the open standard for distributed systems observability. It provides a unified framework for collecting traces, metrics, and logs - three signal types that together give you complete visibility into an LLM request lifecycle.
What to Capture in an LLM Span
When a user prompt enters your system and a response comes out, that is a single logical operation - but internally it involves multiple steps: prompt preprocessing, embedding generation, retrieval, context assembly, model inference, and response post-processing. Each step is a span in the trace.
A well-instrumented LLM span captures:
- Model identification: model name, version, provider
- Token counts: prompt tokens, completion tokens, total
- Latency breakdown: time to first token (TTFT), time per output token (TPOT), total end-to-end latency
- Invocation metadata: temperature, top-p, seed (for reproducibility)
- Semantic identifiers: user ID, session ID, request ID for correlation
- Cost signals: estimated cost per request (using model pricing data)
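Cost attribution starts from a pricing table keyed by model. A minimal sketch - the model names and per-million-token prices below are illustrative placeholders, not real quotes:

```python
# Hypothetical pricing table - USD per 1M tokens; numbers are illustrative, not real quotes
PRICING = {
    "gpt-4o": {"prompt": 2.50, "completion": 10.00},
    "llama-3-70b": {"prompt": 0.0, "completion": 0.0},  # self-hosted: no per-token fee
}

def estimate_cost_usd(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate request cost from a per-million-token pricing table."""
    p = PRICING[model]
    return (prompt_tokens * p["prompt"] + completion_tokens * p["completion"]) / 1_000_000

# A 2,000-in / 500-out request on the hosted model
print(estimate_cost_usd("gpt-4o", 2000, 500))  # → 0.01
```

Attach the result as a span attribute (e.g. `llm.cost.estimated_usd`) so cost rolls up in your trace backend alongside latency.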
Python Instrumentation with OpenInference
The openinference-instrumentation library (from Arize) instruments LLM providers automatically. It wraps OpenAI, Anthropic, vLLM, and other providers with OTel spans and captures the signals above:
pip install openinference-instrumentation openinference-instrumentation-openai opentelemetry-exporter-otlp

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from openinference.instrumentation.openai import OpenAIInstrumentor
# Initialize OTel
provider = TracerProvider()
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://collector:4317"))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
# Auto-instrument all LLM calls
# Auto-instrument all OpenAI client calls
OpenAIInstrumentor().instrument()

Once instrumented, every LLM API call produces a span with model, token counts, latency, and custom attributes. The same pattern works for Anthropic, vLLM, and any other provider with an OTel instrumentation package.
For vLLM specifically, the openinference-instrumentation-vllm package captures the full prefill/decode cycle with memory utilization signals.
The OTel Collector - Your Observability Router
The OpenTelemetry Collector is a vendor-neutral proxy that receives observability data and routes it to backends. It decouples your instrumentation from your storage - you can send traces to Jaeger for development and Grafana Tempo for production without changing any code.
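The Collector's behavior is driven by the config file mounted in the compose snippet below. A minimal `otel-collector-config.yaml` sketch that receives OTLP and forwards traces to Tempo - the `tempo:4317` endpoint is an assumption about service naming in this stack:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: {}              # batch spans before export to reduce network overhead

exporters:
  otlp/tempo:
    endpoint: tempo:4317   # assumed Tempo service name in this stack
    tls:
      insecure: true       # fine inside a private network; use TLS across trust boundaries

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
```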
# docker-compose snippet for the OTel Collector
otel-collector:
  image: otel/opentelemetry-collector-contrib:latest
  command: ["--config=/etc/otel-collector-config.yaml"]
  volumes:
    - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
  ports:
    - "4317:4317"   # OTLP gRPC
    - "4318:4318"   # OTLP HTTP
    - "8888:8888"   # Prometheus metrics exposed by collector

Prometheus - Scraping vLLM Metrics
Prometheus is the standard for metrics collection in cloud-native environments. vLLM exposes a native /metrics endpoint in Prometheus format, making it a natural fit for the monitoring stack.
vLLM Metrics That Matter
The key metrics from vLLM's Prometheus endpoint:
- vllm:gpu_cache_usage_ratio: KV cache utilization. 0.85-0.95 is healthy; above 0.98 means OOM risk
- vllm:num_generation_tokens_total: cumulative tokens generated - use it to calculate throughput
- vllm:num_queue_tokens: tokens waiting in the request queue - spikes here indicate saturation
- vllm:time_to_first_token_seconds: TTFT histogram - critical for streaming UX
- vllm:time_per_output_token_seconds: TPOT histogram - reflects model inference speed per token
- vllm:e2e_request_latency_seconds: total request latency from receipt to final token
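These metrics translate directly into dashboard queries with standard PromQL. Two examples, assuming the metric names above as exposed by vLLM:

```
# Generation throughput (tokens/sec), averaged over the last 5 minutes
rate(vllm:num_generation_tokens_total[5m])

# TTFT p99 from the histogram buckets
histogram_quantile(0.99, sum by (le) (rate(vllm:time_to_first_token_seconds_bucket[5m])))
```

The same `histogram_quantile` pattern works for TPOT and end-to-end latency by swapping the bucket metric.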
Prometheus Scrape Config for vLLM
# prometheus.yml
scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['vllm-server:8000']   # vLLM's /metrics endpoint
    metrics_path: '/metrics'
    scrape_interval: 15s
    scrape_timeout: 10s

If you are running vLLM behind a Kubernetes Service, use the prometheus-operator with a ServiceMonitor CRD for declarative Prometheus discovery:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-monitor
spec:
  selector:
    matchLabels:
      app: vllm
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics

Grafana Dashboard for vLLM
The most actionable dashboard panels for production monitoring:
- TTFT p50/p95/p99 over time: if p99 climbs above 2 seconds on your target model, you have a prefill bottleneck
- TPOT distribution: a rising TPOT over weeks is a signal of model degradation or KV cache pressure
- KV cache hit rate (if using prefix caching): a low hit rate means you are recomputing prefixes that could have been served from cache
- Request queue depth: the number of queued requests is your backpressure signal - when this climbs, scale horizontally
Grafana - Unified Visualization Layer
Grafana is the visualization layer for the entire stack. Traces from Tempo, metrics from Prometheus, and logs from Loki all flow into the same Grafana instance, so you can correlate across signal types - jumping from a metric spike to the offending trace to the exact log line - in a single UI.
The LLM Monitoring Dashboard Layout
A production-ready LLM monitoring dashboard has four quadrants:
Top-left - Request Volume and Cost. Tokens processed per hour (prompt and completion separately), estimated cost per hour based on your model pricing table, and cost per 1,000 requests. If your cost-per-1k-requests climbs week-over-week without a corresponding traffic increase, something has changed in your prompt patterns.
Top-right - Latency Percentiles. TTFT and TPOT at p50, p95, p99. The goal is TTFT p99 under 2 seconds for most models. TPOT should be stable; a climbing TPOT is an early warning of GPU saturation.
Bottom-left - Error Rates. API errors by error code (auth failures, rate limits, upstream errors), broken connection rates, and timeout percentages. Rate limit errors (429) deserve their own panel - they indicate your load shedding is working correctly, but if 429s spike, you may need to implement better batching or model routing.
Bottom-right - GPU Utilization (for self-hosted). GPU memory utilization, GPU compute utilization, and KV cache usage. The gap between your GPU memory utilization and your actual throughput is the signal that tells you whether you are underutilizing or oversubscribing your GPU.
Grafana Cloud vs. Self-Hosted
Grafana Cloud's hosted offering (starting at free) includes managed Prometheus and Grafana instances, which eliminates the operational burden of running your own Prometheus cluster. For most teams under 50 hosts, the hosted offering is cost-effective. The managed Prometheus has ingestion limits - check the current limits before committing, as AI workloads can generate high metric volumes.
If you need unlimited ingestion or must run in an air-gapped environment, self-hosted Grafana with a Prometheus backend is the path.
Loki - Aggregating LLM Logs
Prometheus gives you quantitative metrics. Traces give you request-level visibility. But the full prompt-and-response log - what your model actually received and returned - requires a log aggregation system. This is critical for debugging quality issues, auditing model behavior, and investigating incidents.
Grafana Loki is the log aggregation system designed to work with Grafana and Prometheus. Unlike Elasticsearch, Loki does not index log content - it indexes log labels. This makes it dramatically cheaper to operate at scale, and it fits LLM logs well: you filter on a handful of low-cardinality labels (model, environment, job) and extract high-cardinality fields like user_id and request_id from the structured log body at query time, rather than paying for full-text indexing.
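To get labels into Loki, the shipping agent attaches them at ingest time. A minimal Promtail scrape sketch - the log path and label names are assumptions for this stack:

```yaml
scrape_configs:
  - job_name: llm-requests
    static_configs:
      - targets: [localhost]
        labels:
          job: llm-requests
          __path__: /var/log/llm/*.log   # assumed location of the JSON request logs
    pipeline_stages:
      - json:
          expressions:
            model: model
      - labels:
          model:   # promote only low-cardinality fields to labels
```

Keep high-cardinality fields (user_id, request_id) out of the labels block - exploding label cardinality is the main way to make Loki expensive.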
Structured Logging for LLM Requests
import json
import logging

from opentelemetry import trace

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("llm-monitor")

def log_llm_request(prompt_tokens, completion_tokens, model, latency_ms, cost):
    # get_span_context() is the current OTel Python API (span.context is deprecated)
    span_ctx = trace.get_current_span().get_span_context()
    logger.info(
        json.dumps({
            "event": "llm_request",
            "model": model,
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "latency_ms": latency_ms,
            "estimated_cost_usd": cost,
            # hex-encoded trace ID so Loki log lines can be joined to Tempo traces
            "trace_id": format(span_ctx.trace_id, "032x") if span_ctx.is_valid else None,
        })
    )

Loki's logcli tool lets you query logs directly from the terminal:
# Query Loki for all requests from a specific user in the last hour
logcli query '{job="llm-requests"} |= "user_id=user_abc123"' --since=1h
# Count error rate by model over the last 30 minutes
logcli query 'sum by (model) (rate({job="llm-requests"} | json | status="error" [5m]))' --step=1m --since=30m

Log Volume Management
LLM logs are verbose. A single request with 2,000 tokens in and 500 tokens out generates a JSON log line of several kilobytes. Two mitigations: sample aggressively in non-production environments, and configure Loki's retention policies to auto-delete logs older than 90 days.
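The 90-day cutoff is a Loki configuration setting. A sketch of compactor-based retention, assuming a recent Loki release where the compactor handles deletion:

```yaml
# Loki retention sketch: the compactor deletes chunks older than retention_period
limits_config:
  retention_period: 2160h   # 90 days

compactor:
  retention_enabled: true
```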
Alertmanager - Alerting That Does Not Suck
The final component: getting alerted when something breaks. The pitfall most teams fall into is alert fatigue - 300 Slack notifications per day of monitoring noise, until a real incident arrives and nobody is looking. SLO-based alerting fixes this.
SLO-Based Alerting for LLM Systems
Define two SLOs for your LLM service:
- Latency SLO: 95% of requests complete within TTFT threshold (e.g., 1.5 seconds for a 7B model)
- Availability SLO: 99.5% of requests return a successful response (non-5xx, non-timeout)
The burn rate alerting pattern: alert on two windows. Fast burn (5% budget burned in 1 hour) pages immediately. Slow burn (1% budget burned in 6 hours) routes to Slack. This means you get paged on real incidents while Slack catches the slow degradations.
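Translated into PromQL, the two windows become two expressions. Burn rate is (budget fraction burned) x (SLO window / alert window): with a 30-day window, 5% in 1 hour is a burn rate of 36, and 1% in 6 hours is 1.2. The `llm_requests_*` metric names below are placeholders for whatever counters your service exposes:

```
# Fast burn (page): 1h error rate exceeds 36x the 0.5% error budget
(sum(rate(llm_requests_errors_total[1h])) / sum(rate(llm_requests_total[1h]))) > 36 * 0.005

# Slow burn (Slack): 6h error rate exceeds 1.2x the error budget
(sum(rate(llm_requests_errors_total[6h])) / sum(rate(llm_requests_total[6h]))) > 1.2 * 0.005
```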
# Prometheus alerting rule snippet (Alertmanager handles the routing)
groups:
  - name: llm-slo-alerts
    rules:
      - alert: LLMLatencySLOBurn
        expr: |
          (sum(rate(vllm:e2e_request_latency_seconds_bucket{le="1.5"}[1h]))
            / sum(rate(vllm:e2e_request_latency_seconds_count[1h])))
          < 0.95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "LLM latency SLO burning fast - e2e latency p95 above threshold"

The Full Stack - Putting It Together
Here is how all five components connect:
- Application → OTel SDK: Your Python application is instrumented with OpenTelemetry. Every LLM call produces a span captured by the OTel SDK.
- OTel SDK → OTel Collector: Spans are exported via OTLP to the OTel Collector running as a sidecar or a central service.
- OTel Collector → Tempo / Loki / Prometheus: The Collector fans out: trace spans go to Grafana Tempo, structured JSON logs go to Loki, and metrics are exposed on the Collector's Prometheus exporter endpoint, where Prometheus scrapes them.
- Prometheus / Tempo / Loki → Grafana: Grafana queries all three data sources. You correlate a latency spike in Prometheus to the specific trace in Tempo to the actual prompt/response log in Loki - all in the same UI, with one click.
- Prometheus → Alertmanager → PagerDuty / Slack: When SLO burn rates exceed thresholds, Prometheus Alertmanager fires alerts routed to PagerDuty (for critical) or Slack (for warnings).
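Wired together, the whole stack fits in one compose file. A skeleton using the common default ports - image tags are `latest` here for brevity; pin versions in production:

```yaml
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes: ["./otel-collector-config.yaml:/etc/otel-collector-config.yaml"]
    ports: ["4317:4317", "4318:4318"]
  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo.yaml"]
    volumes: ["./tempo.yaml:/etc/tempo.yaml"]
  loki:
    image: grafana/loki:latest
    ports: ["3100:3100"]
  prometheus:
    image: prom/prometheus:latest
    volumes: ["./prometheus.yml:/etc/prometheus/prometheus.yml"]
    ports: ["9090:9090"]
  alertmanager:
    image: prom/alertmanager:latest
    ports: ["9093:9093"]
  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
```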
Open Source vs. Commercial - When to Use Which
The commercial platforms - Helicone, Portkey, LangSmith - are purpose-built for teams using hosted LLMs via API. They solve the instrumentation problem in one line of code and provide dashboards that are immediately useful. If you are in the evaluation or early-production phase, use them.
The open source stack described here is for teams that have moved past evaluation and need full ownership of their observability infrastructure. The crossover point is roughly when your LLM API spend crosses $5,000/month - at that scale, the per-token fees a commercial platform layers on top start to rival the engineering cost of maintaining the OSS stack, and the comparison is worth running.
There is also a hybrid approach: use a commercial platform for your hosted model traces and the OSS stack for your self-hosted inference infrastructure. The commercial platform handles OpenAI and Anthropic calls; vLLM metrics flow into your Prometheus. Most mature AI systems end up here.
Conclusion
The open source LLM monitoring stack is not a lesser alternative to commercial platforms - it is a different tool for a different stage. Early-stage teams benefit from the speed and simplicity of commercial platforms. Teams that have crossed into serious scale, data sensitivity requirements, or self-hosted infrastructure benefit from the control and cost predictability of the OSS path.
The setup time is real - budget 4-8 hours for a production-ready deployment if you are familiar with Prometheus and Grafana, longer if you are learning them from scratch. But once it is running, it runs. Your observability costs stay flat regardless of token volume, your data never leaves your infrastructure, and you have complete flexibility in what you measure.
For a deep-dive guide on vLLM-specific metrics and Prometheus configuration, see our article on vLLM Production Monitoring. For a comparison of commercial LLM observability platforms, see LLM Observability Tools in 2026.