1. Introduction
Every production LLM application needs observability. Not optional, not "nice to have" — essential. Token costs compound silently, latency spikes destroy user experience, and model hallucinations propagate before anyone catches them.
The standard approach is to reach for a commercial observability platform. And that works — until you hit per-seat pricing at scale, vendor lock-in on your telemetry data, or compliance requirements that keep your metrics inside your own infrastructure.
This tutorial shows you how to build a complete, open-source LLM monitoring stack using OpenTelemetry for instrumentation, Prometheus for metrics collection, and Grafana for visualization. It carries no licence fees (you pay only for the infrastructure it runs on), keeps all data in your own environment, and connects directly to the broader cloud-native ecosystem.
By the end you will have: auto-instrumented LLM calls with token counting, latency histograms, cost calculation, error tracking, and a Grafana dashboard that shows your LLM health at a glance.
2. Why OpenTelemetry for LLM Monitoring?
OpenTelemetry (OTel) has become the de facto standard for observability instrumentation in cloud-native applications. The project combines tracing, metrics, and logging under a single vendor-neutral SDK with auto-instrumentation agents that require zero code changes for common frameworks.
For LLM applications, OTel solves three problems that make native LLM API monitoring painful:
Token opacity. The LLM API returns token usage with each response, but it does not aggregate that usage, tell you what a request cost, or show whether that cost is trending up. OTel's counters and histograms make this visible.
Cross-service traces. A real LLM application does more than call the model. It retrieves context from a vector database, runs retrieval pipelines, applies guardrails, and formats responses. OTel traces let you see the full request lifecycle.
Ecosystem integration. Prometheus scrapes the metrics endpoint that the OTel Collector exposes. Grafana ships data sources for Prometheus and Tempo. The Collector has exporters for Kafka, object storage such as S3, and most major observability backends. You are not locked into any particular vendor.
3. Architecture Overview
The stack has four layers:
┌─────────────────────────────────────────────────────┐
│ Your LLM Application (Python/Node)                  │
│ OpenTelemetry SDK auto-instruments LLM calls        │
└────────────────┬────────────────────────────────────┘
                 │ OTLP (gRPC/HTTP)
┌────────────────▼────────────────────────────────────┐
│ OpenTelemetry Collector                             │
│ Receives traces + metrics, processes and exports    │
└────────────────┬────────────────────────────────────┘
                 │
     ┌───────────┴───────────┐
     │                       │
     ▼                       ▼
┌──────────┐            ┌──────────┐
│Prometheus│            │  Tempo   │
│(metrics) │            │ (traces) │
└────┬─────┘            └────┬─────┘
     │                       │
     └───────────┬───────────┘
                 ▼
            ┌──────────┐
            │ Grafana  │
            │Dashboards│
            └──────────┘
This architecture keeps all telemetry data inside your infrastructure. No third-party SDKs, no data leaving your network.
4. Prerequisites
You need:
- Python 3.9+ (we'll use the OpenTelemetry Python SDK)
- A running LLM application (OpenAI API, Anthropic, or local vLLM)
- Docker and Docker Compose (for Prometheus + Grafana)
- Optional: Kubernetes (all concepts apply to K8s deployments)
5. Step 1 — Install OpenTelemetry SDK
The OpenTelemetry Python ecosystem provides auto-instrumentation packages for the openai, anthropic, and litellm libraries that require no code changes. These are separate packages layered on top of the SDK: you install them, set environment variables, and your LLM calls are instrumented automatically.
# Core OTel SDK
pip install opentelemetry-api \
  opentelemetry-sdk \
  opentelemetry-exporter-otlp
# Auto-instrumentation for Python apps
pip install opentelemetry-instrumentation-openai \
  opentelemetry-instrumentation-anthropic
# Resource semantic conventions (for adding LLM-specific attributes)
pip install opentelemetry-semantic-conventions
Auto-instrumentation runs as a wrapper around your existing application. You do not modify your code:
opentelemetry-instrument \
  --service_name "my-llm-app" \
  --exporter_otlp_endpoint "http://localhost:4317" \
  python your_app.py
6. Step 2 — Define LLM-Specific Metrics
Auto-instrumentation captures spans, but for a complete LLM monitoring setup you want custom metrics that make sense for language models. Add these to your application:
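from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource
# Create a resource with your service identity
resource = Resource.create({"service.name": "llm-chatbot"})
# Register a meter provider that exports to the Collector over OTLP.
# Skip this block when the app is launched with opentelemetry-instrument,
# which configures the provider for you.
reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://localhost:4317", insecure=True)
)
metrics.set_meter_provider(MeterProvider(resource=resource, metric_readers=[reader]))
# Set up the meter (creates metrics)
meter = metrics.get_meter("llm_monitor")
# Token counters
llm_tokens_in = meter.create_counter(
    "llm.tokens.input",
    description="Total input tokens consumed",
    unit="1",
)
llm_tokens_out = meter.create_counter(
    "llm.tokens.output",
    description="Total output tokens generated",
    unit="1",
)
# Cost histogram (in USD)
llm_cost = meter.create_histogram(
    "llm.cost.usd",
    description="LLM inference cost in USD",
    unit="USD",
)
# Latency histogram
llm_latency = meter.create_histogram(
    "llm.latency.ms",
    description="LLM inference latency in milliseconds",
    unit="ms",
)
# Quality signal: hallucination score (0-1)
# create_gauge requires a recent opentelemetry-api release
llm_quality = meter.create_gauge(
    "llm.hallucination.score",
    description="Hallucination probability score (0=clean, 1=likely hallucination)",
    unit="1",
)
Record metrics inside your inference call: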
import time
import openai
def call_llm(prompt: str, model: str = "gpt-4") -> dict:
    start = time.monotonic()
    # Legacy openai<1.0 interface; with openai>=1.0 use
    # client.chat.completions.create(...) and attribute access (response.usage)
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    latency_ms = (time.monotonic() - start) * 1000
    usage = response["usage"]
    cost = calculate_cost(usage["prompt_tokens"], usage["completion_tokens"], model)
    # Record metrics, labelled by model so dashboards can group on it
    llm_tokens_in.add(usage["prompt_tokens"], {"model": model})
    llm_tokens_out.add(usage["completion_tokens"], {"model": model})
    llm_cost.record(cost, {"model": model})
    llm_latency.record(latency_ms, {"model": model})
    return response
def calculate_cost(prompt_tokens: int, completion_tokens: int, model: str) -> float:
    # Example per-token prices in USD; check your provider's current price list
    pricing = {
        "gpt-4": (0.00003, 0.00006),  # ($/token) prompt, completion
        "gpt-4o": (0.000005, 0.000015),
        "gpt-3.5-turbo": (0.0000015, 0.000002),
    }
    # Models missing from the table are recorded with zero cost
    p, c = pricing.get(model, (0.0, 0.0))
    return (prompt_tokens * p) + (completion_tokens * c)
7. Step 3 — Add Semantic Conventions for LLM Spans
OpenTelemetry's semantic conventions standardize span and metric attribute names. Use the LLM conventions to add model and request context:
from contextlib import contextmanager
from opentelemetry import trace
from opentelemetry.trace import SpanKind, Status, StatusCode
tracer = trace.get_tracer(__name__)
@contextmanager
def wrap_llm_span(span_name: str, model: str, prompt_tokens: int):
    # Open a CLIENT span with the request attributes and hand it to the caller
    with tracer.start_as_current_span(
        span_name,
        kind=SpanKind.CLIENT,
        attributes={
            "gen_ai.system": "openai",
            "gen_ai.request.model": model,
            "gen_ai.request.max_tokens": 2048,
            "gen_ai.usage.input_tokens": prompt_tokens,
        },
    ) as span:
        yield span
def record_response(span, response: dict) -> None:
    # After the call, update the span with response attributes
    span.set_attribute("gen_ai.response.id", response["id"])
    span.set_attribute("gen_ai.response.model", response["model"])
    span.set_attribute(
        "gen_ai.response.finish_reasons",
        [response["choices"][0]["finish_reason"]],
    )
    span.set_status(Status(StatusCode.OK))
The key gen_ai.* attributes follow the OTel semantic conventions for Generative AI. They let you filter and group traces in Grafana by model, system, and operation, which is critical for multi-model deployments.
8. Step 4 — Deploy the OpenTelemetry Collector
The OTel Collector receives telemetry from your application and exports it to your backends. Deploy it as a Docker sidecar or Kubernetes DaemonSet:
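# otel-collector.yaml
# OpenTelemetryCollector is a custom resource and requires the OpenTelemetry Operator.
# For Docker, save the contents of `config` as a plain Collector config file instead.
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: llm-otel-collector
spec:
  mode: daemonset
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        timeout: 1s
        send_batch_size: 1024
      memory_limiter:
        check_interval: 1s
        limit_mib: 512
    exporters:
      prometheus:
        endpoint: "0.0.0.0:8889"
        # No namespace here: the metric names already start with llm,
        # and a namespace would be prepended to every exported name.
        const_labels:
          service: llm-monitor
      otlp/tempo:
        endpoint: tempo:4317
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp/tempo]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [prometheus]
The Collector exposes a Prometheus metrics endpoint (8889) that Prometheus scrapes. This decouples your application from the metrics backend: if Prometheus is down, your app keeps running, and the Collector keeps serving the latest cumulative values for the next successful scrape. Exported metric names depend on the exporter's normalization settings (namespace prefix, unit and _total suffixes), so check the Collector's /metrics output and adjust the queries in the following steps if the names differ.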
9. Step 5 — Configure Prometheus to Scrape OTel Metrics
Add a scrape config to Prometheus:
# prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8889']
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        replacement: 'llm-app'
If you are on Kubernetes, use the Prometheus Operator with a ServiceMonitor:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: otel-collector
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: opentelemetry
  endpoints:
    - port: prometheus
      path: /metrics
10. Step 6 — Build the Grafana Dashboard
Now visualize the data. Create a Grafana dashboard with four panels:
Panel 1: Token Usage Over Time
# Input token rate (tokens per minute; rate() itself returns per-second values)
rate(llm_tokens_input_total[5m]) * 60
# Output token rate (tokens per minute)
rate(llm_tokens_output_total[5m]) * 60
Group by model to see a per-model breakdown. A spike in input tokens is often the first signal of a prompt engineering experiment going wrong.
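PromQL's rate() returns a per-second average, so a per-minute figure needs a factor of 60. The sketch below shows that arithmetic on two counter samples. It is a simplification: Prometheus also extrapolates to the window edges and handles counter resets, and the names Sample and simple_rate are hypothetical, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: float  # seconds since some reference point
    value: float      # cumulative counter value at that time


def simple_rate(first: Sample, last: Sample) -> float:
    """Average per-second increase between two samples of a counter."""
    elapsed = last.timestamp - first.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in chronological order")
    return (last.value - first.value) / elapsed


# Illustrative values: 30,000 input tokens consumed over a 5-minute window
start = Sample(timestamp=0.0, value=120_000)
end = Sample(timestamp=300.0, value=150_000)

per_second = simple_rate(start, end)
per_minute = per_second * 60
```

With these made-up samples the counter grows by 30,000 tokens in 300 seconds, which is 100 tokens per second or 6,000 per minute.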
Panel 2: LLM Cost Accumulation
# Cost over the last hour in USD
sum(increase(llm_cost_usd_sum[1h]))
# Cost per model over the last hour
sum by (model) (increase(llm_cost_usd_sum[1h]))
Set a budget alert: sum(increase(llm_cost_usd_sum[24h])) > 100 triggers when spend over the last 24 hours crosses $100.
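A budget check is more reliable on the growth of the cost counter over a window than on its raw value, because the raw value drops back to zero whenever the process restarts. The following sketch illustrates that idea under simplifying assumptions: it ignores the extrapolation Prometheus applies, and the function name increase and the sample values are made up for illustration.

```python
def increase(samples: list[float]) -> float:
    """Total growth of a cumulative counter across consecutive samples.

    A drop between two samples is treated as a restart from zero, so the
    value after the drop counts as new growth.
    """
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        total += (curr - prev) if curr >= prev else curr
    return total


# Illustrative cost counter in USD; the process restarts after the third sample
cost_samples = [40.0, 55.0, 70.0, 5.0, 20.0]

spend = increase(cost_samples)
BUDGET_USD = 45.0  # assumed budget for the window
over_budget = spend > BUDGET_USD
```

Reading only the latest raw value (20.0) would miss the spend that happened before the restart; the windowed increase recovers it.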
Panel 3: Latency Distribution (P50/P95/P99)
# Latency percentiles
histogram_quantile(0.50, rate(llm_latency_ms_bucket[5m]))
histogram_quantile(0.95, rate(llm_latency_ms_bucket[5m]))
histogram_quantile(0.99, rate(llm_latency_ms_bucket[5m]))
High P99 latency with normal P50 points to tail-end issues, often GPU memory pressure or KV cache thrashing in vLLM.
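Percentiles from a Prometheus histogram are estimates: the quantile is located in a bucket and then linearly interpolated between that bucket's bounds. The sketch below is a simplified re-implementation of that idea, not the Prometheus code itself; the function name estimate_quantile and the bucket counts are hypothetical.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, ending with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the overflow bucket: report the
                # highest finite bound, since nothing finer is known.
                return buckets[-2][0]
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-2][0]


# Illustrative latency buckets in milliseconds (cumulative counts)
latency_buckets = [(500, 50), (1000, 80), (2500, 95), (5000, 99), (math.inf, 100)]

p50 = estimate_quantile(0.50, latency_buckets)
p90 = estimate_quantile(0.90, latency_buckets)
p100 = estimate_quantile(1.0, latency_buckets)
```

The practical consequence is that a reported P99 can never be more precise than the bucket boundaries, so choose boundaries that bracket the latencies you alert on.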
Panel 4: Error Rate
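# LLM API errors (depends on how you record errors)
rate(llm_errors_total[5m])
# Rate-limited responses
rate(llm_rate_limited_total[5m])
These two queries assume your application also defines error and rate-limit counters; the metrics created in Step 2 do not include them.
Grafana Dashboard JSON Template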
Import this JSON into Grafana to get a pre-built dashboard:
{
  "dashboard": {
    "title": "LLM Monitoring Stack",
    "uid": "llm-monitoring",
    "panels": [
      {
        "title": "Token Usage (Input vs Output)",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
        "targets": [
          {
            "expr": "rate(llm_tokens_input_total[5m]) * 60",
            "legendFormat": "Input tokens/min"
          },
          {
            "expr": "rate(llm_tokens_output_total[5m]) * 60",
            "legendFormat": "Output tokens/min"
          }
        ]
      },
      {
        "title": "LLM Cost ($/hour)",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
        "targets": [
          {
            "expr": "sum(rate(llm_cost_usd_sum[5m])) * 3600",
            "legendFormat": "Estimated $/hour"
          }
        ]
      },
      {
        "title": "Latency Percentiles",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
        "targets": [
          {
            "expr": "histogram_quantile(0.50, rate(llm_latency_ms_bucket[5m]))",
            "legendFormat": "P50"
          },
          {
            "expr": "histogram_quantile(0.95, rate(llm_latency_ms_bucket[5m]))",
            "legendFormat": "P95"
          },
          {
            "expr": "histogram_quantile(0.99, rate(llm_latency_ms_bucket[5m]))",
            "legendFormat": "P99"
          }
        ]
      },
      {
        "title": "Input Tokens by Model",
        "type": "piechart",
        "gridPos": {"h": 8, "w": 6, "x": 12, "y": 8},
        "targets": [
          {
            "expr": "sum by (model) (rate(llm_tokens_input_total[5m]))",
            "legendFormat": "{{model}}"
          }
        ]
      }
    ]
  }
}
Save this as grafana-llm-dashboard.json and import it via Grafana UI → Dashboards → Import.
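Before importing a hand-edited dashboard, it can help to check that the panel layout is sane. Grafana lays panels out on a 24-column grid, so the sketch below flags panels that run past the right edge or overlap one another. The helper name layout_problems is hypothetical, and the embedded JSON carries only the title and gridPos fields from the template above.

```python
import json


def layout_problems(dashboard: dict, columns: int = 24) -> list[str]:
    """Return human-readable layout issues for the panels of a dashboard."""
    panels = dashboard["dashboard"]["panels"]
    problems = []
    for i, a in enumerate(panels):
        ga = a["gridPos"]
        if ga["x"] + ga["w"] > columns:
            problems.append(f"{a['title']}: exceeds grid width")
        for b in panels[i + 1:]:
            gb = b["gridPos"]
            # Two rectangles overlap only if they overlap on both axes
            x_overlap = ga["x"] < gb["x"] + gb["w"] and gb["x"] < ga["x"] + ga["w"]
            y_overlap = ga["y"] < gb["y"] + gb["h"] and gb["y"] < ga["y"] + ga["h"]
            if x_overlap and y_overlap:
                problems.append(f"{a['title']} overlaps {b['title']}")
    return problems


DASHBOARD_JSON = """
{"dashboard": {"panels": [
  {"title": "Token Usage", "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}},
  {"title": "Cost", "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}},
  {"title": "Latency", "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8}},
  {"title": "By Model", "gridPos": {"h": 8, "w": 6, "x": 12, "y": 8}}
]}}
"""

problems = layout_problems(json.loads(DASHBOARD_JSON))
```

An empty list means the four panels tile the grid without collisions; json.loads also fails early on a stray brace or trailing comma.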
11. Step 7 — Add Alerting Rules
Prometheus alerting rules catch problems before users do. Add to your alerts.yml:
groups:
  - name: llm_alerts
    rules:
      # Token budget breach
      - alert: LLMTokensOverBudget
        expr: increase(llm_tokens_input_total[1h]) > 1000000
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "High token usage detected"
          description: "LLM input tokens grew by {{ $value | humanize }} in the last hour."
      # Latency spike (the histogram is recorded in milliseconds)
      - alert: LLMLatencyHigh
        expr: histogram_quantile(0.99, rate(llm_latency_ms_bucket[5m])) > 5000
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "LLM P99 latency above 5s"
          description: "P99 latency is {{ $value | humanize }} ms, likely GPU queueing or model overload."
      # Cost overrun (summed across all models)
      - alert: LLMCostOverrun
        expr: sum(increase(llm_cost_usd_sum[1h])) > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "LLM cost rate exceeds $50/hour"
          description: "Current spend rate: ${{ $value | humanize }}/hour. Review prompt efficiency or model routing."
Route these alerts to PagerDuty, Slack, or any webhook via Alertmanager.
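The for: field in each rule delays firing until the condition has held continuously for that long, which filters out short spikes. The sketch below is a simplified model of that pending-to-firing behaviour, assuming a fixed evaluation interval; the function name alert_states is hypothetical and the real rule evaluator has more states and options.

```python
def alert_states(condition: list[bool], interval_s: int, for_s: int) -> list[str]:
    """Model the state of one alert rule across successive evaluations.

    condition holds the result of the rule expression at each evaluation.
    The alert is pending while the condition holds and fires once it has
    held for at least for_s seconds; any false evaluation resets it.
    """
    states = []
    active_since = None
    for i, holds in enumerate(condition):
        now = i * interval_s
        if not holds:
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


# One-minute evaluations against a rule with for: 3m; the dip resets the timer
timeline = alert_states(
    [True, True, False, True, True, True, True],
    interval_s=60,
    for_s=180,
)
```

In this run the two-minute spike at the start never fires, and the later sustained breach fires only at the seventh evaluation.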
12. Adding Traces: Grafana Tempo Integration
Metrics tell you that something is wrong. Traces tell you where. For LLM applications, traces are critical because a single user request can trigger multiple model calls, a vector search, a retrieval step, and a formatting pass, some of them running concurrently.
Enable trace context propagation in your LLM calls:
import openai
from opentelemetry import trace
from opentelemetry.trace import SpanKind
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator
propagator = TraceContextTextMapPropagator()
tracer = trace.get_tracer(__name__)
def call_llm_with_trace(prompt: str, context: dict):
    # Extract any incoming trace context (from an API request, for example)
    ctx = propagator.extract(carrier=context.get("headers", {}))
    with tracer.start_as_current_span(
        "llm.generation",
        context=ctx,
        kind=SpanKind.CLIENT,
        attributes={
            "gen_ai.system": "openai",
            "gen_ai.request.model": "gpt-4o",
            "gen_ai.request.max_tokens": 1024,
        },
    ) as span:
        # Legacy openai<1.0 interface, as in Step 2
        response = openai.ChatCompletion.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        span.set_attribute("gen_ai.response.id", response["id"])
        span.set_attribute("gen_ai.response.finish_reasons",
                           [response["choices"][0]["finish_reason"]])
        span.set_attribute("gen_ai.usage.input_tokens",
                           response["usage"]["prompt_tokens"])
        span.set_attribute("gen_ai.usage.output_tokens",
                           response["usage"]["completion_tokens"])
        return response
The Collector forwards these spans to Grafana Tempo. In Grafana, click any span to see the full request timeline, including how long the vector retrieval took compared with the actual model inference.
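Trace context travels between services in the W3C traceparent header, which packs a version, a trace ID, a parent span ID, and flags into one string. The propagator handles this for you; the sketch below only illustrates what it reads. The names TraceParent and parse_traceparent are hypothetical, the parser assumes a lowercase header key, and it accepts only the four-field layout of version 00.

```python
import re
from typing import NamedTuple, Optional

# version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(headers: dict) -> Optional[TraceParent]:
    """Parse a traceparent header, returning None if absent or malformed."""
    value = headers.get("traceparent", "")
    match = TRACEPARENT_RE.match(value.strip().lower())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid under the W3C Trace Context specification
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    # The lowest flag bit marks the trace as sampled
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
parsed = parse_traceparent(incoming)
```

When the header is missing or malformed, the tracing code simply starts a new trace, which is why a request that loses its headers at a proxy shows up in Tempo as a disconnected root span.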
13. Multi-Model Routing: A Key LLM Monitoring Pattern
A mature LLM stack rarely runs on a single model. Production systems route requests based on task complexity:
| Task | Model | Output price (per 1K tokens) |
|---|---|---|
| Simple classification | GPT-3.5-Turbo | $0.002 |
| Standard Q&A | GPT-4o | $0.015 |
| Complex reasoning | GPT-4 | $0.06 |
Track routing decisions with OTel attributes:
from opentelemetry import trace
tracer = trace.get_tracer(__name__)
def route_and_call_llm(task: str, prompt: str) -> dict:
    # Route to appropriate model
    if task == "classify":
        model = "gpt-3.5-turbo"
    elif task == "reason":
        model = "gpt-4"
    else:
        model = "gpt-4o"
    with tracer.start_as_current_span("llm.route") as span:
        span.set_attribute("llm.routed_model", model)
        span.set_attribute("llm.task_type", task)
        return call_llm(prompt, model=model)
These span attributes are searchable in Tempo. To filter your token and cost panels by task type as well, pass the task type as an extra metric attribute in call_llm (it has only a handful of values, so it is safe as a label). You can then see spend distribution by use case and identify opportunities to route more requests to cheaper models.
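To see what a routing change is worth, the per-token prices from calculate_cost can be applied to a sample of traffic grouped by task type. The sketch below does that arithmetic; the request tuples are made-up numbers and the helper names request_cost and spend_by_task are hypothetical.

```python
# (prompt $/token, completion $/token), the same example prices as calculate_cost
PRICING = {
    "gpt-4": (0.00003, 0.00006),
    "gpt-4o": (0.000005, 0.000015),
    "gpt-3.5-turbo": (0.0000015, 0.000002),
}


def request_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    prompt_price, completion_price = PRICING[model]
    return prompt_tokens * prompt_price + completion_tokens * completion_price


def spend_by_task(requests: list[tuple[str, str, int, int]]) -> dict[str, float]:
    """Sum cost per task type for (task, model, prompt_tokens, completion_tokens)."""
    totals: dict[str, float] = {}
    for task, model, prompt_tokens, completion_tokens in requests:
        cost = request_cost(model, prompt_tokens, completion_tokens)
        totals[task] = totals.get(task, 0.0) + cost
    return totals


# Hypothetical traffic sample: classification currently runs on gpt-4o
traffic = [
    ("classify", "gpt-4o", 400, 20),
    ("classify", "gpt-4o", 380, 25),
    ("reason", "gpt-4", 1500, 600),
]

current = spend_by_task(traffic)

# What the same traffic would cost with classification routed to gpt-3.5-turbo
rerouted = spend_by_task(
    [(t, "gpt-3.5-turbo" if t == "classify" else m, p, c) for t, m, p, c in traffic]
)
```

Comparing current["classify"] with rerouted["classify"] gives the saving for that task type, while the reasoning spend is unchanged.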
14. Scaling Considerations
For high-throughput LLM applications (100+ requests/second), the monitoring stack needs to keep up:
Collector scaling. Run the OTel Collector as a Kubernetes Deployment with HPA (Horizontal Pod Autoscaler) targeting CPU > 70%. For >1000 RPS, consider the OTel Collector's gateway mode with a load balancer.
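Prometheus resource planning. The SDK aggregates measurements in-process and the Collector exposes the aggregated series, so request rate alone does not create new time series: at 100 RPS, Prometheus still scrapes the same set of series every 15 seconds. What grows the series count is the number of distinct label combinations. Use rate() queries over 5-minute windows to keep Grafana responsive.
Cardinality management. Never add high-cardinality attributes (user IDs, request IDs) as metric labels. Use traces for per-request detail, metrics for aggregate health. A label with 10,000 possible values multiplies every series that carries it by up to 10,000, which is enough to exhaust memory on a modest Prometheus instance.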
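The effect of a label on series count is multiplicative, which a quick estimate makes concrete. The sketch below is a rough upper bound, assuming every label combination actually occurs and a histogram with 16 buckets including +Inf; the function name series_count and the bucket count are assumptions for illustration.

```python
from math import prod


def series_count(label_values: dict[str, int], buckets: int = 0) -> int:
    """Upper-bound estimate of Prometheus series for one metric.

    label_values maps each label name to its number of distinct values.
    buckets is the number of histogram buckets including +Inf, or 0 for
    a counter or gauge. A histogram adds _sum and _count per combination.
    """
    combinations = prod(label_values.values())  # 1 when there are no labels
    per_combination = buckets + 2 if buckets else 1
    return combinations * per_combination


# Latency histogram labelled only by model (three models)
by_model = series_count({"model": 3}, buckets=16)

# The same histogram with a user_id label added
by_model_and_user = series_count({"model": 3, "user_id": 10_000}, buckets=16)
```

Under these assumptions the histogram goes from 54 series to 540,000 once user_id becomes a label, for a single metric.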
15. What to Monitor Next
This tutorial covers the foundation. From here, expand your observability stack in three directions:
Hallucination detection. Add an evaluation layer that cross-checks model outputs against ground truth at scale. Arize Phoenix (open source) and LangSmith both integrate with OTel for continuous evaluation.
vLLM-specific metrics. If you are running vLLM as your inference server, scrape the metrics it exports under the vllm: prefix (GPU utilization, KV cache hit rate, batch scheduler latency) from the Prometheus /metrics endpoint on the API server port (8000 by default).
Fine-tuning cost tracking. Monitor dataset tokenization costs and training job resource usage, which can dwarf inference costs at scale.
16. Conclusion
OpenTelemetry, Prometheus, and Grafana give you enterprise-grade LLM observability without enterprise-grade cost or vendor lock-in. The stack handles token counting, cost tracking, latency analysis, error monitoring, and trace-level debugging — everything you need to run LLM applications with confidence.
Start with auto-instrumentation, add custom metrics for token and cost tracking, deploy the OTel Collector, and import the Grafana dashboard. You will have a working stack in an afternoon.
The signals you capture today become the alerts that prevent incidents tomorrow.