Why LLM Monitoring Is Different from Traditional APM
Application Performance Monitoring (APM) tools were built for deterministic systems. A REST API call comes in, a database query runs, a response goes out. Latency is measurable, errors are countable, and the call graph is finite.
LLM applications break these tools. A single LLM call runs inference over billions of parameters, can consume 10,000 tokens of context, and may take 30 seconds to complete. Token usage varies per request. Model providers bill at per-token rates. Prompt templates change independently from application code. And the quality of outputs - not just their speed - determines whether your product works.
Standard APM gives you: latency percentiles, error rates, CPU/memory utilization. LLM monitoring needs: token consumption by model, prompt version tracking, context window saturation, output quality metrics, cost per completion, and semantic drift detection.
You need a monitoring stack purpose-built for language models. This guide builds it from scratch.
Architecture Overview
The stack has four layers:
- Instrumentation: OpenTelemetry SDK in your application - captures traces, metrics, and logs automatically
- Collection: OTel Collector - batches and forwards telemetry to downstream systems
- Storage: Prometheus for metrics, Tempo for traces - both open source, both integrate with Grafana
- Visualization: Grafana dashboards for token costs, latency, error rates, and quality signals
┌─────────────────────────────────────────────────────┐
│         Your LLM Application (Python/Node)          │
│    OpenTelemetry SDK auto-instruments LLM calls     │
└────────────────┬────────────────────────────────────┘
                 │ OTLP (gRPC/HTTP)
┌────────────────▼────────────────────────────────────┐
│              OpenTelemetry Collector                │
│  Receives traces + metrics, processes and exports   │
└────────────────┬────────────────────────────────────┘
                 │
        ┌────────┴────────┐
        │                 │
        ▼                 ▼
   ┌──────────┐      ┌──────────┐
   │Prometheus│      │  Tempo   │
   │(metrics) │      │ (traces) │
   └────┬─────┘      └────┬─────┘
        │                 │
        └────────┬────────┘
                 ▼
            ┌──────────┐
            │ Grafana  │
            │Dashboards│
            └──────────┘
This architecture keeps all telemetry data inside your infrastructure. No third-party SDKs, no data leaving your network.
Prerequisites
- Python 3.9+ (we'll use the OpenTelemetry Python SDK)
- A running LLM application (OpenAI API, Anthropic, or local vLLM)
- Docker and Docker Compose (for Prometheus + Grafana)
- Optional: Kubernetes (all concepts apply to K8s deployments)
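For local development, the Prometheus + Tempo + Grafana half of the stack can come up with a single Docker Compose file. This is a minimal sketch - the image tags, ports, and config file paths are assumptions to adapt to your environment:

```yaml
# docker-compose.yml - minimal local stack (adjust versions and paths to taste)
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel/config.yaml"]
    volumes:
      - ./otel-config.yaml:/etc/otel/config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC in
      - "8889:8889"   # Prometheus exporter out
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo/tempo.yaml
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
```

The config files referenced here are the ones built in the steps below.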
Step 1 - Install OpenTelemetry SDK
The OpenTelemetry Python ecosystem ships auto-instrumentation packages for the openai, anthropic, and litellm libraries with zero code changes. You install the packages, set environment variables, and your LLM calls are instrumented automatically.
# Core OTel SDK
pip install opentelemetry-api \
opentelemetry-sdk \
opentelemetry-exporter-otlp
# Auto-instrumentation for Python apps
pip install opentelemetry-instrumentation-openai \
opentelemetry-instrumentation-anthropic
# Resource semantic conventions (for adding LLM-specific attributes)
pip install opentelemetry-semantic-conventions
Auto-instrumentation runs as a wrapper around your existing application. You do not modify your code:
opentelemetry-instrument \
--service_name "my-llm-app" \
--exporter_otlp_endpoint "http://localhost:4317" \
python your_app.py
Step 2 - Define LLM-Specific Metrics
Auto-instrumentation captures spans, but for a complete LLM monitoring setup you want custom metrics that make sense for language models. Add these to your application:
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.resources import Resource

# Create a resource with your service identity and register a MeterProvider
# (the export wiring - an OTLP metric reader - is shown in the Scaling section)
resource = Resource.create({"service.name": "llm-chatbot"})
metrics.set_meter_provider(MeterProvider(resource=resource))

# Set up the meter (creates metrics)
meter = metrics.get_meter("llm_monitor")

# Token counters
llm_tokens_in = meter.create_counter(
    "llm.tokens.input",
    description="Total input tokens consumed",
    unit="1"
)
llm_tokens_out = meter.create_counter(
    "llm.tokens.output",
    description="Total output tokens generated",
    unit="1"
)

# Cost histogram (in USD)
llm_cost = meter.create_histogram(
    "llm.cost.usd",
    description="LLM inference cost in USD",
    unit="USD"
)

# Latency histogram
llm_latency = meter.create_histogram(
    "llm.latency.ms",
    description="LLM inference latency in milliseconds",
    unit="ms"
)

# Quality signal: hallucination score (0-1)
# Note: create_gauge requires opentelemetry-sdk >= 1.23
llm_quality = meter.create_gauge(
    "llm.hallucination.score",
    description="Hallucination probability score (0=clean, 1=likely hallucination)",
    unit="1"
)
Record metrics inside your inference call:
import time

from openai import OpenAI

client = OpenAI()

def call_llm(prompt: str, model: str = "gpt-4"):
    start = time.monotonic()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    latency_ms = (time.monotonic() - start) * 1000
    usage = response.usage
    cost = calculate_cost(usage.prompt_tokens, usage.completion_tokens, model)
    # Record metrics
    llm_tokens_in.add(usage.prompt_tokens, {"model": model})
    llm_tokens_out.add(usage.completion_tokens, {"model": model})
    llm_cost.record(cost, {"model": model})
    llm_latency.record(latency_ms, {"model": model})
    return response

def calculate_cost(prompt_tokens: int, completion_tokens: int, model: str) -> float:
    # Illustrative per-token rates - always check your provider's current pricing
    pricing = {
        "gpt-4": (0.00003, 0.00006),  # ($/token) prompt, completion
        "gpt-4o": (0.000005, 0.000015),
        "gpt-3.5-turbo": (0.0000015, 0.000002),
    }
    p, c = pricing.get(model, (0.0, 0.0))
    return (prompt_tokens * p) + (completion_tokens * c)
Step 3 - Add Semantic Conventions for LLM Spans
OpenTelemetry's semantic conventions standardize span and metric attribute names. Use the LLM conventions to add model and request context:
from contextlib import contextmanager

from opentelemetry import trace
from opentelemetry.trace import SpanKind, Status, StatusCode

@contextmanager
def wrap_llm_span(span_name: str, model: str, prompt_tokens: int):
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span(
        span_name,
        kind=SpanKind.CLIENT,
        attributes={
            "gen_ai.system": "openai",
            "gen_ai.request.model": model,
            "gen_ai.request.max_tokens": 2048,
            "llm.prompt_tokens": prompt_tokens,
        }
    ) as span:
        yield span
        # After the call, the caller updates the yielded span with
        # response attributes, for example:
        #   span.set_attribute("gen_ai.response.id", response.id)
        #   span.set_attribute("gen_ai.response.model", response.model)
        #   span.set_attribute("gen_ai.choice.finish_reason", response.choices[0].finish_reason)
        span.set_status(Status(StatusCode.OK))
The key gen_ai.* attributes follow the OTel semantic conventions for Generative AI. They enable Grafana dashboards to filter and group by model, system, and operation - critical for multi-model deployments.
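Once these attributes land in Tempo, they are queryable with TraceQL. For example, to surface slow GPT-4o generations (a sketch - exact behavior depends on your Tempo version and attribute indexing):

```
{ span.gen_ai.request.model = "gpt-4o" && duration > 5s }
```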
Step 4 - Deploy the OpenTelemetry Collector
The OTel Collector receives telemetry from your application and exports it to your backends. Deploy it as a Docker sidecar or, with the OpenTelemetry Operator installed, as a Kubernetes DaemonSet via the OpenTelemetryCollector custom resource:
# otel-collector.yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: llm-otel-collector
spec:
  mode: daemonset
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        timeout: 1s
        send_batch_size: 1024
      memory_limiter:
        check_interval: 1s
        limit_mib: 512
    exporters:
      prometheus:
        endpoint: "0.0.0.0:8889"
        namespace: "llm"
        const_labels:
          service: llm-monitor
      otlp/tempo:
        endpoint: tempo:4317
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp/tempo]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [prometheus]
The Collector exposes a Prometheus metrics endpoint (8889) that Prometheus scrapes. This decouples your application from the metrics backend - if Prometheus is down, your app keeps running and the Collector keeps serving the latest values for the next scrape.
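The endpoint serves the Prometheus text exposition format, so a quick way to sanity-check the pipeline is to fetch `http://localhost:8889/metrics` and confirm your llm_* series appear. A naive parser sketch (metric names below are illustrative; it ignores timestamps and label values containing spaces):

```python
# Parse simple lines of Prometheus text exposition format into {series: value}.
def parse_exposition(text: str) -> dict:
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        # The value is everything after the last space on the line
        name_and_labels, _, value = line.rpartition(" ")
        samples[name_and_labels] = float(value)
    return samples

# Example scrape output, shaped like what the Collector's exporter emits
scrape = """\
# HELP llm_tokens_input_total Total input tokens consumed
# TYPE llm_tokens_input_total counter
llm_tokens_input_total{model="gpt-4o"} 1523
llm_tokens_output_total{model="gpt-4o"} 408
"""
samples = parse_exposition(scrape)
```

In practice you would fetch the text from a running Collector, e.g. with `urllib.request.urlopen("http://localhost:8889/metrics")`.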
Step 5 - Configure Prometheus to Scrape OTel Metrics
Add a scrape config to Prometheus:
# prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8889']
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        replacement: 'llm-app'
If you are on Kubernetes, use the Prometheus Operator with a ServiceMonitor:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: otel-collector
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: opentelemetry
  endpoints:
    - port: prometheus
      path: /metrics
Step 6 - Build the Grafana Dashboard
Now visualize the data. Create a Grafana dashboard with four panels:
Panel 1: Token Usage Over Time
# Input token rate (per second, averaged over 5m)
rate(llm_tokens_input_total[5m])
# Output token rate (per second, averaged over 5m)
rate(llm_tokens_output_total[5m])
Group by model to see the per-model breakdown. A spike in input tokens is often the first signal of a prompt engineering experiment going wrong.
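The per-model breakdown is a sum by (model) over the same rate - this assumes model is recorded as a metric label, as in the Step 2 code:

```
# Input token rate per model
sum by (model) (rate(llm_tokens_input_total[5m]))
```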
Panel 2: LLM Cost Accumulation
# Cumulative cost in USD
sum(increase(llm_cost_usd_sum[1h]))
# Cost per model
sum by (model) (increase(llm_cost_usd_sum[1h]))
Set a budget alert: increase(llm_cost_usd_sum[24h]) > 100 triggers when daily spend crosses $100. (A bare sum(llm_cost_usd_sum) compares cumulative spend since process start, so use increase() over a daily window.)
Panel 3: Latency Distribution (P50/P95/P99)
# Latency percentiles
histogram_quantile(0.50, rate(llm_latency_ms_bucket[5m]))
histogram_quantile(0.95, rate(llm_latency_ms_bucket[5m]))
histogram_quantile(0.99, rate(llm_latency_ms_bucket[5m])) High P99 latency with normal P50 tells you about tail-end issues - often GPU memory pressure or KV cache thrashing in vLLM.
Panel 4: Error Rate
# LLM API errors (depends on how you record errors)
rate(llm_errors_total[5m])
# Rate-limited responses
rate(llm_rate_limited_total[5m])
Grafana Dashboard JSON Template
Import this JSON into Grafana to get a pre-built dashboard:
{
  "dashboard": {
    "title": "LLM Monitoring Stack",
    "uid": "llm-monitoring",
    "panels": [
      {
        "title": "Token Usage (Input vs Output)",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
        "targets": [
          {
            "expr": "rate(llm_tokens_input_total[5m])",
            "legendFormat": "Input tokens/s"
          },
          {
            "expr": "rate(llm_tokens_output_total[5m])",
            "legendFormat": "Output tokens/s"
          }
        ]
      },
      {
        "title": "LLM Cost ($/hour)",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
        "targets": [
          {
            "expr": "sum(rate(llm_cost_usd_sum[5m])) * 3600",
            "legendFormat": "Estimated $/hour"
          }
        ]
      },
      {
        "title": "Latency Percentiles",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
        "targets": [
          {
            "expr": "histogram_quantile(0.50, rate(llm_latency_ms_bucket[5m]))",
            "legendFormat": "P50"
          },
          {
            "expr": "histogram_quantile(0.95, rate(llm_latency_ms_bucket[5m]))",
            "legendFormat": "P95"
          },
          {
            "expr": "histogram_quantile(0.99, rate(llm_latency_ms_bucket[5m]))",
            "legendFormat": "P99"
          }
        ]
      },
      {
        "title": "Requests by Model",
        "type": "piechart",
        "gridPos": {"h": 8, "w": 6, "x": 12, "y": 8},
        "targets": [
          {
            "expr": "sum by (model) (rate(llm_tokens_input_total[5m]))",
            "legendFormat": "{{model}}"
          }
        ]
      }
    ]
  }
}
Save this as grafana-llm-dashboard.json and import it via Grafana UI → Dashboards → Import.
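Before importing, a small pre-flight check catches the most common editing mistakes - panels without targets or targets without a PromQL expr. This is a sketch, not part of Grafana's tooling:

```python
import json

# Validate a dashboard JSON string: every panel should have at least one
# target, and every target should carry a PromQL "expr".
def validate_dashboard(raw: str) -> list:
    problems = []
    dashboard = json.loads(raw)["dashboard"]
    for panel in dashboard.get("panels", []):
        title = panel.get("title", "<untitled>")
        if not panel.get("targets"):
            problems.append(f"panel '{title}' has no targets")
        for target in panel.get("targets", []):
            if not target.get("expr"):
                problems.append(f"panel '{title}' has a target without expr")
    return problems

# Tiny example payload: one valid panel, one broken one
raw = ('{"dashboard": {"panels": ['
       '{"title": "Tokens", "targets": [{"expr": "rate(llm_tokens_input_total[5m])"}]},'
       '{"title": "Broken", "targets": []}]}}')
problems = validate_dashboard(raw)
```

Run it against grafana-llm-dashboard.json with `validate_dashboard(open("grafana-llm-dashboard.json").read())`.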
Step 7 - Add Alerting Rules
Prometheus alerting rules catch problems before users do. Add to your alerts.yml:
groups:
  - name: llm_alerts
    rules:
      # Token budget breach
      - alert: LLMTokensOverBudget
        expr: increase(llm_tokens_input_total[1h]) > 1000000
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "High token usage detected"
          description: "LLM input tokens grew by {{ $value | humanize }} in the last hour."
      # Latency spike
      - alert: LLMLatencyHigh
        expr: histogram_quantile(0.99, rate(llm_latency_ms_bucket[5m])) > 5000
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "LLM P99 latency above 5s"
          description: "P99 latency is {{ $value | humanize }}ms - likely GPU queueing or model overload."
      # Cost overrun
      - alert: LLMCostOverrun
        expr: increase(llm_cost_usd_sum[1h]) > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "LLM cost rate exceeds $50/hour"
          description: "Current spend rate: ${{ $value | humanize }}/hour. Review prompt efficiency or model routing."
Route these alerts to PagerDuty, Slack, or any webhook via Alertmanager.
Adding Traces: OpenTelemetry Tempo Integration
Metrics tell you that something is wrong. Traces tell you where. For LLM applications, traces are critical because a single user request can trigger multiple model calls, a vector search, a retrieval step, and a formatting pass - all running concurrently.
Enable trace context propagation in your LLM calls:
from opentelemetry import trace
from opentelemetry.trace import SpanKind
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

from openai import OpenAI

client = OpenAI()
propagator = TraceContextTextMapPropagator()

def call_llm_with_trace(prompt: str, context: dict):
    # Extract any incoming trace context (from an API request, for example)
    ctx = propagator.extract(carrier=context.get("headers", {}))
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span(
        "llm.generation",
        context=ctx,
        kind=SpanKind.CLIENT,
        attributes={
            "gen_ai.system": "openai",
            "gen_ai.request.model": "gpt-4o",
            "gen_ai.request.max_tokens": 1024,
        }
    ) as span:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}]
        )
        span.set_attribute("gen_ai.response.id", response.id)
        span.set_attribute("gen_ai.response.finish_reason",
                           response.choices[0].finish_reason)
        span.set_attribute("gen_ai.usage.prompt_tokens",
                           response.usage.prompt_tokens)
        span.set_attribute("gen_ai.usage.completion_tokens",
                           response.usage.completion_tokens)
        return response
The trace context flows into Grafana Tempo. In Grafana, click any span to see the full request timeline - including how long the vector retrieval took vs. the actual model inference.
Multi-Model Routing: A Key LLM Monitoring Pattern
A mature LLM stack rarely runs on a single model. Production systems route requests based on task complexity:
| Task | Model | Cost |
|---|---|---|
| Simple classification | GPT-3.5-Turbo | $0.001/1K tokens |
| Standard Q&A | GPT-4o | $0.015/1K tokens |
| Complex reasoning | GPT-4 | $0.06/1K tokens |
Track routing decisions with OTel attributes:
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def route_and_call_llm(task: str, prompt: str):
    # Route to the appropriate model
    if task == "classify":
        model = "gpt-3.5-turbo"
    elif task == "reason":
        model = "gpt-4"
    else:
        model = "gpt-4o"
    with tracer.start_as_current_span("llm.route") as span:
        span.set_attribute("llm.routed_model", model)
        span.set_attribute("llm.task_type", task)
        return call_llm(prompt, model=model)
In Grafana, filter your traces by llm.task_type to understand spend distribution by use case - and identify opportunities to route more requests to cheaper models. (To slice the token and cost metric panels the same way, also record the task type as a metric label alongside model.)
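The if/elif ladder above can also be expressed as a routing table, which keeps the policy in one place and easy to extend. The task names and models are just the ones used in this tutorial:

```python
# Routing policy as data: task type -> model.
# Unknown task types fall back to the default model.
ROUTING_TABLE = {
    "classify": "gpt-3.5-turbo",
    "reason": "gpt-4",
}
DEFAULT_MODEL = "gpt-4o"

def pick_model(task: str) -> str:
    return ROUTING_TABLE.get(task, DEFAULT_MODEL)
```

Changing routing policy then becomes a data change rather than a code change, which is easier to review and to correlate with cost dashboards.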
Scaling Considerations
For high-throughput LLM applications (100+ requests/second), the monitoring stack needs to keep up:
Collector scaling. Run the OTel Collector as a Kubernetes Deployment with HPA (Horizontal Pod Autoscaler) targeting CPU > 70%. For >1000 RPS, consider the OTel Collector's gateway mode with a load balancer.
Prometheus resource planning. A busy LLM service updates metrics on every request, but the number of time series Prometheus stores is set by label cardinality, not request volume - at 100 RPS you get more samples per series, not more series. Use rate() queries over 5-minute windows to keep Grafana responsive.
Cardinality management. Never add high-cardinality attributes (user IDs, request IDs) as metric labels. Use traces for per-request detail, metrics for aggregate health. A metric label with 10,000 possible values will kill Prometheus.
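One cheap guardrail is an allow-list applied to every attribute dict before it reaches a metric instrument. The label set below mirrors this tutorial's metrics; adjust it to yours:

```python
# Only these keys may become metric labels. Everything else (user IDs,
# request IDs, raw prompts) is dropped to keep series cardinality bounded.
ALLOWED_METRIC_LABELS = {"model", "task_type", "status"}

def safe_labels(attributes: dict) -> dict:
    return {k: v for k, v in attributes.items() if k in ALLOWED_METRIC_LABELS}
```

Use it at every record site, e.g. `llm_tokens_in.add(n, safe_labels(attrs))`; the dropped keys still belong on spans, where per-request detail is cheap.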
OTLP batching. Configure the OTel SDK to batch spans and metrics before sending:
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.resources import Resource

resource = Resource.create({"service.name": "llm-chatbot"})
exporter = OTLPMetricExporter(insecure=True)
reader = PeriodicExportingMetricReader(exporter, export_interval_millis=5000)
meter_provider = MeterProvider(resource=resource, metric_readers=[reader])
What to Monitor Next
This tutorial covers the foundation. From here, expand your observability stack in three directions:
Hallucination detection. Add an evaluation layer that cross-checks model outputs against ground truth at scale. Arize Phoenix (open source) and LangSmith both integrate with OTel for continuous evaluation.
vLLM-specific metrics. If you are running vLLM as your inference server, scrape the metrics it exposes under the vllm: prefix (GPU cache utilization, KV cache usage, batch scheduler stats) via the Prometheus endpoint that vLLM serves on port 8000 by default.
Fine-tuning cost tracking. Monitor dataset tokenization costs and training job resource usage - they dwarf inference costs at scale.
Conclusion
OpenTelemetry, Prometheus, and Grafana give you enterprise-grade LLM observability without enterprise-grade cost or vendor lock-in. The stack handles token counting, cost tracking, latency analysis, error monitoring, and trace-level debugging - everything you need to run LLM applications with confidence.
Start with auto-instrumentation, add custom metrics for token and cost tracking, deploy the OTel Collector, and import the Grafana dashboard. You will have a working stack in an afternoon.
The signals you capture today become the alerts that prevent incidents tomorrow.