Build Your First LLM Monitoring Stack: OTel + Prometheus

1. Introduction

Every production LLM application needs observability. Not optional, not "nice to have" — essential. Token costs compound silently, latency spikes destroy user experience, and model hallucinations propagate before anyone catches them.

The standard approach is to reach for a commercial observability platform. And that works — until you hit per-seat pricing at scale, vendor lock-in on your telemetry data, or compliance requirements that keep your metrics inside your own infrastructure.

This tutorial shows you how to build a complete, open-source LLM monitoring stack using OpenTelemetry for instrumentation, Prometheus for metrics collection, Grafana Tempo for traces, and Grafana for visualization. It carries no license fees, keeps all data in your own environment, and connects directly to the broader cloud-native ecosystem.

By the end you will have: auto-instrumented LLM calls with token counting, latency histograms, cost calculation, error tracking, and a Grafana dashboard that shows your LLM health at a glance.

2. Why OpenTelemetry for LLM Monitoring?

OpenTelemetry (OTel) has become the de facto standard for observability instrumentation in cloud-native applications. The project combines tracing, metrics, and logging under a single vendor-neutral SDK with auto-instrumentation agents that require zero code changes for common frameworks.

For LLM applications, OTel solves three problems that make native LLM API monitoring painful:

Token opacity. The LLM API reports token usage for each response, but it does not aggregate those numbers, convert them into per-request cost, or show whether that cost is trending up. OTel's counters and histograms make this visible.

Cross-service traces. A real LLM application does more than call the model. It retrieves context from a vector database, runs retrieval pipelines, applies guardrails, and formats responses. OTel traces let you see the full request lifecycle.

Ecosystem integration. Prometheus can scrape OTel metrics through the Collector's Prometheus exporter. Grafana queries Prometheus and Tempo through its built-in data sources. The OTel Collector can export to Kafka, object storage, and most major observability backends. You are not locked into any particular vendor.

3. Architecture Overview

The stack has four layers:

┌─────────────────────────────────────────────────────┐
│  Your LLM Application (Python/Node)                 │
│  OpenTelemetry SDK auto-instruments LLM calls       │
└────────────────┬────────────────────────────────────┘
                 │ OTLP (gRPC/HTTP)
┌────────────────▼────────────────────────────────────┐
│  OpenTelemetry Collector                            │
│  Receives traces + metrics, processes and exports   │
└────────────────┬────────────────────────────────────┘
                 │
    ┌────────────┴────────────┐
    │                         │
    ▼                         ▼
┌──────────┐            ┌──────────┐
│Prometheus│            │  Tempo   │
│(metrics) │            │ (traces) │
└────┬─────┘            └────┬─────┘
     │                       │
     └───────────┬───────────┘
                 ▼
          ┌──────────┐
          │ Grafana  │
          │Dashboards│
          └──────────┘

This architecture keeps all telemetry data inside your infrastructure. No proprietary agents, no telemetry leaving your network.

4. Prerequisites

You need:

Python 3.9+ (we'll use the OpenTelemetry Python SDK)
A running LLM application (OpenAI API, Anthropic, or local vLLM)
Docker and Docker Compose (for Prometheus + Grafana)
Optional: Kubernetes (all concepts apply to K8s deployments)

5. Step 1 — Install OpenTelemetry SDK

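The OpenTelemetry Python SDK itself does not instrument LLM clients; separate instrumentation packages do. Community packages exist for the openai, anthropic, and litellm libraries and need no code changes. You install the packages, set a few options, and your LLM calls are instrumented automatically. Check each package's documentation for the client-library versions it supports.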

# Core OTel SDK
pip install opentelemetry-api \
  opentelemetry-sdk \
  opentelemetry-exporter-otlp

# Auto-instrumentation for Python apps
pip install opentelemetry-instrumentation-openai \
  opentelemetry-instrumentation-anthropic

# Resource semantic conventions (for adding LLM-specific attributes)
pip install opentelemetry-semantic-conventions

Auto-instrumentation runs as a wrapper around your existing application. You do not modify your code:

opentelemetry-instrument \
  --service_name "my-llm-app" \
  --exporter_otlp_endpoint "http://localhost:4317" \
  python your_app.py
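
The same settings can also be supplied through the standard OTEL_* environment variables, which is often easier in containers. The following is a minimal sketch, assuming the opentelemetry-instrument command is on your PATH; the helper names otel_env and launch are hypothetical and only illustrate the idea:

```python
import os
import subprocess


def otel_env(service_name: str, endpoint: str) -> dict:
    """Return a copy of the current environment with OTel settings added."""
    env = dict(os.environ)
    env["OTEL_SERVICE_NAME"] = service_name
    env["OTEL_EXPORTER_OTLP_ENDPOINT"] = endpoint
    # Export both signals over OTLP so traces and metrics reach the Collector
    env["OTEL_TRACES_EXPORTER"] = "otlp"
    env["OTEL_METRICS_EXPORTER"] = "otlp"
    return env


def launch(script: str, env: dict) -> int:
    """Run the script under the auto-instrumentation wrapper."""
    return subprocess.call(["opentelemetry-instrument", "python", script], env=env)
```

Calling launch("your_app.py", otel_env("my-llm-app", "http://localhost:4317")) should behave like the command above, with the configuration kept out of the command line.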

6. Step 2 — Define LLM-Specific Metrics

Auto-instrumentation captures spans, but for a complete LLM monitoring setup you want custom metrics that make sense for language models. Add these to your application:

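from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource

# Create a resource with your service identity
resource = Resource.create({"service.name": "llm-chatbot"})

# Register a meter provider that exports to the Collector over OTLP.
# Skip these two lines if you launch with opentelemetry-instrument,
# which normally configures a provider for you.
reader = PeriodicExportingMetricReader(OTLPMetricExporter(endpoint="http://localhost:4317"))
metrics.set_meter_provider(MeterProvider(resource=resource, metric_readers=[reader]))

# Set up the meter (creates metrics)
meter = metrics.get_meter("llm_monitor")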

# Token counters
llm_tokens_in = meter.create_counter(
    "llm.tokens.input",
    description="Total input tokens consumed",
    unit="1"
)

llm_tokens_out = meter.create_counter(
    "llm.tokens.output",
    description="Total output tokens generated",
    unit="1"
)

# Cost histogram (in USD)
llm_cost = meter.create_histogram(
    "llm.cost.usd",
    description="LLM inference cost in USD",
    unit="USD"
)

# Latency histogram
llm_latency = meter.create_histogram(
    "llm.latency.ms",
    description="LLM inference latency in milliseconds",
    unit="ms"
)

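# Quality signal: hallucination score (0-1)
# Needs an opentelemetry-api release that includes synchronous gauges.
# Record values from your evaluation step with llm_quality.set(score, {"model": model}).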
llm_quality = meter.create_gauge(
    "llm.hallucination.score",
    description="Hallucination probability score (0=clean, 1=likely hallucination)",
    unit="1"
)

Record metrics inside your inference call:

import time

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def call_llm(prompt: str, model: str = "gpt-4"):
    start = time.monotonic()
    # openai>=1.0 client; the module-level openai.ChatCompletion API was removed
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    latency_ms = (time.monotonic() - start) * 1000

    usage = response.usage
    cost = calculate_cost(usage.prompt_tokens, usage.completion_tokens, model)

    # Record metrics, labeled by model only to keep cardinality low
    llm_tokens_in.add(usage.prompt_tokens, {"model": model})
    llm_tokens_out.add(usage.completion_tokens, {"model": model})
    llm_cost.record(cost, {"model": model})
    llm_latency.record(latency_ms, {"model": model})

    return response

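def calculate_cost(prompt_tokens: int, completion_tokens: int, model: str) -> float:
    # Illustrative list prices; confirm against your provider's current pricing
    pricing = {
        "gpt-4": (0.00003, 0.00006),   # ($/token) prompt, completion
        "gpt-4o": (0.000005, 0.000015),
        "gpt-3.5-turbo": (0.0000015, 0.000002),
    }
    # Unknown models fall back to zero cost, so add new models here
    # or their spend will be silently missing from the dashboard
    p, c = pricing.get(model, (0.0, 0.0))
    return (prompt_tokens * p) + (completion_tokens * c)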
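
To make the arithmetic concrete, here is a small worked example. It is a sketch that reuses the illustrative per-token prices from calculate_cost; the function name cost_or_none is hypothetical, and it returns None for unpriced models instead of zero so that gaps in the price table are easy to spot:

```python
from typing import Optional

# Illustrative USD-per-token prices (prompt, completion), copied from
# calculate_cost above; treat them as placeholders, not current list prices
PRICING = {
    "gpt-4": (0.00003, 0.00006),
    "gpt-4o": (0.000005, 0.000015),
    "gpt-3.5-turbo": (0.0000015, 0.000002),
}


def cost_or_none(prompt_tokens: int, completion_tokens: int, model: str) -> Optional[float]:
    """Return the request cost in USD, or None when the model has no price entry."""
    if model not in PRICING:
        return None
    prompt_price, completion_price = PRICING[model]
    return prompt_tokens * prompt_price + completion_tokens * completion_price


# 1,200 prompt tokens and 300 completion tokens on gpt-4o:
#   1200 * 0.000005 + 300 * 0.000015 = 0.006 + 0.0045 = 0.0105 USD
example = cost_or_none(1200, 300, "gpt-4o")
print(f"{example:.4f}")                          # 0.0105
print(cost_or_none(1200, 300, "unknown-model"))  # None
```

At these prices the completion tokens account for roughly 43% of the cost despite being only 20% of the tokens, which is why input and output tokens are tracked as separate counters.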

7. Step 3 — Add Semantic Conventions for LLM Spans

OpenTelemetry's semantic conventions standardize span and metric attribute names. Use the LLM conventions to add model and request context:

from contextlib import contextmanager

from opentelemetry import trace
from opentelemetry.trace import SpanKind, Status, StatusCode

tracer = trace.get_tracer(__name__)

@contextmanager
def wrap_llm_span(span_name: str, model: str, prompt_tokens: int):
    # Open a CLIENT span carrying the request-side gen_ai.* attributes
    with tracer.start_as_current_span(
        span_name,
        kind=SpanKind.CLIENT,
        attributes={
            "gen_ai.system": "openai",
            "gen_ai.request.model": model,
            "gen_ai.request.max_tokens": 2048,
            "gen_ai.usage.input_tokens": prompt_tokens,
        },
    ) as span:
        # The caller makes the LLM call inside the `with` block; exceptions
        # raised there are recorded on the span automatically
        yield span

def record_response(span, response) -> None:
    # After the call, update the span with response attributes
    span.set_attribute("gen_ai.response.id", response.id)
    span.set_attribute("gen_ai.response.model", response.model)
    span.set_attribute(
        "gen_ai.response.finish_reasons",
        [choice.finish_reason for choice in response.choices],
    )
    span.set_status(Status(StatusCode.OK))

The key gen_ai.* attributes follow the OTel semantic conventions for Generative AI. They enable Grafana dashboards to filter and group by model, system, and operation — critical for multi-model deployments.
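
One way to keep attribute names consistent across call sites is to build them in a single helper. The sketch below assumes a chat-completion-style response that exposes id, model, choices, and usage as attributes, and it uses the gen_ai.* names from the GenAI semantic conventions, which are still evolving and may be renamed. The helper name and the fake response are hypothetical:

```python
from types import SimpleNamespace


def gen_ai_response_attributes(response) -> dict:
    """Map a chat-completion-style response to gen_ai.* span attributes."""
    return {
        "gen_ai.response.id": response.id,
        "gen_ai.response.model": response.model,
        # The convention records one finish reason per returned choice
        "gen_ai.response.finish_reasons": [c.finish_reason for c in response.choices],
        "gen_ai.usage.input_tokens": response.usage.prompt_tokens,
        "gen_ai.usage.output_tokens": response.usage.completion_tokens,
    }


# Hypothetical response object standing in for a real API reply
fake = SimpleNamespace(
    id="chatcmpl-123",
    model="gpt-4o",
    choices=[SimpleNamespace(finish_reason="stop")],
    usage=SimpleNamespace(prompt_tokens=42, completion_tokens=7),
)
attrs = gen_ai_response_attributes(fake)
print(attrs["gen_ai.response.finish_reasons"])  # ['stop']
```

Inside a span, the resulting dictionary can be applied in one call with span.set_attributes(attrs).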

8. Step 4 — Deploy the OpenTelemetry Collector

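The OTel Collector receives telemetry from your application and exports it to your backends. The manifest below is a custom resource for the OpenTelemetry Operator on Kubernetes, which runs the Collector as a DaemonSet. With Docker Compose, save the contents of the config block as a standalone Collector configuration file and mount it into the Collector container: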

# otel-collector.yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: llm-otel-collector
spec:
  mode: daemonset
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318

    processors:
      batch:
        timeout: 1s
        send_batch_size: 1024

      memory_limiter:
        check_interval: 1s
        limit_mib: 512

    exporters:
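      prometheus:
        endpoint: "0.0.0.0:8889"
        # No namespace here: the metric names already start with "llm.",
        # and a namespace of "llm" would export them as llm_llm_*
        const_labels:
          service: llm-monitor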

      otlp/tempo:
        endpoint: tempo:4317
        tls:
          insecure: true

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp/tempo]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [prometheus]

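The Collector exposes a Prometheus metrics endpoint on port 8889 that Prometheus scrapes. This decouples your application from the metrics backend: if Prometheus is down, your app keeps running and the Collector keeps serving the latest cumulative values, so counters catch up on the next successful scrape even though the samples in between are lost. The exporter typically converts dots in metric names to underscores and may append unit or type suffixes such as _total, so confirm the exported names on the /metrics endpoint before writing queries.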
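
Because exported names can differ from the names used in code, it helps to list what the endpoint actually serves before building dashboards. The sketch below parses Prometheus text-format output and collects the metric names. The sample text is hypothetical; in practice you would fetch the real output from the Collector's port 8889:

```python
def metric_names(exposition_text: str) -> set:
    """Collect metric names from Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The name ends at the first '{' (labels) or space (value)
        end = len(line)
        for sep in ("{", " "):
            pos = line.find(sep)
            if pos != -1:
                end = min(end, pos)
        names.add(line[:end])
    return names


# Hypothetical scrape output used only to exercise the parser
sample = """\
# TYPE llm_tokens_input_total counter
llm_tokens_input_total{model="gpt-4o"} 1234
llm_latency_ms_bucket{model="gpt-4o",le="500"} 17
llm_latency_ms_bucket{model="gpt-4o",le="+Inf"} 20
"""
print(sorted(metric_names(sample)))
# ['llm_latency_ms_bucket', 'llm_tokens_input_total']
```

If the names printed for your Collector differ from the ones used in this tutorial's queries, adjust the queries rather than the instrumentation.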

9. Step 5 — Configure Prometheus to Scrape OTel Metrics

Add a scrape config to Prometheus:

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8889']
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        replacement: 'llm-app'

If you are on Kubernetes, use the Prometheus Operator with a ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: otel-collector
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: opentelemetry
  endpoints:
    - port: prometheus
      path: /metrics

10. Step 6 — Build the Grafana Dashboard

Now visualize the data. Create a Grafana dashboard with four panels:

Panel 1: Token Usage Over Time

# Input token rate per minute (rate() returns a per-second value, hence * 60)
sum by (model) (rate(llm_tokens_input_total[5m])) * 60

# Output token rate per minute
sum by (model) (rate(llm_tokens_output_total[5m])) * 60

Grouping by model gives a per-model breakdown. A spike in input tokens is often the first signal of a prompt engineering experiment going wrong.
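
To see what rate() is doing with a counter, the following sketch computes a per-second rate from raw samples. It is a simplification: unlike Prometheus, it does not extrapolate to the edges of the window, and it treats any drop in value as a counter reset. The sample values are hypothetical:

```python
def simple_rate(samples):
    """Approximate per-second rate from (timestamp_seconds, counter_value) pairs."""
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a reset the counter restarts near zero, so the whole
        # current value counts as new increase
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Hypothetical scrapes of an input-token counter, 100 seconds apart,
# with a process restart between the second and third scrape
samples = [(0, 1000), (100, 2500), (200, 500), (300, 1500)]
per_second = simple_rate(samples)
print(per_second, per_second * 60)  # 10.0 600.0
```

The restart in the middle does not produce a negative rate, which is why counters are preferred over gauges for token totals.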

Panel 2: LLM Cost Accumulation

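# Cost over the last hour in USD
sum(increase(llm_cost_usd_sum[1h]))

# Cost per model over the last hour
sum by (model) (increase(llm_cost_usd_sum[1h]))

Set a budget alert: sum(increase(llm_cost_usd_sum[24h])) > 100 fires when spend over the trailing 24 hours crosses $100. The raw llm_cost_usd_sum series is cumulative since process start, so comparing it directly against a daily budget would be misleading.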

Panel 3: Latency Distribution (P50/P95/P99)

# Latency percentiles, aggregated across all models and instances
histogram_quantile(0.50, sum by (le) (rate(llm_latency_ms_bucket[5m])))
histogram_quantile(0.95, sum by (le) (rate(llm_latency_ms_bucket[5m])))
histogram_quantile(0.99, sum by (le) (rate(llm_latency_ms_bucket[5m])))

High P99 latency with normal P50 points to tail-end issues. On self-hosted inference this is often GPU memory pressure or KV cache thrashing in vLLM.
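
Percentiles from histograms are estimates, and their accuracy depends on the bucket boundaries. The sketch below follows the linear-interpolation idea behind histogram_quantile(); it is a simplified illustration rather than a port of the Prometheus implementation, and the bucket values are hypothetical:

```python
import math


def bucket_quantile(q, buckets):
    """Estimate a quantile from cumulative buckets [(upper_bound, count), ...]."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Observations above the last finite bound cannot be
                # located, so the estimate is capped at that bound
                return lower_bound
            span = count - lower_count
            if span == 0:
                return upper_bound
            # Assume observations are spread evenly inside the bucket
            return lower_bound + (upper_bound - lower_bound) * (rank - lower_count) / span
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative latency buckets in milliseconds
latency_ms = [(100.0, 10), (500.0, 60), (1000.0, 90), (math.inf, 100)]
print(bucket_quantile(0.50, latency_ms))  # 420.0
print(bucket_quantile(0.99, latency_ms))  # 1000.0
```

Note how P99 is pinned at 1000 ms even though 10% of requests were slower. If your highest finite bucket sits below an alert threshold, the alert can never fire, so choose bucket boundaries that extend past the latencies you want to detect.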

Panel 4: Error Rate

# LLM API errors (assumes you record a counter named llm.errors; Step 2 does not define one)
rate(llm_errors_total[5m])

# Rate-limited responses (assumes a counter named llm.rate_limited)
rate(llm_rate_limited_total[5m])
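
The error queries assume counters that the application has to record itself. A minimal sketch of that recording logic follows. The function name, the error classes, and the RecordingCounter stand-in are all hypothetical; in a real application the two counters would come from meter.create_counter(), whose instruments expose the same add(amount, attributes) call:

```python
class RecordingCounter:
    """Stand-in for an OTel counter: stores every add() call for inspection."""

    def __init__(self):
        self.calls = []

    def add(self, amount, attributes=None):
        self.calls.append((amount, dict(attributes or {})))


def record_llm_error(errors, rate_limited, status_code: int, model: str) -> str:
    """Count a failed LLM call and return the error class used as a label."""
    if status_code == 429:
        error_type = "rate_limited"
        rate_limited.add(1, {"model": model})
    elif 500 <= status_code < 600:
        error_type = "server_error"
    else:
        error_type = "client_error"
    # Keep the label set small: error class and model, never the message text
    errors.add(1, {"model": model, "error_type": error_type})
    return error_type


errors, rate_limited = RecordingCounter(), RecordingCounter()
print(record_llm_error(errors, rate_limited, 429, "gpt-4o"))  # rate_limited
print(record_llm_error(errors, rate_limited, 503, "gpt-4o"))  # server_error
```

Passing the counters in as arguments keeps the classification logic testable without a running Collector.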

Grafana Dashboard JSON Template

Import this JSON into Grafana to get a pre-built dashboard:

{
  "dashboard": {
    "title": "LLM Monitoring Stack",
    "uid": "llm-monitoring",
    "panels": [
      {
        "title": "Token Usage (Input vs Output)",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
        "targets": [
          {
            "expr": "rate(llm_tokens_input_total[5m]) * 60",
            "legendFormat": "Input tokens/min"
          },
          {
            "expr": "rate(llm_tokens_output_total[5m]) * 60",
            "legendFormat": "Output tokens/min"
          }
        ]
      },
      {
        "title": "LLM Cost ($/hour)",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
        "targets": [
          {
            "expr": "sum(rate(llm_cost_usd_sum[5m])) * 3600",
            "legendFormat": "Estimated $/hour"
          }
        ]
      },
      {
        "title": "Latency Percentiles",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
        "targets": [
          {
            "expr": "histogram_quantile(0.50, rate(llm_latency_ms_bucket[5m]))",
            "legendFormat": "P50"
          },
          {
            "expr": "histogram_quantile(0.95, rate(llm_latency_ms_bucket[5m]))",
            "legendFormat": "P95"
          },
          {
            "expr": "histogram_quantile(0.99, rate(llm_latency_ms_bucket[5m]))",
            "legendFormat": "P99"
          }
        ]
      },
      {
        "title": "Input Tokens by Model",
        "type": "piechart",
        "gridPos": {"h": 8, "w": 6, "x": 12, "y": 8},
        "targets": [
          {
            "expr": "sum by (model) (rate(llm_tokens_input_total[5m]))",
            "legendFormat": "{{model}}"
          }
        ]
      }
    ]
  }
}

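Save this as grafana-llm-dashboard.json and import it via Grafana UI → Dashboards → Import. The outer "dashboard" wrapper is the shape the Grafana HTTP API expects; if the UI import rejects the file, import only the inner object.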
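
If you maintain more than a handful of panels, generating the JSON can be less error-prone than editing it by hand. The sketch below builds panels and places them on Grafana's 24-column grid. The helper name make_panel and its layout rule are hypothetical, and the queries are the ones used earlier in this section:

```python
import json


def make_panel(title: str, exprs: dict, index: int, width: int = 12, height: int = 8) -> dict:
    """Build a timeseries panel and place it on Grafana's 24-column grid."""
    per_row = 24 // width
    return {
        "title": title,
        "type": "timeseries",
        "gridPos": {
            "h": height,
            "w": width,
            "x": (index % per_row) * width,   # column offset within the row
            "y": (index // per_row) * height,  # rows stack downwards
        },
        "targets": [
            {"expr": expr, "legendFormat": legend} for legend, expr in exprs.items()
        ],
    }


panels = [
    make_panel("Token Usage", {"Input": "rate(llm_tokens_input_total[5m])"}, index=0),
    make_panel("LLM Cost ($/hour)", {"Estimated $/hour": "sum(rate(llm_cost_usd_sum[5m])) * 3600"}, index=1),
    make_panel("Latency P99", {"P99": "histogram_quantile(0.99, rate(llm_latency_ms_bucket[5m]))"}, index=2),
]
dashboard = {"title": "LLM Monitoring Stack", "uid": "llm-monitoring", "panels": panels}
dashboard_json = json.dumps(dashboard, indent=2)
```

Writing dashboard_json to a file gives you an artifact that can be reviewed and versioned alongside the application code.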

11. Step 7 — Add Alerting Rules

Prometheus alerting rules catch problems before users do. Add to your alerts.yml:

groups:
  - name: llm_alerts
    rules:
      # Token budget breach
      - alert: LLMTokensOverBudget
        expr: sum(increase(llm_tokens_input_total[1h])) > 1000000
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "High token usage detected"
          description: "LLM input tokens grew by {{ $value | humanize }} in the last hour."

      # Latency spike
      - alert: LLMLatencyHigh
        expr: histogram_quantile(0.99, sum by (le) (rate(llm_latency_ms_bucket[5m]))) > 5000
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "LLM P99 latency above 5s"
          # The value is in milliseconds, so humanizeDuration (which expects seconds) is not used
          description: "P99 latency is {{ $value | humanize }} ms, likely GPU queueing or model overload."

      # Cost overrun
      - alert: LLMCostOverrun
        expr: sum(increase(llm_cost_usd_sum[1h])) > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "LLM cost exceeds $50/hour"
          description: "Spend over the last hour: ${{ $value | humanize }}. Review prompt efficiency or model routing."

Route these alerts to PagerDuty, Slack, or any webhook via Alertmanager.
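
The for: field in each rule delays firing until the condition has held continuously for that long, which filters out brief spikes. The sketch below simulates that behavior for a series of evaluations. It is a simplified model of the pending and firing states, and the latency samples are hypothetical:

```python
def alert_states(values, threshold, for_seconds, interval_seconds):
    """Return the alert state at each evaluation: inactive, pending, or firing."""
    states = []
    active_since = None  # time at which the condition first became true
    for i, value in enumerate(values):
        now = i * interval_seconds
        if value > threshold:
            if active_since is None:
                active_since = now
            held = now - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
    return states


# Hypothetical P99 latency samples (ms), evaluated once a minute
p99 = [6000, 4000, 6000, 6000, 6000, 6000, 4000]
print(alert_states(p99, threshold=5000, for_seconds=180, interval_seconds=60))
# ['pending', 'inactive', 'pending', 'pending', 'pending', 'firing', 'inactive']
```

The single spike at the start never fires. Only the sustained breach does, three minutes after it began.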

12. Adding Traces: OpenTelemetry Tempo Integration

Metrics tell you that something is wrong. Traces tell you where. For LLM applications, traces are critical because a single user request can trigger multiple model calls, a vector search, a retrieval step, and a formatting pass, some of them running concurrently.

Enable trace context propagation in your LLM calls:

from openai import OpenAI
from opentelemetry import trace
from opentelemetry.trace import SpanKind
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

client = OpenAI()  # reads OPENAI_API_KEY from the environment
propagator = TraceContextTextMapPropagator()
tracer = trace.get_tracer(__name__)

def call_llm_with_trace(prompt: str, context: dict):
    # Extract any incoming trace context (from an API request, for example).
    # With a plain dict carrier the header key must be lower-case "traceparent".
    ctx = propagator.extract(carrier=context.get("headers", {}))

    with tracer.start_as_current_span(
        "llm.generation",
        context=ctx,
        kind=SpanKind.CLIENT,
        attributes={
            "gen_ai.system": "openai",
            "gen_ai.request.model": "gpt-4o",
            "gen_ai.request.max_tokens": 1024,
        },
    ) as span:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=1024,  # keep the request in line with the span attribute
        )

        span.set_attribute("gen_ai.response.id", response.id)
        span.set_attribute("gen_ai.response.finish_reasons",
            [choice.finish_reason for choice in response.choices])
        span.set_attribute("gen_ai.usage.input_tokens",
            response.usage.prompt_tokens)
        span.set_attribute("gen_ai.usage.output_tokens",
            response.usage.completion_tokens)

        return response

The trace context flows into Grafana Tempo. In Grafana, click any span to see the full request timeline — including how long the vector retrieval took vs. the actual model inference.
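
The propagator works by reading and writing the W3C traceparent HTTP header. When traces fail to connect across services, inspecting that header is a quick first check. The sketch below parses one by hand; the function name is hypothetical, and real code should rely on the propagator rather than this parser:

```python
import re

# version - trace id (32 hex) - parent span id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str):
    """Split a W3C traceparent header into its fields, or return None if invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # All-zero IDs are explicitly invalid in the W3C specification
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
parsed = parse_traceparent(header)
print(parsed["trace_id"], parsed["sampled"])
# 4bf92f3577b34da6a3ce929d0e0e4736 True
```

The trace_id printed here is the value to search for in Tempo when you want to confirm that an upstream request and the LLM span ended up in the same trace.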

13. Multi-Model Routing: A Key LLM Monitoring Pattern

A mature LLM stack rarely runs on a single model. Production systems route requests based on task complexity:

Task	Model	Output cost (per 1K tokens)
Simple classification	GPT-3.5-Turbo	$0.002
Standard Q&A	GPT-4o	$0.015
Complex reasoning	GPT-4	$0.06

Track routing decisions with OTel attributes:

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def route_and_call_llm(task: str, prompt: str):
    # Route to appropriate model
    if task == "classify":
        model = "gpt-3.5-turbo"
    elif task == "reason":
        model = "gpt-4"
    else:
        model = "gpt-4o"

    # Record the routing decision on a span so it is searchable in traces
    with tracer.start_as_current_span("llm.route") as span:
        span.set_attribute("llm.routed_model", model)
        span.set_attribute("llm.task_type", task)

        return call_llm(prompt, model=model)

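Span attributes such as llm.task_type are searchable in Tempo, but they do not become Prometheus labels. To filter the token and cost panels by task type, also pass a low-cardinality task_type attribute when recording the metrics. You can then see spend distribution by use case and identify opportunities to route more requests to cheaper models.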
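
A small helper can keep the routing decision and the metric labels in sync. The sketch below is illustrative: the routing table mirrors route_and_call_llm, while the names ROUTES, metric_attributes, and spend_by_task are hypothetical:

```python
from collections import defaultdict

# Hypothetical routing table: task type -> model
ROUTES = {"classify": "gpt-3.5-turbo", "reason": "gpt-4"}
DEFAULT_MODEL = "gpt-4o"


def metric_attributes(task: str) -> dict:
    """Attributes to attach to every token and cost measurement."""
    return {"model": ROUTES.get(task, DEFAULT_MODEL), "task_type": task}


def spend_by_task(records):
    """Sum cost per task type from (task, cost_usd) pairs."""
    totals = defaultdict(float)
    for task, cost in records:
        totals[task] += cost
    return dict(totals)


print(metric_attributes("classify"))
# {'model': 'gpt-3.5-turbo', 'task_type': 'classify'}
print(spend_by_task([("classify", 0.25), ("qa", 0.5), ("classify", 0.25)]))
# {'classify': 0.5, 'qa': 0.5}
```

Passing metric_attributes(task) as the attributes argument of each add() or record() call would let Prometheus group cost by task_type in the same way spend_by_task does here.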

14. Scaling Considerations

For high-throughput LLM applications (100+ requests/second), the monitoring stack needs to keep up:

Collector scaling. Run the OTel Collector as a Kubernetes Deployment with HPA (Horizontal Pod Autoscaler) targeting CPU > 70%. For >1000 RPS, consider the OTel Collector's gateway mode with a load balancer.

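Prometheus resource planning. Request rate alone does not drive Prometheus load: the SDK aggregates measurements in process, so 100 RPS updates the same counters and histogram buckets rather than creating new time series. The number of series is set by the distinct label combinations, multiplied by the bucket count for histograms. Keep rate() windows at several scrape intervals (5 minutes works well with a 15-second scrape), and precompute expensive expressions with recording rules to keep Grafana responsive.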

Cardinality management. Never add high-cardinality attributes (user IDs, request IDs) as metric labels. Use traces for per-request detail, metrics for aggregate health. A metric label with 10,000 possible values will kill Prometheus.
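
A quick estimate shows why a single high-cardinality label is so costly. The sketch below counts the series one histogram produces: one per bucket (including +Inf) plus _sum and _count, for every label combination. The label counts and the bucket count are hypothetical inputs:

```python
from math import prod


def histogram_series(label_cardinalities: dict, finite_buckets: int) -> int:
    """Estimate the Prometheus series created by one histogram metric."""
    combinations = prod(label_cardinalities.values())
    # Each label combination gets one series per bucket (finite buckets
    # plus +Inf) and two more for the _sum and _count series
    return combinations * (finite_buckets + 1 + 2)


safe = histogram_series({"model": 3, "task_type": 3}, finite_buckets=15)
risky = histogram_series({"model": 3, "task_type": 3, "user_id": 10_000}, finite_buckets=15)
print(safe, risky)  # 162 1620000
```

Adding one user_id label turns 162 series into more than 1.6 million for a single metric, which is the kind of growth that exhausts Prometheus memory.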

15. What to Monitor Next

This tutorial covers the foundation. From here, expand your observability stack in three directions:

Hallucination detection. Add an evaluation layer that cross-checks model outputs against ground truth at scale. Arize Phoenix (open source) and LangSmith both integrate with OTel for continuous evaluation.

vLLM-specific metrics. If you are running vLLM as your inference server, scrape the Prometheus endpoint it exposes at /metrics on the API server port (8000 by default). Its metrics use the vllm: prefix and cover signals such as KV cache usage, request queue depth, and time to first token.

Fine-tuning cost tracking. Monitor dataset tokenization costs and training job resource usage; for teams that fine-tune frequently, they can rival or exceed inference costs.

16. Conclusion

OpenTelemetry, Prometheus, and Grafana give you enterprise-grade LLM observability without enterprise-grade cost or vendor lock-in. The stack handles token counting, cost tracking, latency analysis, error monitoring, and trace-level debugging — everything you need to run LLM applications with confidence.

Start with auto-instrumentation, add custom metrics for token and cost tracking, deploy the OTel Collector, and import the Grafana dashboard. You will have a working stack in an afternoon.

The signals you capture today become the alerts that prevent incidents tomorrow.