Why LLM Monitoring Is Different from Traditional APM

Application Performance Monitoring (APM) tools were built for deterministic systems. A REST API call comes in, a database query runs, a response goes out. Latency is measurable, errors are countable, and the call graph is finite.

LLM applications break these tools. A single LLM call can activate billions of model parameters, consume 10,000 tokens of context, and take 30 seconds to complete. Token usage varies per request. Model providers bill at per-token rates. Prompt templates change independently of application code. And the quality of outputs - not just their speed - determines whether your product works.

Standard APM gives you: latency percentiles, error rates, CPU/memory utilization. LLM monitoring needs: token consumption by model, prompt version tracking, context window saturation, output quality metrics, cost per completion, and semantic drift detection.

You need a monitoring stack purpose-built for language models. This guide builds it from scratch.

Architecture Overview

The stack has four layers:

  • Instrumentation: OpenTelemetry SDK in your application - captures traces, metrics, and logs automatically
  • Collection: OTel Collector - batches and forwards telemetry to downstream systems
  • Storage: Prometheus for metrics, Tempo for traces - both open source, both integrate with Grafana
  • Visualization: Grafana dashboards for token costs, latency, error rates, and quality signals
┌─────────────────────────────────────────────────────┐
│  Your LLM Application (Python/Node)                 │
│  OpenTelemetry SDK auto-instruments LLM calls       │
└────────────────┬────────────────────────────────────┘
                 │ OTLP (gRPC/HTTP)
┌────────────────▼────────────────────────────────────┐
│  OpenTelemetry Collector                            │
│  Receives traces + metrics, processes and exports  │
└────────────────┬────────────────────────────────────┘
                 │
    ┌────────────┴────────────┐
    │                         │
    ▼                         ▼
┌──────────┐            ┌──────────┐
│Prometheus│            │  Tempo   │
│(metrics) │            │ (traces) │
└────┬─────┘            └────┬─────┘
     │                       │
     └───────────┬───────────┘
                 ▼
          ┌────────────┐
          │  Grafana   │
          │ Dashboards │
          └────────────┘

This architecture keeps all telemetry inside your own infrastructure: no SaaS backends, no data leaving your network.
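
A minimal docker-compose.yml wiring these services together might look like the sketch below. Image tags, ports, and config file paths are illustrative; pin image versions for production:

```yaml
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel/config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel/config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC from the application
      - "4318:4318"   # OTLP HTTP
      - "8889:8889"   # Prometheus scrape endpoint

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo/tempo.yaml

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    depends_on:
      - prometheus
      - tempo
```

With this running, `docker compose up -d` brings the whole stack online locally; point your application's OTLP exporter at localhost:4317.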

Prerequisites

  • Python 3.9+ (we'll use the OpenTelemetry Python SDK)
  • A running LLM application (OpenAI API, Anthropic, or local vLLM)
  • Docker and Docker Compose (for Prometheus + Grafana)
  • Optional: Kubernetes (all concepts apply to K8s deployments)

Step 1 - Install OpenTelemetry SDK

The OpenTelemetry ecosystem provides auto-instrumentation for the openai, anthropic, and litellm client libraries with zero code changes (the LLM instrumentation packages are community-maintained rather than part of the core SDK). You install the packages, set environment variables, and your LLM calls are instrumented automatically.

# Core OTel SDK
pip install opentelemetry-api \
  opentelemetry-sdk \
  opentelemetry-exporter-otlp

# Auto-instrumentation for Python apps
pip install opentelemetry-instrumentation-openai \
  opentelemetry-instrumentation-anthropic

# Resource semantic conventions (for adding LLM-specific attributes)
pip install opentelemetry-semantic-conventions

Auto-instrumentation runs as a wrapper around your existing application. You do not modify your code:

opentelemetry-instrument \
  --service_name "my-llm-app" \
  --exporter_otlp_endpoint "http://localhost:4317" \
  python your_app.py

Step 2 - Define LLM-Specific Metrics

Auto-instrumentation captures spans, but for a complete LLM monitoring setup you want custom metrics that make sense for language models. Add these to your application:

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.resources import Resource

# Create a resource with your service identity
resource = Resource.create({"service.name": "llm-chatbot"})

# Attach the resource to a MeterProvider and register it globally
metrics.set_meter_provider(MeterProvider(resource=resource))

# Get a meter from the global provider (creates instruments)
meter = metrics.get_meter("llm_monitor")

# Token counters
llm_tokens_in = meter.create_counter(
    "llm.tokens.input",
    description="Total input tokens consumed",
    unit="1"
)

llm_tokens_out = meter.create_counter(
    "llm.tokens.output",
    description="Total output tokens generated",
    unit="1"
)

# Cost histogram (in USD)
llm_cost = meter.create_histogram(
    "llm.cost.usd",
    description="LLM inference cost in USD",
    unit="USD"
)

# Latency histogram
llm_latency = meter.create_histogram(
    "llm.latency.ms",
    description="LLM inference latency in milliseconds",
    unit="ms"
)

# Quality signal: hallucination score (0-1)
llm_quality = meter.create_gauge(
    "llm.hallucination.score",
    description="Hallucination probability score (0=clean, 1=likely hallucination)",
    unit="1"
)

Record metrics inside your inference call:

import time

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def call_llm(prompt: str, model: str = "gpt-4"):
    start = time.monotonic()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    latency_ms = (time.monotonic() - start) * 1000

    usage = response.usage
    cost = calculate_cost(usage.prompt_tokens, usage.completion_tokens, model)

    # Record metrics with a low-cardinality model label
    llm_tokens_in.add(usage.prompt_tokens, {"model": model})
    llm_tokens_out.add(usage.completion_tokens, {"model": model})
    llm_cost.record(cost, {"model": model})
    llm_latency.record(latency_ms, {"model": model})

    return response

def calculate_cost(prompt_tokens: int, completion_tokens: int, model: str) -> float:
    # Illustrative per-token rates - verify against your provider's current pricing
    pricing = {
        "gpt-4": (0.00003, 0.00006),   # ($/token) prompt, completion
        "gpt-4o": (0.000005, 0.000015),
        "gpt-3.5-turbo": (0.0000015, 0.000002),
    }
    p, c = pricing.get(model, (0.0, 0.0))
    return (prompt_tokens * p) + (completion_tokens * c)
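
To sanity-check the arithmetic, here is a quick worked example using the illustrative rates above (a 1,000-token prompt and a 500-token completion on gpt-4o); the function is copied standalone so it runs on its own:

```python
# Illustrative per-token rates (prompt, completion); verify current provider pricing.
PRICING = {
    "gpt-4": (0.00003, 0.00006),
    "gpt-4o": (0.000005, 0.000015),
    "gpt-3.5-turbo": (0.0000015, 0.000002),
}

def calculate_cost(prompt_tokens: int, completion_tokens: int, model: str) -> float:
    # Unknown models fall back to zero cost rather than raising
    p, c = PRICING.get(model, (0.0, 0.0))
    return (prompt_tokens * p) + (completion_tokens * c)

# 1000 * 0.000005 + 500 * 0.000015 = 0.005 + 0.0075 = 0.0125
print(f"${calculate_cost(1000, 500, 'gpt-4o'):.4f}")  # → $0.0125
```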

Step 3 - Add Semantic Conventions for LLM Spans

OpenTelemetry's semantic conventions standardize span and metric attribute names. Use the LLM conventions to add model and request context:

from contextlib import contextmanager

from opentelemetry import trace
from opentelemetry.trace import SpanKind, Status, StatusCode

@contextmanager
def wrap_llm_span(span_name: str, model: str, prompt_tokens: int):
    tracer = trace.get_tracer(__name__)

    with tracer.start_as_current_span(
        span_name,
        kind=SpanKind.CLIENT,
        attributes={
            "gen_ai.system": "openai",
            "gen_ai.request.model": model,
            "gen_ai.request.max_tokens": 2048,
            "llm.prompt_tokens": prompt_tokens,
        }
    ) as span:
        yield span
        span.set_status(Status(StatusCode.OK))

# The caller runs the request inside the span, then attaches response attributes:
with wrap_llm_span("llm.generation", model="gpt-4o", prompt_tokens=512) as span:
    response = client.chat.completions.create(   # your OpenAI client
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    span.set_attribute("gen_ai.response.id", response.id)
    span.set_attribute("gen_ai.response.model", response.model)
    span.set_attribute("gen_ai.choice.finish_reason", response.choices[0].finish_reason)

The key gen_ai.* attributes follow the OTel semantic conventions for Generative AI. They enable Grafana dashboards to filter and group by model, system, and operation - critical for multi-model deployments.

Step 4 - Deploy the OpenTelemetry Collector

The OTel Collector receives telemetry from your application and exports it to your backends. Deploy it as a Docker sidecar or Kubernetes DaemonSet:

# otel-collector.yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: llm-otel-collector
spec:
  mode: daemonset
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318

    processors:
      batch:
        timeout: 1s
        send_batch_size: 1024

      memory_limiter:
        check_interval: 1s
        limit_mib: 512

    exporters:
      prometheus:
        endpoint: "0.0.0.0:8889"
        namespace: "llm"
        const_labels:
          service: llm-monitor

      otlp/tempo:
        endpoint: tempo:4317
        tls:
          insecure: true

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp/tempo]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [prometheus]

The Collector exposes a Prometheus metrics endpoint (8889) that Prometheus scrapes. This decouples your application from the metrics backend - if Prometheus is down, your app keeps running and the Collector queues what it can until the backend recovers.

Step 5 - Configure Prometheus to Scrape OTel Metrics

Add a scrape config to Prometheus:

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8889']
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        replacement: 'llm-app'

If you are on Kubernetes, use the Prometheus Operator with a ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: otel-collector
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: opentelemetry
  endpoints:
    - port: prometheus
      path: /metrics

Step 6 - Build the Grafana Dashboard

Now visualize the data. Create a Grafana dashboard with four panels:

Panel 1: Token Usage Over Time

# Input tokens per second (averaged over 5m)
rate(llm_tokens_input_total[5m])

# Output tokens per second (averaged over 5m)
rate(llm_tokens_output_total[5m])

Group by model to see a per-model breakdown. A spike in input tokens is often the first signal that a prompt engineering experiment has gone wrong.

Panel 2: LLM Cost Accumulation

# Cost over the last hour (USD)
sum(increase(llm_cost_usd_sum[1h]))

# Cost per model over the last hour
sum by (model) (increase(llm_cost_usd_sum[1h]))

Set a budget alert: increase(llm_cost_usd_sum[24h]) > 100 fires when spend over the trailing 24 hours crosses $100. (The raw sum(llm_cost_usd_sum) only ever grows, so always alert on an increase() window rather than the bare counter.)

Panel 3: Latency Distribution (P50/P95/P99)

# Latency percentiles
histogram_quantile(0.50, rate(llm_latency_ms_bucket[5m]))
histogram_quantile(0.95, rate(llm_latency_ms_bucket[5m]))
histogram_quantile(0.99, rate(llm_latency_ms_bucket[5m]))

High P99 latency with normal P50 tells you about tail-end issues - often GPU memory pressure or KV cache thrashing in vLLM.

Panel 4: Error Rate

# LLM API errors (depends on how you record errors)
rate(llm_errors_total[5m])

# Rate-limited responses
rate(llm_rate_limited_total[5m])
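
The error counters queried above are not created in Step 2, so here is one way you could record them. The classify_error helper and its status-code mapping are a hypothetical sketch; substitute the exception types your client library actually raises:

```python
# Hypothetical error classifier: maps an HTTP status code to a
# low-cardinality label value suitable for a Prometheus counter.
def classify_error(status_code: int) -> str:
    if status_code == 429:
        return "rate_limited"
    if 400 <= status_code < 500:
        return "client_error"
    if 500 <= status_code < 600:
        return "server_error"
    return "unknown"

# In a request wrapper you would then do something like:
#   llm_errors.add(1, {"model": model, "error_type": classify_error(e.status_code)})
# where llm_errors is created with meter.create_counter("llm.errors").

print(classify_error(429))  # → rate_limited
print(classify_error(503))  # → server_error
```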

Grafana Dashboard JSON Template

Import this JSON into Grafana to get a pre-built dashboard:

{
  "dashboard": {
    "title": "LLM Monitoring Stack",
    "uid": "llm-monitoring",
    "panels": [
      {
        "title": "Token Usage (Input vs Output)",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
        "targets": [
          {
            "expr": "rate(llm_tokens_input_total[5m])",
            "legendFormat": "Input tokens/min"
          },
          {
            "expr": "rate(llm_tokens_output_total[5m])",
            "legendFormat": "Output tokens/min"
          }
        ]
      },
      {
        "title": "LLM Cost ($/hour)",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
        "targets": [
          {
            "expr": "sum(rate(llm_cost_usd_sum[5m])) * 3600",
            "legendFormat": "Estimated $/hour"
          }
        ]
      },
      {
        "title": "Latency Percentiles",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
        "targets": [
          {
            "expr": "histogram_quantile(0.50, rate(llm_latency_ms_bucket[5m]))",
            "legendFormat": "P50"
          },
          {
            "expr": "histogram_quantile(0.95, rate(llm_latency_ms_bucket[5m]))",
            "legendFormat": "P95"
          },
          {
            "expr": "histogram_quantile(0.99, rate(llm_latency_ms_bucket[5m]))",
            "legendFormat": "P99"
          }
        ]
      },
      {
        "title": "Requests by Model",
        "type": "piechart",
        "gridPos": {"h": 8, "w": 6, "x": 12, "y": 8},
        "targets": [
          {
            "expr": "sum by (model) (rate(llm_tokens_input_total[5m]))",
            "legendFormat": "{{model}}"
          }
        ]
      }
    ]
  }
}

Save this as grafana-llm-dashboard.json and import it via Grafana UI → Dashboards → Import.

Step 7 - Add Alerting Rules

Prometheus alerting rules catch problems before users do. Add to your alerts.yml:

groups:
  - name: llm_alerts
    rules:
      # Token budget breach
      - alert: LLMTokensOverBudget
        expr: increase(llm_tokens_input_total[1h]) > 1000000
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "High token usage detected"
          description: "LLM input tokens grew by {{ $value | humanize }} in the last hour."

      # Latency spike
      - alert: LLMLatencyHigh
        expr: histogram_quantile(0.99, rate(llm_latency_ms_bucket[5m])) > 5000
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "LLM P99 latency above 5s"
          description: "P99 latency is {{ $value | humanizeDuration }} - likely GPU queueing or model overload."

      # Cost overrun
      - alert: LLMCostOverrun
        expr: increase(llm_cost_usd_sum[1h]) > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "LLM cost rate exceeds $50/hour"
          description: "Current spend rate: ${{ $value | humanize }}/hour. Review prompt efficiency or model routing."

Route these alerts to PagerDuty, Slack, or any webhook via Alertmanager.
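
A minimal Alertmanager route for the rules above might look like this (the Slack webhook URL, channel, and PagerDuty routing key are placeholders):

```yaml
route:
  receiver: slack-llm-alerts
  group_by: [alertname, severity]
  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty-oncall

receivers:
  - name: slack-llm-alerts
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE_ME
        channel: "#llm-alerts"
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: REPLACE_ME
```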

Adding Traces: OpenTelemetry Tempo Integration

Metrics tell you that something is wrong. Traces tell you where. For LLM applications, traces are critical because a single user request can trigger multiple model calls, a vector search, a retrieval step, and a formatting pass - all running concurrently.

Enable trace context propagation in your LLM calls:

from opentelemetry import trace
from opentelemetry.trace import SpanKind
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator
from openai import OpenAI

client = OpenAI()
propagator = TraceContextTextMapPropagator()

def call_llm_with_trace(prompt: str, context: dict):
    # Extract any incoming trace context (from an API request, for example)
    ctx = propagator.extract(carrier=context.get("headers", {}))

    tracer = trace.get_tracer(__name__)

    with tracer.start_as_current_span(
        "llm.generation",
        context=ctx,
        kind=SpanKind.CLIENT,
        attributes={
            "gen_ai.system": "openai",
            "gen_ai.request.model": "gpt-4o",
            "gen_ai.request.max_tokens": 1024,
        }
    ) as span:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}]
        )

        span.set_attribute("gen_ai.response.id", response.id)
        span.set_attribute("gen_ai.response.finish_reason",
            response.choices[0].finish_reason)
        span.set_attribute("gen_ai.usage.prompt_tokens",
            response.usage.prompt_tokens)
        span.set_attribute("gen_ai.usage.completion_tokens",
            response.usage.completion_tokens)

        return response

The trace context flows into Grafana Tempo. In Grafana, click any span to see the full request timeline - including how long the vector retrieval took vs. the actual model inference.
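
Under the hood, TraceContextTextMapPropagator reads and writes the W3C traceparent header. A minimal stdlib parser, purely to illustrate the wire format (the SDK's propagator does all of this for you):

```python
def parse_traceparent(header: str) -> dict:
    """Parse a W3C traceparent header: version-trace_id-span_id-flags."""
    version, trace_id, span_id, flags = header.split("-")
    return {
        "version": version,     # 2 hex chars
        "trace_id": trace_id,   # 32 hex chars: identifies the whole request
        "span_id": span_id,     # 16 hex chars: identifies the parent span
        "sampled": int(flags, 16) & 0x01 == 1,  # bit 0 of flags: sampled
    }

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx["trace_id"])  # → 4bf92f3577b34da6a3ce929d0e0e4736
print(ctx["sampled"])   # → True
```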

Multi-Model Routing: A Key LLM Monitoring Pattern

A mature LLM stack rarely runs on a single model. Production systems route requests based on task complexity:

Task                   Model           Cost (illustrative)
Simple classification  GPT-3.5-Turbo   ~$0.001/1K tokens
Standard Q&A           GPT-4o          ~$0.015/1K tokens
Complex reasoning      GPT-4           ~$0.06/1K tokens

Track routing decisions with OTel attributes:

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def route_and_call_llm(task: str, prompt: str):
    # Route to appropriate model
    if task == "classify":
        model = "gpt-3.5-turbo"
    elif task == "reason":
        model = "gpt-4"
    else:
        model = "gpt-4o"

    with tracer.start_as_current_span("llm.route") as span:
        span.set_attribute("llm.routed_model", model)
        span.set_attribute("llm.task_type", task)
        
        return call_llm(prompt, model=model)

In Grafana, filter your token and cost panels by llm.task_type to understand spend distribution by use case - and identify opportunities to route more requests to cheaper models.
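
To see why routing matters for the bill, here is a back-of-envelope simulation using the illustrative rates from the table; the request mix and token counts are made up for the example:

```python
# Illustrative $/1K-token rates from the routing table above
RATE_PER_1K = {"gpt-3.5-turbo": 0.001, "gpt-4o": 0.015, "gpt-4": 0.06}

def route(task: str) -> str:
    # Same routing policy as route_and_call_llm above
    return {"classify": "gpt-3.5-turbo", "reason": "gpt-4"}.get(task, "gpt-4o")

# Hypothetical daily mix: (task, requests, avg tokens per request)
workload = [("classify", 50_000, 300), ("qa", 10_000, 800), ("reason", 1_000, 2_000)]

total = 0.0
for task, requests, tokens in workload:
    model = route(task)
    cost = requests * tokens / 1000 * RATE_PER_1K[model]
    total += cost
    print(f"{task:10s} -> {model:14s} ${cost:8.2f}")
print(f"{'total':10s}    {'':14s} ${total:8.2f}")
```

Routing the high-volume classification traffic to the cheapest model is what keeps the total dominated by the small number of expensive reasoning calls.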

Scaling Considerations

For high-throughput LLM applications (100+ requests/second), the monitoring stack needs to keep up:

Collector scaling. Run the OTel Collector as a Kubernetes Deployment with HPA (Horizontal Pod Autoscaler) targeting CPU > 70%. For >1000 RPS, consider the OTel Collector's gateway mode with a load balancer.

Prometheus resource planning. A busy LLM service emits metric samples on every request, but the number of time series depends on label cardinality, not request rate: at 100 RPS with low-cardinality labels you are appending samples to a small, fixed set of series, which Prometheus handles comfortably. Use rate() queries over 5-minute windows to keep Grafana dashboards responsive.

Cardinality management. Never add high-cardinality attributes (user IDs, request IDs) as metric labels. Use traces for per-request detail, metrics for aggregate health. A metric label with 10,000 possible values will kill Prometheus.
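
One defensive pattern is to whitelist the labels you allow on metrics, so a high-cardinality attribute can never sneak into a counter. The helper below is a hypothetical sketch of that guard:

```python
# Only these labels may appear on metrics; everything else (user IDs,
# request IDs, session IDs) belongs on spans, not on metric labels.
ALLOWED_METRIC_LABELS = {"model", "task_type", "status"}

def safe_labels(attributes: dict) -> dict:
    """Drop any attribute that is not in the metric-label whitelist."""
    return {k: v for k, v in attributes.items() if k in ALLOWED_METRIC_LABELS}

attrs = {"model": "gpt-4o", "user_id": "u-83921", "request_id": "r-11872"}
print(safe_labels(attrs))  # → {'model': 'gpt-4o'}
```

Passing every counter and histogram call through a filter like this costs almost nothing and makes a cardinality explosion structurally impossible.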

OTLP batching. Configure the OTel SDK to batch spans and metrics before sending:

from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

resource = Resource.create({"service.name": "llm-chatbot"})

# Ship accumulated metrics every 5 seconds instead of on every request
exporter = OTLPMetricExporter(insecure=True)
reader = PeriodicExportingMetricReader(exporter, export_interval_millis=5000)
meter_provider = MeterProvider(resource=resource, metric_readers=[reader])

What to Monitor Next

This tutorial covers the foundation. From here, expand your observability stack in three directions:

Hallucination detection. Add an evaluation layer that cross-checks model outputs against ground truth at scale. Arize Phoenix (open source) and LangSmith both integrate with OTel for continuous evaluation.

vLLM-specific metrics. If you are running vLLM as your inference server, scrape its vllm:-prefixed metrics (GPU cache utilization, KV cache hit rate, batch scheduler latency) from the Prometheus /metrics endpoint that vLLM serves on its API port (8000 by default).

Fine-tuning cost tracking. Monitor dataset tokenization costs and training job resource usage - they dwarf inference costs at scale.
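
A rough way to size dataset tokenization cost before kicking off a job: the ~4-characters-per-token ratio below is a common rule of thumb for English text, not an exact count, and the $/1K rate is illustrative - use your model's real tokenizer and pricing for actual numbers:

```python
def estimate_tokens(text: str) -> int:
    # Rule-of-thumb heuristic: ~4 characters per token for English text.
    # For exact counts, run the model's actual tokenizer (e.g. tiktoken).
    return max(1, len(text) // 4)

def estimate_dataset_cost(num_docs: int, avg_chars: int, rate_per_1k: float) -> float:
    tokens = num_docs * estimate_tokens("x" * avg_chars)
    return tokens / 1000 * rate_per_1k

# 100k documents averaging 4,000 characters at an illustrative $0.008/1K tokens:
print(f"${estimate_dataset_cost(100_000, 4_000, 0.008):,.2f}")  # → $800.00
```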

Conclusion

OpenTelemetry, Prometheus, and Grafana give you enterprise-grade LLM observability without enterprise-grade cost or vendor lock-in. The stack handles token counting, cost tracking, latency analysis, error monitoring, and trace-level debugging - everything you need to run LLM applications with confidence.

Start with auto-instrumentation, add custom metrics for token and cost tracking, deploy the OTel Collector, and import the Grafana dashboard. You will have a working stack in an afternoon.

The signals you capture today become the alerts that prevent incidents tomorrow.