Why Traditional Tracing Falls Short for LLMs

Standard distributed traces capture request-response pairs across service boundaries. LLM inference is different:

  • Long, streaming responses: a single prompt can stream back hundreds of token chunks, and instrumenting each chunk naively produces hundreds of spans
  • Non-deterministic output: the same prompt can produce different outputs, and therefore different token counts, latencies, and tool calls, depending on sampling parameters
  • Nested tool calls: an LLM agent can trigger cascading downstream API calls based on generated content
  • Context propagation: prompt history, retrieved documents, and system instructions all affect output but aren't naturally captured in standard spans

OpenTelemetry's semantic conventions were extended in 2024-2025 to cover LLM-specific operations. Understanding these conventions is the first step to building meaningful observability for AI systems.


Setting Up the OTel SDK for LLM Instrumentation

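The OpenTelemetry Python SDK does not instrument LLM clients on its own. That support comes from separately installed instrumentation packages such as opentelemetry-instrumentation-openai and opentelemetry-instrumentation-litellm, which attach to the tracer provider configured below.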

# Install dependencies
# pip install opentelemetry-api \
#   opentelemetry-sdk \
#   opentelemetry-exporter-otlp \
#   opentelemetry-instrumentation-openai \
#   opentelemetry-instrumentation-litellm

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource, SERVICE_NAME

# Initialize the tracer provider with service identity
resource = Resource.create({
    SERVICE_NAME: "llm-inference-service",
    "service.version": "1.0.0",
    "deployment.environment": "production",
})

provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

The resource attributes (service.name, deployment.environment) appear in every span and enable filtering in your observability backend.


Tracing OpenAI-Compatible API Calls

If your inference runs through an OpenAI-compatible endpoint (OpenAI, Azure OpenAI, local vLLM, Ollama, or any LiteLLM proxy), the openai instrumentation package captures spans automatically:

from openai import OpenAI
from opentelemetry.instrumentation.openai import OpenAIInstrumentor

# One-line instrumentation — wraps all OpenAI API calls
OpenAIInstrumentor().instrument()

client = OpenAI(api_key="sk-...")

with tracer.start_as_current_span("llm.summary-task") as span:
    span.set_attribute("llm.prompt.template", "Summarize: {text}")
    span.set_attribute("llm.prompt.variables.text", user_input[:200])

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a technical summarizer."},
            {"role": "user", "content": user_input}
        ],
        temperature=0.3,
        max_tokens=500,
    )

    # Extract response attributes for the span
    span.set_attribute("llm.response.model", response.model)
    span.set_attribute("llm.response.usage.prompt_tokens", response.usage.prompt_tokens)
    span.set_attribute("llm.response.usage.completion_tokens", response.usage.completion_tokens)
    span.set_attribute("llm.response.usage.total_tokens", response.usage.total_tokens)
    span.set_attribute("llm.response.finish_reason", response.choices[0].finish_reason)

    summary = response.choices[0].message.content

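The openai instrumentation automatically creates a child span for each API call, capturing token counts, model selection, and latency, so no manual span management is required on the hot path. The set_attribute calls on the outer span are optional: they repeat some of that data under application-specific names and add fields such as the prompt template.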


Instrumenting Streaming Responses with Custom Spans

Streaming responses are the hardest case. The OpenAI SDK streams ChatCompletionChunk objects, each containing a partial delta. Naively instrumenting each chunk creates an unmanageable span explosion. Instead, use a single parent span and record the chunk count and time to first token as attributes:

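import time

async def stream_completion(client, prompt: str, model: str = "gpt-4o"):
    # client is expected to be an openai.AsyncOpenAI instance
    with tracer.start_as_current_span("llm.stream-completion") as span:
        span.set_attribute("llm.prompt.length", len(prompt))
        span.set_attribute("llm.model", model)

        start = time.perf_counter()
        stream = await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            stream=True,
            stream_options={"include_usage": True},
        )

        parts = []
        first_token_ms = None

        # Consume the stream inside the span so its duration covers the full response
        async for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                if first_token_ms is None:
                    first_token_ms = (time.perf_counter() - start) * 1000
                parts.append(chunk.choices[0].delta.content)
            if chunk.usage:
                # Final chunk: empty choices, usage totals for the whole request
                span.set_attribute("llm.response.usage.prompt_tokens", chunk.usage.prompt_tokens)
                span.set_attribute("llm.response.usage.completion_tokens", chunk.usage.completion_tokens)

        full_response = "".join(parts)
        span.set_attribute("llm.response.chunk_count", len(parts))
        span.set_attribute("llm.response.full_length", len(full_response))
        if first_token_ms is not None:
            span.set_attribute("llm.response.first_token_latency_ms", first_token_ms)

        return full_response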

The stream_options={"include_usage": True} parameter (available in OpenAI SDK v1.35+) adds one final chunk to the stream whose choices list is empty and whose usage field carries the token totals, avoiding the need to count tokens yourself during streaming.
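
The aggregation logic is easier to test when it is kept separate from the SDK call. The following is a minimal sketch, not part of the OpenAI or OpenTelemetry APIs: summarize_stream is a hypothetical helper that takes an iterable of already-extracted delta strings plus an injectable clock, and returns the attributes for the single parent span.

```python
import time
from typing import Callable, Iterable


def summarize_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Consume text chunks and return the attributes for one parent span."""
    start = clock()
    first_token_ms = None
    parts = []

    for text in chunks:
        if not text:
            continue  # skip empty deltas (role-only or usage-only chunks)
        if first_token_ms is None:
            # Time to first token, measured from just before iteration began
            first_token_ms = (clock() - start) * 1000.0
        parts.append(text)

    full_response = "".join(parts)
    attributes = {
        "llm.response.chunk_count": len(parts),
        "llm.response.full_length": len(full_response),
    }
    # Only report TTFT when at least one token actually arrived
    if first_token_ms is not None:
        attributes["llm.response.first_token_latency_ms"] = first_token_ms
    return attributes


# Usage with a fake stream; a real caller would pass the delta strings
# from the SDK and copy the result onto the span with span.set_attributes().
print(summarize_stream(["Hel", "", "lo"]))
```

Injecting the clock keeps the timing deterministic in unit tests, and returning a plain dict means the same function works whether or not a tracer is configured.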


Capturing LLM-as-Tool Calls with Semantic Conventions

Agentic AI systems use LLMs to decide when and how to call external tools. The OpenTelemetry semantic conventions for LLM operations (established in OTEP 0234) define standard attribute names for these interactions:

import json

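def trace_tool_call(span, tool_name: str, arguments: dict, result: str, latency_ms: float):
    """
    Record an LLM-initiated tool call using OTel semantic conventions.
    https://opentelemetry.io/docs/specs/semconv/gen-ai/
    """
    span.set_attribute("gen_ai.system", "openai")
    # The conventions name this attribute gen_ai.operation.name;
    # "execute_tool" is the value they define for tool execution
    span.set_attribute("gen_ai.operation.name", "execute_tool")
    span.set_attribute("gen_ai.tool.name", tool_name)
    # Span attribute values must be primitives, so serialize the arguments dict
    span.set_attribute("gen_ai.tool.call.arguments", json.dumps(arguments))
    span.set_attribute("gen_ai.tool.call.result", result[:500])  # Truncate for storage
    # Custom attribute, not part of the conventions; if the span wraps the
    # call, the span's own duration already records this
    span.set_attribute("gen_ai.tool.call.duration_ms", latency_ms)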

The gen_ai.* prefix is the standard namespace for generative AI attributes in OpenTelemetry semantic conventions. These attributes enable you to filter and aggregate by tool name in Grafana, Honeycomb, or any OTel-compatible backend.
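
To make the "filter and aggregate by tool name" point concrete, here is a small sketch that works on plain dictionaries instead of live spans. The helper name tool_call_attributes and the sample tool names are hypothetical; the attribute keys are the ones used in this section.

```python
import json
from collections import Counter


def tool_call_attributes(tool_name: str, arguments: dict, result: str,
                         max_len: int = 500) -> dict:
    """Build a flat attribute dict for one tool call."""
    return {
        "gen_ai.tool.name": tool_name,
        # Attribute values must be primitives, so the dict is serialized;
        # sort_keys keeps the serialized form stable across calls
        "gen_ai.tool.call.arguments": json.dumps(arguments, sort_keys=True),
        # Truncate long results to bound storage per span
        "gen_ai.tool.call.result": result[:max_len],
    }


# Three recorded calls, as a backend would see them after export
calls = [
    tool_call_attributes("search_docs", {"q": "otel"}, "3 hits"),
    tool_call_attributes("get_weather", {"city": "Oslo"}, "4C, rain"),
    tool_call_attributes("search_docs", {"q": "tempo"}, "x" * 2000),
]

# The equivalent of a "group by gen_ai.tool.name" query
calls_per_tool = Counter(c["gen_ai.tool.name"] for c in calls)
print(calls_per_tool.most_common())
```

Because every call uses the same key for the tool name, the grouping is a one-liner here, and it is the same property that lets a tracing backend group spans without per-service parsing rules.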


Context Propagation Across LLM Pipeline Stages

A production LLM pipeline typically involves multiple stages: prompt templating, retrieval augmentation, inference, and response post-processing. Each stage should propagate the same trace context to enable end-to-end latency analysis:

from opentelemetry.context import attach, detach
from opentelemetry.propagate import extract
from opentelemetry.trace import SpanKind

def process_with_context(headers: dict, user_input: str, retrieved_docs: list[dict]) -> str:
    # Extract the trace context from the incoming request headers
    # (assumes W3C trace context headers are present in the HTTP request)
    context = extract(headers)

    # Make the extracted context current so the spans below join the caller's trace
    token = attach(context)

    try:
        with tracer.start_as_current_span(
            "llm.pipeline.full",
            kind=SpanKind.INTERNAL,
        ):
            # Stage 1: Retrieval
            with tracer.start_as_current_span("llm.pipeline.retrieval") as retrieval_span:
                retrieval_span.set_attribute("retrieval.doc_count", len(retrieved_docs))
                retrieval_span.set_attribute(
                    "retrieval.avg_doc_length",
                    sum(len(d["content"]) for d in retrieved_docs) // max(len(retrieved_docs), 1),
                )

            # Stage 2: Prompt assembly
            with tracer.start_as_current_span("llm.pipeline.prompt-assembly") as prompt_span:
                prompt = assemble_rag_prompt(user_input, retrieved_docs)
                prompt_span.set_attribute("prompt.final_length", len(prompt))

            # Stage 3: Inference (child of pipeline span, sibling of retrieval/prompt)
            with tracer.start_as_current_span("llm.pipeline.inference") as inference_span:
                response_text = call_llm(prompt)
                inference_span.set_attribute("llm.response.length", len(response_text))

            return response_text

    finally:
        # Always restore the previous context, even if a stage raised
        detach(token)

The W3C traceparent header propagates context across HTTP boundaries. If your pipeline involves async message queues (SQS, Pub/Sub), embed the serialized context in the message payload:

import json

import boto3
from opentelemetry.propagate import inject, extract
from opentelemetry.context import Context

sqs = boto3.client("sqs")

def publish_to_queue(queue_url: str, payload: dict, parent_context: Context):
    """Inject trace context into a queue message for cross-service propagation."""
    headers = {}
    inject(headers, context=parent_context)
    # headers now contains traceparent, tracestate for W3C propagation
    # Build a new dict so the caller's payload is not mutated
    message = {**payload, "_trace_ctx": headers}
    sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(message))
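
On the consumer side, passing the embedded carrier dict to opentelemetry.propagate.extract restores the context. To show what that carrier actually holds, the sketch below parses a traceparent value by hand using only the standard library. It is an illustration of the W3C header layout (version, trace ID, parent span ID, flags), not a replacement for the propagator; parse_traceparent and TraceParent are hypothetical names, and only the four-field layout of version 00 is handled.

```python
import re
from typing import NamedTuple, Optional

# version(2 hex) - trace-id(32 hex) - parent-id(16 hex) - flags(2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(value: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT.match(value.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid in the W3C Trace Context spec
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # The lowest flag bit marks the trace as sampled
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


# A queue message body as the producer above would have written it
message = {
    "prompt": "Summarize: ...",
    "_trace_ctx": {
        "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    },
}
parsed = parse_traceparent(message["_trace_ctx"]["traceparent"])
print(parsed)
```

A consumer that finds no valid traceparent should start a new trace rather than fail the message, which is also what the propagator's extract does when the carrier is empty.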

OTel Collector Configuration for AI Traces

The OpenTelemetry Collector processes and exports your AI inference traces. A minimal production config for LLM traces:

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1000

  # Filter out health-check spans
  filter:
    spans:
      exclude:
        match_type: strict
        attributes:
          - key: http.target
            value: /healthz

  # Redact PII from prompt content in traces
  transform:
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
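          # Single-quoted so YAML leaves the backslashes and the ": " sequence alone.
          # [^\"]* stops at the closing quote instead of matching greedily to the last one,
          # and " *" allows the space json.dumps puts after the colon.
          - 'replace_pattern(attributes["llm.prompt.content"], "user_email=\"[^\"]*\"", "user_email=\"[REDACTED]\"")'
          - 'replace_pattern(attributes["gen_ai.tool.call.arguments"], "\"api_key\": *\"[^\"]*\"", "\"api_key\": \"[REDACTED]\"")'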

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: llm_inference
    const_labels:
      service: llm-inference-service

service:
  pipelines:
    traces:
      receivers: [otlp]
      # batch goes last so filtering and redaction run before spans are grouped for export
      processors: [filter, transform, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]

The transform processor is critical for compliance — prompts often contain user PII that shouldn't appear in your trace storage backend.
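
Redaction patterns are easy to get subtly wrong: a greedy .* can swallow everything up to the last quote on the line, and a pattern written for compact JSON can miss the space that json.dumps inserts after each colon. One way to catch this early is to exercise equivalent patterns in a unit test before they go into the Collector config. The sketch below uses Python's re module as a stand-in; the Collector uses Go regular expressions, which behave the same for simple patterns like these, and the REDACTIONS list and redact helper are hypothetical names.

```python
import json
import re

# (pattern, replacement) pairs mirroring the intent of the transform statements
REDACTIONS = [
    (re.compile(r'user_email="[^"]*"'), 'user_email="[REDACTED]"'),
    (re.compile(r'"api_key":\s*"[^"]*"'), '"api_key": "[REDACTED]"'),
]


def redact(text: str) -> str:
    """Apply every redaction pattern in order."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text


# Tool arguments exactly as json.dumps would serialize them
arguments = json.dumps({"api_key": "sk-secret", "city": "Oslo"})
print(redact(arguments))

# A greedy pattern for comparison: it also erases the unrelated plan field
greedy = re.sub(r'user_email=".*"', 'user_email="[REDACTED]"',
                'user_email="a@b.c" plan="pro"')
print(greedy)
```

These patterns still assume the secret value contains no escaped quote characters; values that can contain them need a stricter pattern or structured redaction on the parsed JSON.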


Key Metrics to Capture from LLM Pipelines

Beyond traces, your LLM pipeline should emit these metrics for capacity planning and cost monitoring:

Metric                           Type        Description
llm.request.duration             Histogram   End-to-end latency from prompt receipt to final token
llm.tokens.prompt                Counter     Total prompt tokens processed
llm.tokens.completion            Counter     Total completion tokens generated
llm.tokens.cost                  Counter     Estimated API cost in USD
llm.stream.first_token_latency   Histogram   Time to first token (TTFT)
llm.error.rate                   Gauge       Rate of API errors / rate limit hits
llm.tool.call.duration           Histogram   Per-tool call latency

Emit these with the OpenTelemetry Metrics API:

from opentelemetry import metrics

meter = metrics.get_meter(__name__)
token_counter = meter.create_counter(
    name="llm.tokens.total",
    description="Total tokens processed",
    unit="1",
)

# Record token usage after each response
token_counter.add(
    response.usage.prompt_tokens,
    {"token.type": "prompt", "model": response.model}
)
token_counter.add(
    response.usage.completion_tokens,
    {"token.type": "completion", "model": response.model}
)
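
The llm.tokens.cost metric in the table has to be derived from these token counts. A minimal sketch of that calculation follows. The model name and the prices in PRICES_PER_1K are placeholders, not real provider rates, and estimate_cost_usd is a hypothetical helper.

```python
from decimal import Decimal

# Hypothetical prices in USD per 1,000 tokens; substitute your provider's rates
PRICES_PER_1K = {
    "example-model": {"prompt": Decimal("0.0025"), "completion": Decimal("0.01")},
}


def estimate_cost_usd(model: str, prompt_tokens: int, completion_tokens: int,
                      prices: dict = PRICES_PER_1K) -> Decimal:
    """Estimate the cost of one response from its token usage."""
    rates = prices[model]  # raises KeyError for models without a price entry
    total = (Decimal(prompt_tokens) * rates["prompt"]
             + Decimal(completion_tokens) * rates["completion"])
    return total / Decimal(1000)


cost = estimate_cost_usd("example-model", prompt_tokens=1000, completion_tokens=500)
print(cost)
```

Decimal avoids accumulating float rounding error in the price arithmetic; the value can be converted with float(cost) at the point where it is added to a counter. Failing loudly on an unknown model is a deliberate choice here, since silently pricing a new model at zero would hide cost.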

Grafana Dashboard for LLM Pipeline Observability

With traces flowing into Grafana Tempo and metrics into Prometheus/Grafana Cloud, here's the minimum viable dashboard layout for LLM pipeline observability:

Panel 1: Request Volume + Error Rate

sum(rate(llm_inference_requests_total[5m])) by (model)

plotted against

sum(rate(llm_inference_errors_total[5m])) by (model)

Alert threshold: error rate > 1% triggers PagerDuty.

Panel 2: Token Cost by Model

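sum(increase(llm_tokens_total[24h])) by (model, token_type) / 1000 * token_price_per_1k

PromQL label names cannot contain dots, and by default the Collector's Prometheus exporter converts them to underscores, so the token.type attribute is queried as token_type. token_price_per_1k is a placeholder for the per-1,000-token price of each model, which you supply as a constant or a recording rule. This panel is essential for FinOps: you should know within 5% accuracy what each model costs per day.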

Panel 3: P50/P95/P99 Latency Heatmap

histogram_quantile(0.95, sum(rate(llm_request_duration_bucket[5m])) by (le, model))

The query above returns P95; swap 0.95 for 0.50 or 0.99 to get the other two series. P99 > 30s for streaming models is a signal to investigate cold start issues or rate limiting.
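
Percentiles computed from histograms are estimates, and how good they are depends on where the bucket boundaries sit. The sketch below reproduces the usual approach of linear interpolation inside the bucket that contains the target rank, as a simplified illustration rather than Prometheus's actual implementation. The bucket layout and counts are made up, and the function assumes positive bounds with a final +Inf bucket.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    buckets must be sorted by upper_bound and end with a math.inf bucket.
    """
    if not buckets or not 0 <= q <= 1:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: the best available answer
                # is the highest finite boundary
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            if count == prev_count:
                return prev_bound
            # Assume observations are spread evenly inside the bucket
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Cumulative counts for le = 1s, 5s, 30s, +Inf
latency_buckets = [(1.0, 50), (5.0, 90), (30.0, 99), (math.inf, 100)]
p95 = histogram_quantile(0.95, latency_buckets)
p999 = histogram_quantile(0.999, latency_buckets)
print(p95, p999)
```

Note that the estimate can never exceed the highest finite boundary, so an alert on P99 > 30s only works if the histogram has at least one bucket boundary above 30 seconds.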

Panel 4: Tool Call Frequency

sum(rate(gen_ai_tool_calls_total[5m])) by (tool_name)

Spikes in tool call frequency can indicate that your agent is entering loops or encountering ambiguous prompts.

Panel 5: First Token Latency Distribution

histogram_quantile(0.50, sum(rate(llm_first_token_latency_bucket[5m])) by (le, model))

TTFT > 5s on a streaming endpoint will feel unresponsive to users. Establish baselines per model version.


Common Pitfalls in LLM OTel Instrumentation

Pitfall 1: Capturing full prompt content in every span. Prompts can be thousands of tokens. Storing the full prompt in every span quickly inflates your trace storage costs. Instead, record llm.prompt.length and llm.prompt.template, storing the actual prompt in a separate document store keyed by a hash.

Pitfall 2: Missing sampling for high-volume endpoints. If your inference service handles 10,000 requests/minute, 100% trace sampling is expensive. Use tail-based sampling (the tail_sampling processor in the OTel Collector Contrib distribution) to keep 100% of error traces but only 1-5% of successful traces.

Pitfall 3: Forgetting to propagate context through async boundaries. LLM pipelines often use background queues (Celery, SQS, RabbitMQ). If you don't serialize and propagate the W3C trace context, each stage starts a new trace, and your pipeline shows up as several disconnected traces rather than a single end-to-end one.

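Pitfall 4: Ignoring model version drift. The same model name can resolve to different versions over time, for example when a provider repoints an alias such as gpt-4-turbo to a newer snapshot. Record the model identifier returned with each response (response.model) as a span attribute, since it can change between requests, and add model.version or model.revision if your inference provider exposes this information.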
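
Pitfalls 1 and 2 each come down to a small decision that can be sketched in a few lines. In the example below, PROMPT_STORE is an in-memory stand-in for a real document store, llm.prompt.sha256 is a hypothetical attribute name, and keep_trace only illustrates the decision rule (keep every error, keep a fixed share of successes); in production that decision is made by the Collector after the whole trace has arrived, not in application code.

```python
import hashlib

PROMPT_STORE = {}  # stand-in for a real document store keyed by hash


def prompt_fingerprint(prompt: str, template: str) -> dict:
    """Store the full prompt once and return small span attributes for it."""
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    PROMPT_STORE.setdefault(digest, prompt)
    return {
        "llm.prompt.length": len(prompt),
        "llm.prompt.template": template,
        "llm.prompt.sha256": digest,  # hypothetical attribute name
    }


def keep_trace(trace_id_hex: str, has_error: bool, success_rate: float = 0.05) -> bool:
    """Keep all error traces and a deterministic share of successful ones."""
    if has_error:
        return True
    # Map the low 64 bits of the trace ID onto [0, 1) so every service
    # makes the same decision for the same trace
    position = int(trace_id_hex[-16:], 16) / 2**64
    return position < success_rate


attrs = prompt_fingerprint("Summarize: quarterly report ...", "Summarize: {text}")
print(attrs["llm.prompt.sha256"][:12], keep_trace("0" * 32, has_error=False))
```

Identical prompts hash to the same key, so repeated requests add nothing to the store, and the span carries a 64-character digest instead of thousands of tokens.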


Putting It Together: A Complete Instrumented Pipeline

Here's the full pattern — a FastAPI endpoint with complete OTel instrumentation including context propagation, streaming support, and cost attribution:

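from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI
from opentelemetry import trace, metrics
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from pydantic import BaseModel

app = FastAPI()
# Extracts the incoming W3C trace context and creates the server span per request
FastAPIInstrumentor.instrument_app(app)

tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)
# The second positional argument of create_counter is unit, so pass description by keyword
request_counter = meter.create_counter("http.llm.requests", description="LLM HTTP requests")
token_histogram = meter.create_histogram("llm.tokens", description="LLM token usage")

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

class ChatRequest(BaseModel):
    model: str
    messages: list[dict]
    stream: bool = False

@app.post("/v1/chat")
async def chat(body: ChatRequest):
    request_counter.add(1, {"llm.model": body.model})
    # Child of the server span, which already carries the remote parent context
    with tracer.start_as_current_span(
        "chat",
        attributes={"llm.model": body.model},
    ) as span:
        # Call LLM (instrumented automatically via OpenAIInstrumentor)
        response = await client.chat.completions.create(
            model=body.model,
            messages=body.messages,
            stream=body.stream,
        )

        if body.stream:
            # stream_tokens (defined elsewhere) runs after this block exits, so the
            # "chat" span has already ended by then and should not be written to
            return StreamingResponse(stream_tokens(response, span))
        else:
            token_histogram.record(response.usage.total_tokens, {"llm.model": body.model})
            return format_response(response)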

The combination of automatic instrumentation (OpenAIInstrumentor handles the API calls) and manual spans (you control business logic boundaries) gives you complete visibility without excessive overhead.


Conclusion

OpenTelemetry has matured into the standard for LLM observability, but applying it to AI inference pipelines requires understanding its AI-specific semantic conventions. The key patterns — streaming span aggregation, gen_ai.* attributes for tool calls, W3C context propagation across async boundaries, and PII redaction — are what separate production-grade AI observability from toy implementations.

Building this instrumentation now pays compound dividends as your inference volume grows: the same traces that help you debug a single incident become the dataset for capacity planning, cost attribution, and model selection decisions.

Start with the OpenAI instrumentation package (one line of code), add custom spans for your pipeline-specific logic, and wire up a Grafana dashboard for the metrics that matter to your team. From there, expand into tail-based sampling and cost attribution as your observability needs mature.