Why Traditional Tracing Falls Short for LLMs
Standard distributed traces capture request-response pairs across service boundaries. LLM inference is different:
- Long, streaming responses: a single prompt can stream hundreds of chunks, and treating each chunk as its own span floods the trace
- Non-deterministic output: the same prompt can produce very different traces depending on sampling parameters
- Nested tool calls: an LLM agent can trigger cascading downstream API calls based on generated content
- Context propagation: prompt history, retrieved documents, and system instructions all affect output but aren't naturally captured in standard spans
OpenTelemetry's semantic conventions were extended in 2024-2025 to cover LLM-specific operations. Understanding these conventions is the first step to building meaningful observability for AI systems.
Setting Up the OTel SDK for LLM Instrumentation
The OpenTelemetry Python SDK supplies the tracer provider, span processors, and exporters. LLM instrumentation is installed separately, through packages such as opentelemetry-instrumentation-openai and opentelemetry-instrumentation-litellm.
# Install dependencies
# pip install opentelemetry-api \
# opentelemetry-sdk \
# opentelemetry-exporter-otlp \
# opentelemetry-instrumentation-openai \
# opentelemetry-instrumentation-litellm
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource, SERVICE_NAME
# Initialize the tracer provider with service identity
resource = Resource.create({
    SERVICE_NAME: "llm-inference-service",
    "service.version": "1.0.0",
    "deployment.environment": "production",
})
provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)
The resource attributes (service.name, deployment.environment) appear in every span and enable filtering in your observability backend.
Tracing OpenAI-Compatible API Calls
If your inference runs through an OpenAI-compatible endpoint (OpenAI, Azure OpenAI, local vLLM, Ollama, or any LiteLLM proxy), the openai instrumentation package captures spans automatically:
from openai import OpenAI
from opentelemetry.instrumentation.openai import OpenAIInstrumentor

# One-line instrumentation: wraps all OpenAI API calls
OpenAIInstrumentor().instrument()
client = OpenAI(api_key="sk-...")

# user_input holds the text to summarize and is supplied by the caller
with tracer.start_as_current_span("llm.summary-task") as span:
    span.set_attribute("llm.prompt.template", "Summarize: {text}")
    span.set_attribute("llm.prompt.variables.text", user_input[:200])
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a technical summarizer."},
            {"role": "user", "content": user_input},
        ],
        temperature=0.3,
        max_tokens=500,
    )
    # Extract response attributes for the span
    span.set_attribute("llm.response.model", response.model)
    span.set_attribute("llm.response.usage.prompt_tokens", response.usage.prompt_tokens)
    span.set_attribute("llm.response.usage.completion_tokens", response.usage.completion_tokens)
    span.set_attribute("llm.response.usage.total_tokens", response.usage.total_tokens)
    span.set_attribute("llm.response.finish_reason", response.choices[0].finish_reason)
    summary = response.choices[0].message.content
The openai instrumentation automatically creates child spans for each API call, capturing token counts, model selection, and latency, so no manual span management is required for the hot path.
Instrumenting Streaming Responses with Custom Spans
Streaming responses are the hardest case. The OpenAI SDK streams ChatCompletionChunk events; each event contains a partial delta. Naively instrumenting each chunk creates an unmanageable span explosion. Instead, use a single parent span that records the chunk count, response length, and time to first token as attributes:
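import time

async def stream_completion(client, prompt: str, model: str = "gpt-4o"):
    # client is expected to be an openai.AsyncOpenAI instance
    with tracer.start_as_current_span("llm.stream-completion") as span:
        span.set_attribute("llm.prompt.length", len(prompt))
        span.set_attribute("llm.model", model)
        start = time.perf_counter()
        stream = await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            stream=True,
            stream_options={"include_usage": True},
        )
        full_response = ""
        chunk_count = 0
        first_token_ms = None
        # Consume the stream; one span covers every chunk
        async for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                if first_token_ms is None:
                    first_token_ms = (time.perf_counter() - start) * 1000
                full_response += chunk.choices[0].delta.content
                chunk_count += 1
            if chunk.usage:
                # With include_usage, the final chunk carries the token counts
                span.set_attribute("llm.response.usage.total_tokens", chunk.usage.total_tokens)
        span.set_attribute("llm.response.chunk_count", chunk_count)
        span.set_attribute("llm.response.full_length", len(full_response))
        if first_token_ms is not None:
            span.set_attribute("llm.response.first_token_latency_ms", first_token_ms)
        return full_response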
The stream_options={"include_usage": True} parameter (available in OpenAI SDK v1.35+) provides final token counts at the end of the stream, avoiding the need to accumulate counts during streaming.
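The single-span pattern comes down to folding a stream of text deltas into a few summary values. The sketch below isolates that aggregation from the OpenAI client so it can be exercised without network access. `StreamStats`, `summarize_stream`, and the injectable `clock` parameter are illustrative names, not part of any SDK.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class StreamStats:
    text: str
    chunk_count: int
    first_token_latency_ms: Optional[float]


def summarize_stream(
    deltas: Iterable[Optional[str]],
    clock: Callable[[], float] = time.perf_counter,
) -> StreamStats:
    """Fold streamed text deltas into span-sized summary values."""
    start = clock()
    parts = []
    first_token_ms = None
    for delta in deltas:
        if not delta:
            # Skip empty or None deltas (for example role-only or usage-only chunks)
            continue
        if first_token_ms is None:
            # Time to first token, measured from the moment the request began
            first_token_ms = (clock() - start) * 1000.0
        parts.append(delta)
    return StreamStats("".join(parts), len(parts), first_token_ms)
```

The three fields map directly onto the llm.response.full_length, llm.response.chunk_count, and llm.response.first_token_latency_ms attributes, so the span stays a fixed size regardless of how many chunks arrive.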
Capturing LLM-as-Tool Calls with Semantic Conventions
Agentic AI systems use LLMs to decide when and how to call external tools. The OpenTelemetry semantic conventions for LLM operations (established in OTEP 0234) define standard attribute names for these interactions:
import json

def trace_tool_call(span, tool_name: str, arguments: dict, result: str, latency_ms: float):
    """
    Record an LLM-initiated tool call using OTel semantic conventions.
    https://opentelemetry.io/docs/specs/semconv/gen-ai/llm-semantic-conventions/
    """
    span.set_attribute("gen_ai.system", "openai")
    # The convention names this attribute gen_ai.operation.name;
    # execute_tool is its value for tool execution
    span.set_attribute("gen_ai.operation.name", "execute_tool")
    span.set_attribute("gen_ai.tool.name", tool_name)
    span.set_attribute("gen_ai.tool.call.arguments", json.dumps(arguments))
    span.set_attribute("gen_ai.tool.call.result", result[:500])  # Truncate for storage
    # Custom attribute: span duration already covers latency when the span wraps the call
    span.set_attribute("gen_ai.tool.call.duration_ms", latency_ms)
The gen_ai.* prefix is the standard namespace for generative AI attributes in OpenTelemetry semantic conventions. These attributes enable you to filter and aggregate by tool name in Grafana, Honeycomb, or any OTel-compatible backend.
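Building the attribute values in a pure function makes the serialization and truncation rules easy to test before they reach a span. This is a minimal sketch: `tool_call_attributes` and `MAX_ATTR_LEN` are hypothetical names, and the 500-character cap simply mirrors the truncation used above.

```python
import json
from typing import Any, Dict

MAX_ATTR_LEN = 500  # illustrative cap, mirroring the truncation above


def tool_call_attributes(tool_name: str, arguments: Dict[str, Any], result: str) -> Dict[str, str]:
    """Build a flat attribute dict describing one tool call."""

    def clip(value: str) -> str:
        # Keep attribute values bounded so one large payload cannot bloat a span
        return value[:MAX_ATTR_LEN]

    return {
        "gen_ai.tool.name": tool_name,
        # sort_keys gives a stable string; default=str covers non-JSON types
        "gen_ai.tool.call.arguments": clip(json.dumps(arguments, sort_keys=True, default=str)),
        "gen_ai.tool.call.result": clip(result),
    }
```

The returned dict can be passed to `span.set_attributes(...)` in one call.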
Context Propagation Across LLM Pipeline Stages
A production LLM pipeline typically involves multiple stages: prompt templating, retrieval augmentation, inference, and response post-processing. Each stage should propagate the same trace context to enable end-to-end latency analysis:
from opentelemetry.context import attach, detach
from opentelemetry.propagate import extract
from opentelemetry.trace import SpanKind

def process_with_context(headers: dict, user_input: str, retrieved_docs: list[dict]) -> str:
    # Extract the trace context from the incoming request headers
    # (assumes W3C trace context headers are present in the HTTP request)
    context = extract(headers)
    token = attach(context)
    try:
        with tracer.start_as_current_span(
            "llm.pipeline.full",
            kind=SpanKind.INTERNAL,
        ) as span:
            # Stage 1: Retrieval
            with tracer.start_as_current_span("llm.pipeline.retrieval") as retrieval_span:
                retrieval_span.set_attribute("retrieval.doc_count", len(retrieved_docs))
                avg_len = sum(len(d["content"]) for d in retrieved_docs) // max(len(retrieved_docs), 1)
                retrieval_span.set_attribute("retrieval.avg_doc_length", avg_len)
            # Stage 2: Prompt assembly
            with tracer.start_as_current_span("llm.pipeline.prompt-assembly") as prompt_span:
                prompt = assemble_rag_prompt(user_input, retrieved_docs)
                prompt_span.set_attribute("prompt.final_length", len(prompt))
            # Stage 3: Inference (child of pipeline span, sibling of retrieval/prompt)
            with tracer.start_as_current_span("llm.pipeline.inference") as inference_span:
                response_text = call_llm(prompt)
                inference_span.set_attribute("llm.response.length", len(response_text))
            return response_text
    finally:
        detach(token)
The W3C traceparent header propagates context across HTTP boundaries. If your pipeline involves async message queues (SQS, Pub/Sub), embed the serialized context in the message payload:
import json

import boto3
from opentelemetry.context import Context
from opentelemetry.propagate import inject

sqs = boto3.client("sqs")

def publish_to_queue(queue_url: str, payload: dict, parent_context: Context):
    """Inject trace context into a queue message for cross-service propagation."""
    headers = {}
    inject(headers, context=parent_context)
    # headers now contains traceparent (and tracestate, when set) for W3C propagation
    payload["_trace_ctx"] = headers
    sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(payload))
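On the consuming side, the worker has to read `_trace_ctx` back out of the message before starting its own span. In a real service you would hand that dict to `opentelemetry.propagate.extract`. The stdlib-only sketch below instead parses the `traceparent` value directly, to show what the carrier holds and how a malformed one can be rejected. `ParentRef` and `parent_from_message` are hypothetical names, and only the version 00 layout is handled.

```python
import json
import re
from typing import NamedTuple, Optional

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class ParentRef(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parent_from_message(body: str) -> Optional[ParentRef]:
    """Read the propagated parent span reference from a queue message body."""
    carrier = json.loads(body).get("_trace_ctx") or {}
    match = _TRACEPARENT.match(carrier.get("traceparent", ""))
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the spec
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    # The lowest flag bit marks the trace as sampled
    return ParentRef(trace_id, span_id, bool(int(flags, 16) & 0x01))
```

Returning `None` for a missing or malformed carrier lets the worker fall back to starting a new root trace rather than failing the message.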
OTel Collector Configuration for AI Traces
The OpenTelemetry Collector processes and exports your AI inference traces. A minimal production config for LLM traces:
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 1s
    send_batch_size: 1000
  # Filter out health-check spans
  filter:
    spans:
      exclude:
        match_type: strict
        attributes:
          - key: http.target
            value: /healthz
  # Redact PII from prompt content in traces
  transform:
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          # [^\"]* stops at the closing quote; .* would also swallow later fields
          - replace_pattern(attributes["llm.prompt.content"], "user_email=\"[^\"]*\"", "user_email=\"[REDACTED]\"")
          - replace_pattern(attributes["gen_ai.tool.call.arguments"], "\"api_key\":\"[^\"]*\"", "\"api_key\":\"[REDACTED]\"")
exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: llm_inference
    const_labels:
      service: llm-inference-service
service:
  pipelines:
    traces:
      receivers: [otlp]
      # Drop and redact first, then batch what remains
      processors: [filter, transform, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
The transform processor is critical for compliance — prompts often contain user PII that shouldn't appear in your trace storage backend.
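One detail worth testing in redaction patterns like these is how far a wildcard reaches. The sketch below uses Python's `re` module as a stand-in for the Collector's regex engine, on the assumption that these simple constructs behave the same way in both. The sample prompt and the names `GREEDY`, `BOUNDED`, and `redact` are made up for illustration.

```python
import re

# Greedy wildcard: runs to the LAST double quote on the line
GREEDY = re.compile(r'user_email=".*"')
# Bounded character class: stops at the first closing quote
BOUNDED = re.compile(r'user_email="[^"]*"')
REPLACEMENT = 'user_email="[REDACTED]"'


def redact(pattern: re.Pattern, text: str) -> str:
    """Replace every match of pattern in text with the redaction marker."""
    return pattern.sub(REPLACEMENT, text)


prompt = 'user_email="ana@example.com" plan="pro"'
greedy_result = redact(GREEDY, prompt)
bounded_result = redact(BOUNDED, prompt)
```

Here the greedy pattern removes the `plan` field along with the email, while the bounded pattern leaves it intact. Over-redaction is safer than a leak, but it silently destroys attributes you may want for debugging.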
Key Metrics to Capture from LLM Pipelines
Beyond traces, your LLM pipeline should emit these metrics for capacity planning and cost monitoring:
| Metric | Type | Description |
|---|---|---|
| llm.request.duration | Histogram | End-to-end latency from prompt receipt to final token |
| llm.tokens.prompt | Counter | Total prompt tokens processed |
| llm.tokens.completion | Counter | Total completion tokens generated |
| llm.tokens.cost | Counter | Estimated API cost in USD |
| llm.stream.first_token_latency | Histogram | Time to first token (TTFT) |
| llm.error.rate | Gauge | Rate of API errors / rate limit hits |
| llm.tool.call.duration | Histogram | Per-tool call latency |
Emit these with the OpenTelemetry Metrics API:
from opentelemetry import metrics

meter = metrics.get_meter(__name__)
token_counter = meter.create_counter(
    name="llm.tokens.total",
    description="Total tokens processed",
    unit="1",
)

# Record token usage after each response
token_counter.add(
    response.usage.prompt_tokens,
    {"token.type": "prompt", "model": response.model},
)
token_counter.add(
    response.usage.completion_tokens,
    {"token.type": "completion", "model": response.model},
)
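When these metrics are scraped by Prometheus, dotted attribute names such as token.type typically surface as underscore-separated labels, which matters when you write queries against them. The helper below approximates the classic Prometheus label-name rule. `to_prometheus_label` is an illustrative name, and real exporters may differ in details such as suffixes, namespace prefixes, or UTF-8 name support.

```python
import re

# Anything outside the classic label alphabet is replaced with an underscore
_INVALID = re.compile(r"[^a-zA-Z0-9_]")


def to_prometheus_label(name: str) -> str:
    """Approximate the classic label-name rule: [a-zA-Z_][a-zA-Z0-9_]*."""
    sanitized = _INVALID.sub("_", name)
    if sanitized and sanitized[0].isdigit():
        # A label name may not start with a digit
        sanitized = "_" + sanitized
    return sanitized
```

Checking the exported names against your backend's label browser is still the most reliable way to confirm what a given exporter version produces.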
Grafana Dashboard for LLM Pipeline Observability
With traces flowing into Grafana Tempo and metrics into Prometheus/Grafana Cloud, here's the minimum viable dashboard layout for LLM pipeline observability:
Panel 1: Request Volume + Error Rate
sum(rate(llm_inference_requests_total[5m])) by (model)
vs.
sum(rate(llm_inference_errors_total[5m])) by (model)
Alert threshold: error rate > 1% triggers PagerDuty.
Panel 2: Token Cost by Model
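sum(increase(llm_tokens_total[24h])) by (model, token_type) / 1000 * token_price_per_1k
In classic PromQL syntax label names cannot contain dots, so the token.type attribute is queried as token_type, and dividing by 1000 converts the raw token count into the units of a per-1,000-token price. This panel is essential for FinOps: you should know within 5% accuracy what each model costs per day.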
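The arithmetic behind this panel is simple enough to check by hand. The sketch below computes the same estimate in application code, which is also one way to feed a cost counter. The model name and the prices in `PRICE_PER_1K` are hypothetical placeholders, not real provider rates.

```python
from decimal import Decimal
from typing import Dict, Tuple

# Hypothetical USD prices per 1,000 tokens, keyed by (model, token type).
# Replace these with your provider's published rates.
PRICE_PER_1K: Dict[Tuple[str, str], Decimal] = {
    ("example-model", "prompt"): Decimal("0.0025"),
    ("example-model", "completion"): Decimal("0.0100"),
}


def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> Decimal:
    """Estimate the USD cost of one request from its token counts."""
    prompt_cost = Decimal(prompt_tokens) / 1000 * PRICE_PER_1K[(model, "prompt")]
    completion_cost = Decimal(completion_tokens) / 1000 * PRICE_PER_1K[(model, "completion")]
    return prompt_cost + completion_cost
```

`Decimal` is used instead of `float` so that many small per-request amounts do not accumulate rounding error when summed over a day.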
Panel 3: P50/P95/P99 Latency Heatmap
histogram_quantile(0.95, sum(rate(llm_request_duration_bucket[5m])) by (le, model))
P99 > 30s for streaming models is a signal to investigate cold start issues or rate limiting.
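It helps to know that histogram_quantile returns an estimate: it locates the bucket containing the target rank and interpolates linearly inside it, so the result is only as precise as your bucket boundaries. The function below is a simplified sketch of that calculation for positive-valued buckets. It is not the Prometheus implementation and ignores native histograms and other edge cases.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    The bucket with the largest bound must be +Inf, as in Prometheus.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or buckets[-1][1] == 0:
        return math.nan
    rank = q * buckets[-1][1]
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound
                return lower_bound
            # Linear interpolation between the bucket's lower and upper bounds
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan  # only reached if the counts are not cumulative
```

One practical consequence: if your largest finite bucket is 30s, no quantile can ever be reported above 30s, so the bucket layout needs to extend past whatever threshold you intend to alert on.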
Panel 4: Tool Call Frequency
sum(rate(gen_ai_tool_calls_total[5m])) by (tool_name)
Spikes in tool call frequency can indicate that your agent is entering loops or encountering ambiguous prompts.
Panel 5: First Token Latency Distribution
histogram_quantile(0.50, sum(rate(llm_first_token_latency_bucket[5m])) by (le, model))
TTFT > 5s on a streaming endpoint will feel unresponsive to users. Establish baselines per model version.
Grafana Cloud is one straightforward way to get started with LLM observability: it offers native OTel support, pre-built dashboards for LLM traces, and pay-as-you-go pricing for inference-heavy workloads.
Common Pitfalls in LLM OTel Instrumentation
Pitfall 1: Capturing full prompt content in every span. Prompts can be thousands of tokens. Storing the full prompt in every span quickly inflates your trace storage costs. Instead, record llm.prompt.length and llm.prompt.template, storing the actual prompt in a separate document store keyed by a hash.
Pitfall 2: Missing sampling for high-volume endpoints. If your inference service handles 10,000 requests/minute, 100% trace sampling is expensive. Use tail-based sampling (the tail_sampling processor in the OTel Collector contrib distribution) to keep 100% of error traces but only 1-5% of successful traces.
Pitfall 3: Forgetting to propagate context through async boundaries. LLM pipelines often use background queues (Celery, SQS, RabbitMQ). If you don't serialize and propagate the W3C trace context, your traces will show up as separate unconnected spans rather than a single pipeline trace.
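Pitfall 4: Ignoring model version drift. The same model name can return different versions over time (GPT-4-Turbo's underlying weights are updated). Record the model identifier returned with each response (for example response.model) as a span attribute, because resource attributes are fixed per process and cannot vary between requests. Add model.version or model.revision as well if your inference provider exposes this information.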
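For Pitfall 1, the span only needs enough information to find the prompt later. A minimal sketch of that idea, using a SHA-256 digest as the lookup key: `prompt_span_attributes` and the `llm.prompt.sha256` attribute name are illustrative choices, not part of any convention.

```python
import hashlib
from typing import Dict, Union


def prompt_span_attributes(template: str, rendered_prompt: str) -> Dict[str, Union[str, int]]:
    """Describe a prompt on a span without storing its text."""
    digest = hashlib.sha256(rendered_prompt.encode("utf-8")).hexdigest()
    return {
        "llm.prompt.template": template,
        "llm.prompt.length": len(rendered_prompt),
        # Illustrative attribute name; the digest keys the external prompt store
        "llm.prompt.sha256": digest,
    }
```

Identical prompts hash to the same key, so the document store deduplicates repeated prompts for free, and the span stays a fixed size no matter how long the prompt is.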
Putting It Together: A Complete Instrumented Pipeline
Here's the full pattern — a FastAPI endpoint with complete OTel instrumentation including context propagation, streaming support, and cost attribution:
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.propagate import extract
from opentelemetry import trace, metrics

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)
tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)
request_counter = meter.create_counter("http.llm.requests", description="LLM HTTP requests")
token_histogram = meter.create_histogram("llm.tokens", description="LLM token usage")

# ChatRequest, client, stream_tokens, and format_response are defined elsewhere in the service
@app.post("/v1/chat")
async def chat(request: Request, body: ChatRequest):
    ctx = extract(dict(request.headers))
    with tracer.start_as_current_span(
        "chat",
        context=ctx,
        attributes={"http.method": "POST", "llm.model": body.model},
    ) as span:
        request_counter.add(1, {"model": body.model})
        # Call LLM (instrumented automatically via OpenAIInstrumentor)
        response = client.chat.completions.create(
            model=body.model,
            messages=body.messages,
            stream=body.stream,
        )
        if body.stream:
            # Note: this span ends when the block exits, before the stream is fully consumed
            return StreamingResponse(stream_tokens(response, span))
        else:
            token_histogram.record(response.usage.total_tokens, {"model": body.model})
            return format_response(response)
The combination of automatic instrumentation (OpenAIInstrumentor handles the API calls) and manual spans (you control business logic boundaries) gives you complete visibility without excessive overhead.
Conclusion
OpenTelemetry has matured into the standard for LLM observability, but applying it to AI inference pipelines requires understanding its AI-specific semantic conventions. The key patterns — streaming span aggregation, gen_ai.* attributes for tool calls, W3C context propagation across async boundaries, and PII redaction — are what separate production-grade AI observability from toy implementations.
Building this instrumentation now pays compound dividends as your inference volume grows: the same traces that help you debug a single incident become the dataset for capacity planning, cost attribution, and model selection decisions.
Start with the OpenAI instrumentation package (one line of code), add custom spans for your pipeline-specific logic, and wire up a Grafana dashboard for the metrics that matter to your team. From there, expand into tail-based sampling and cost attribution as your observability needs mature.