Your user types a question into your AI assistant and watches the loading spinner crawl. Three seconds. Five seconds. Seven seconds. They leave.

For every 100ms of added latency in LLM response streaming, engagement drops by 1%. That is not a vendor statistic pulled from thin air — it is the observable pattern across every production LLM deployment we have instrumented. When a user cannot see the model "thinking," they assume it is broken. When the first token takes too long, they leave before the streaming response ever begins.

LLM latency monitoring is not the same as monitoring a REST API. A 200ms p99 on your user service does not tell you anything about whether your users are seeing the first token of an LLM response within 2 seconds. The metrics are different, the anatomy of the delay is different, and the optimization strategies are in a completely different category.

This guide covers the metrics that matter for LLM latency, how to instrument them with OpenTelemetry and Prometheus, how to build Grafana dashboards that surface latency degradation before your users notice it, and the optimization techniques that move the needle.


Why LLM Latency Is Different From Traditional API Latency

Traditional API latency follows a simple model: client sends request, server processes, server returns response. You measure p50, p95, p99, and you are done. The entire operation is atomic — you get a complete response or an error.

LLM inference breaks this model in three fundamental ways:

Streaming changes the user experience equation. The relevant metric is not "when did the complete response arrive" but "when did the first token appear." A 10-second generation where the first token arrives at 500ms feels faster to a user than a 3-second generation where nothing happens for 2.8 seconds. User perception is tied to the streaming timeline, not the total duration.

The compute profile is asymmetric. The prefill phase (processing the input prompt) and the decode phase (generating output tokens) have completely different computational characteristics. Prefill is compute-bound and parallelizes well across the GPU. Decode is memory-bandwidth-bound: each new token requires reading the model weights and the full KV cache from GPU memory. A request with a 1000-token prompt and a 500-token completion has a very different latency profile from one with a 500-token prompt and a 1000-token completion, even though the total token count is the same.

Variance is higher and more consequential. A 2x latency spike on a traditional API is a bad day. A 2x latency spike on an LLM response — from 2s to 4s TTFT — can push your p95 past the threshold where users abandon the session, and the compounding effect of queue depth during peak traffic means small spikes propagate into minutes-long backlogs.

These differences are why standard APM tools designed for request-response services miss the picture entirely. You need LLM-native metrics and a monitoring stack built around the streaming, token-by-token nature of generation.


Key Metrics Deep Dive: TTFT, TPOT, ITL, and Friends

Understanding LLM latency requires a vocabulary of metrics that map to the actual phases of inference. Here is the complete picture.

Time to First Token (TTFT)

TTFT measures the elapsed time from when the client sends a complete request to when the first generated token appears in the stream. This is the single most impactful metric for user perception.

TTFT is dominated by two factors: the prefill phase (processing the input prompt) and any queue time waiting for a free slot in the inference engine. For short prompts with no queue contention, TTFT is essentially the prefill duration. For long contexts or high-load scenarios, queue time can dominate.

Target thresholds by use case:
- Chatbots / conversational AI: < 2 seconds TTFT
- Coding assistants: < 5 seconds TTFT
- Batch RAG pipelines: < 30 seconds TTFT is acceptable since the user is not waiting synchronously

Time Per Output Token (TPOT)

TPOT measures the average time to generate each output token after the first one has been delivered. It is calculated as (total generation time - TTFT) / number of output tokens.

TPOT is the metric that tells you about steady-state throughput. A TPOT of 50ms means your model is generating about 20 tokens per second — sufficient for real-time streaming but on the slow side for long-form content generation. A TPOT of 15ms means roughly 67 tokens/second, which feels brisk even for extended responses.

Inter-Token Latency (ITL)

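ITL is the time between two consecutive output tokens as seen in the stream. Averaged over a response it is essentially TPOT, but its distribution matters too: a stream with a steady 30ms gap feels smoother than one that averages 30ms with occasional 500ms stalls. The reciprocal of the average, tokens generated per second, is the rate form of the same measurement. Both forms are useful: per-token times are easier to reason about for SLA purposes, and tokens per second is more intuitive for capacity planning.

A healthy per-request decode rate for a mid-range model (70B parameters) is typically 40-80 tokens/second depending on hardware, quantization, sequence length, and batching efficiency. Note that a 70B model in FP16 needs about 140GB for its weights alone, so it does not fit on a single 80GB A100 without quantization or multi-GPU sharding. Larger models or longer contexts push the rate lower.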

End-to-End Latency (E2E)

E2E latency is the total time from request submission to final token delivery. This is the sum of TTFT plus (TPOT × number of output tokens). For non-streaming use cases or when measuring complete request lifecycle cost, E2E is what matters.

E2E is also the correct metric for batch workloads where the user is not waiting synchronously — you submit a prompt, you get the full response back, and the measure of success is total turnaround time.
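The four definitions above can be derived from a single list of token arrival timestamps. The following is a minimal sketch, assuming timestamps in seconds from a monotonic clock; the names `StreamMetrics` and `stream_metrics` are illustrative and not part of any library. It follows this guide's TPOT formula (dividing by the output token count); some tools divide by the number of inter-token gaps instead, which differs only slightly for long responses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StreamMetrics:
    ttft: float        # seconds from request send to first token
    tpot: float        # (E2E - TTFT) / output tokens, as defined above
    itl: List[float]   # gaps between consecutive tokens, in seconds
    e2e: float         # seconds from request send to last token


def stream_metrics(request_sent: float, token_times: List[float]) -> StreamMetrics:
    """Derive latency metrics from one request's token arrival timestamps."""
    if not token_times:
        raise ValueError("no tokens were received")
    ttft = token_times[0] - request_sent
    e2e = token_times[-1] - request_sent
    itl = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    tpot = (e2e - ttft) / len(token_times)
    return StreamMetrics(ttft=ttft, tpot=tpot, itl=itl, e2e=e2e)


# Hypothetical timestamps: request sent at t=0, five tokens arrive.
m = stream_metrics(0.0, [0.5, 0.55, 0.6, 0.65, 0.7])
print(f"TTFT={m.ttft:.2f}s TPOT={m.tpot * 1000:.0f}ms E2E={m.e2e:.2f}s")
```

Keeping the raw ITL list rather than only its mean lets you report stalls (for example the maximum gap) alongside the average.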

Tokens Generated Total

The cumulative count of tokens generated. This is a monotonic counter that enables calculating tokens-per-second rates, cost accumulation (at per-token pricing), and capacity planning. Tracked as llm_tokens_generated_total in Prometheus.

First Token Duration

The histogram of TTFT values. Tracked as llm_first_token_duration_seconds in Prometheus. The buckets should be tuned to your SLA thresholds — for a chatbot targeting < 2s TTFT, your histogram should have tight buckets around 0.5s, 1s, 1.5s, 2s, 3s, 5s, 10s.
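Bucket placement matters because percentiles are estimated from bucket counts, not from raw samples. The sketch below imitates the linear interpolation that Prometheus-style quantile estimation performs inside a bucket; it is a simplified illustration, and the function names and sample data are assumptions made for this example.

```python
from bisect import bisect_left
from typing import List, Sequence


def bucket_counts(bounds: Sequence[float], observations: Sequence[float]) -> List[int]:
    """Cumulative counts per upper bound, with an implicit +Inf bucket last."""
    counts = [0] * (len(bounds) + 1)
    for value in observations:
        counts[bisect_left(bounds, value)] += 1  # first bound >= value
    for i in range(1, len(counts)):
        counts[i] += counts[i - 1]
    return counts


def estimate_quantile(q: float, bounds: Sequence[float], cumulative: Sequence[int]) -> float:
    """Linear interpolation inside the bucket that contains the q-th rank."""
    if cumulative[-1] == 0:
        raise ValueError("no observations")
    rank = q * cumulative[-1]
    idx = next(i for i, c in enumerate(cumulative) if c >= rank)
    if idx == len(bounds):  # rank falls in the +Inf bucket
        return bounds[-1]
    lower = bounds[idx - 1] if idx > 0 else 0.0
    below = cumulative[idx - 1] if idx > 0 else 0
    in_bucket = cumulative[idx] - below
    return lower + (bounds[idx] - lower) * (rank - below) / in_bucket


# Hypothetical workload: every request has a TTFT of 1.6 seconds.
samples = [1.6] * 100
coarse = [1.0, 5.0]
fine = [1.0, 1.5, 2.0, 5.0]
p95_coarse = estimate_quantile(0.95, coarse, bucket_counts(coarse, samples))
p95_fine = estimate_quantile(0.95, fine, bucket_counts(fine, samples))
print(f"coarse buckets: {p95_coarse:.2f}s, fine buckets: {p95_fine:.2f}s")
```

With the coarse layout the estimated p95 lands well above a 2s threshold even though no request was that slow, while the finer layout stays below it. That is the reason to place bucket boundaries at and around the SLO threshold.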

Generation Duration

The total time spent generating a complete response (from start of prefill to final token). Tracked as llm_generation_duration_seconds in Prometheus. This is distinct from E2E latency in that it measures only the inference engine's active generation time, not queue time.


Streaming vs. Batch: Latency Tradeoffs

The choice between streaming and batch processing is not just architectural — it has direct implications for which latency metrics matter and how you optimize for them.

Streaming

Streaming delivers tokens to the client as they are generated via Server-Sent Events (SSE) or WebSocket. The key advantage is perceived latency: the user sees the model "working" within the first few seconds, which dramatically reduces abandonment rates even if the total generation time is unchanged.

Streaming does introduce overhead: each token or chunk is written to the connection as its own event, which adds framing and per-message cost (though tokens are typically batched in chunks of 4-16 to reduce it), and you need to handle partial response states on the client. For latency-sensitive consumer applications, this overhead is almost always worth it.

For streaming, prioritize:
- TTFT as the primary SLO target (user-visible first response)
- ITL/tokens-per-second as a secondary indicator of generation smoothness
- Error rate during streaming (connection drops, incomplete streams)
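These three streaming signals can be captured on the client with a thin wrapper around whatever iterator yields the chunks. This is a rough sketch, not tied to any SDK: `timed_stream` and the keys of the `timings` dictionary are hypothetical names, and the clock is injectable so the behaviour can be checked without real waiting.

```python
import time
from typing import Callable, Dict, Iterable, Iterator


def timed_stream(chunks: Iterable[str], timings: Dict[str, float],
                 clock: Callable[[], float] = time.monotonic) -> Iterator[str]:
    """Yield chunks unchanged while recording TTFT and stream completion.

    Timing starts when the consumer begins iterating, so start iterating
    at the moment the request is sent.
    """
    start = clock()
    timings["completed"] = 0.0  # stays 0.0 if the stream drops mid-way
    count = 0
    for chunk in chunks:
        if count == 0:
            timings["ttft"] = clock() - start
        count += 1
        yield chunk
    timings["e2e"] = clock() - start
    timings["chunks"] = float(count)
    timings["completed"] = 1.0


# Usage with a stand-in for a real SSE or WebSocket chunk iterator.
timings: Dict[str, float] = {}
text = "".join(timed_stream(iter(["Hel", "lo"]), timings))
print(text, timings["completed"])
```

If the underlying iterator raises (for example on a dropped connection), `completed` stays at 0.0 and no `e2e` value is recorded, which gives you the incomplete-stream signal.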

Batch Processing

Batch processing collects multiple requests and processes them together, which amortizes the prefill cost across requests and can significantly improve throughput efficiency. The tradeoff is latency — a request in a batch waits for other requests to accumulate before processing begins.

Batch processing is appropriate for:
- Asynchronous workflows (document processing, bulk summarization)
- Background jobs where total runtime matters more than time-to-first-byte
- Cost-sensitive workloads where throughput efficiency outweighs per-request latency

For batch processing, prioritize:
- E2E latency as the primary metric (when did the full job complete)
- Batch fill time (how long requests wait before a batch starts)
- Throughput (tokens/second per GPU, cost per token)


Latency Optimization Techniques

Understanding metrics tells you where the problem is. These techniques address the root causes.

KV Cache and Prefix Caching

The KV cache stores the key and value tensors that attention computes for each token during the prefill phase, so they do not have to be recomputed at every decode step. When a new request arrives, if it shares a prefix with a recent request (e.g., a system prompt or common instruction template), the prefill for that shared prefix can be skipped entirely, and the engine only has to process the new suffix.
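The size of the KV cache follows directly from the model shape: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below computes it for one sequence. The shape used (80 layers, head dimension 128, 8 or 64 KV heads) is an assumed Llama-2-70B-style configuration chosen for illustration; substitute your own model's values.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache for one sequence: keys and values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed shape with grouped-query attention: 8 KV heads, FP16 storage.
gqa = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=2048)
# Same shape with full multi-head attention: 64 KV heads.
mha = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128, seq_len=2048)
print(f"GQA: {gqa / 2**30:.3f} GiB, multi-head: {mha / 2**30:.3f} GiB")
```

The cache grows linearly with sequence length and with the number of concurrent sequences, which is why it, rather than the weights, usually determines how many requests fit on a GPU at once.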

Prefix caching reduces TTFT dramatically for workloads with common prompt templates — which describes almost every production LLM deployment. A 200-token system prompt cached once eliminates 200-token prefill overhead for every subsequent request that reuses it.

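Implementation varies by inference engine. vLLM implements prefix caching via hash-based cache keys that let the engine detect and reuse cached KV blocks. The memory overhead is modest and depends on the architecture: with FP16 storage, the KV cache for a 2048-token sequence is roughly 0.6 GiB on a Llama-2-70B-style model with grouped-query attention (80 layers, 8 KV heads, head dimension 128), and about 5 GiB for the same shape with full multi-head attention. The latency savings are substantial.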
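To make the idea of hash-based cache keys concrete, here is a simplified sketch of chained block hashing. It is an illustration of the general technique under assumed names (`block_hashes`, `cached_prefix_tokens`, a tiny `BLOCK_SIZE`), not a reproduction of vLLM's internal implementation.

```python
import hashlib
from typing import Dict, List, Sequence

BLOCK_SIZE = 4  # tokens per KV block; kept tiny for the example


def block_hashes(token_ids: Sequence[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Hash each full block together with its parent hash, so a hash
    identifies the entire prefix up to and including that block."""
    hashes: List[str] = []
    parent = ""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + ":" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


def cached_prefix_tokens(token_ids: Sequence[int], cache: Dict[str, int],
                         block_size: int = BLOCK_SIZE) -> int:
    """Count leading tokens whose KV blocks are already cached, then
    register this request's blocks for later requests."""
    hashes = block_hashes(token_ids, block_size)
    hits = 0
    for h in hashes:
        if h not in cache:
            break
        hits += 1
    for i, h in enumerate(hashes):
        cache.setdefault(h, i)  # the value stands in for a physical block id
    return hits * block_size


cache: Dict[str, int] = {}
system_prompt = list(range(1, 9))                      # 8 shared tokens
first = cached_prefix_tokens(system_prompt + [9, 10, 11, 12], cache)
second = cached_prefix_tokens(system_prompt + [20, 21, 22, 23], cache)
print(first, second)
```

Because each hash includes its parent, a block only matches when everything before it matches too. A request that shares the system prompt skips prefill for those tokens, while a request with even one different leading token gets no hits.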

Flash Attention

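Standard attention computation scales quadratically with sequence length: O(n²) in compute, and also O(n²) in memory when the full attention matrix is materialized. Flash Attention restructures the computation using tiling and kernel fusion in on-chip SRAM, so the full matrix is never written to GPU memory. This cuts memory traffic, brings the extra memory down to O(n), and achieves near-optimal compute utilization even at long context lengths.

Flash Attention does not change the theoretical FLOPs required, but it dramatically reduces the memory traffic of the attention operation. The benefit is largest where attention over many query positions dominates, which is the prefill of long prompts, so expect the clearest gains in TTFT at long context lengths. Decode-phase gains depend on the kernel variant the engine uses for single-token queries. The practical impact: 2-4x improvement in effective memory bandwidth utilization for attention, with the largest effect on longer sequences.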
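The core trick behind tiling is the online softmax: attention can be accumulated one tile of keys and values at a time by rescaling the running sums whenever a new maximum score appears. The pure-Python sketch below shows this for a single query vector and checks it against the naive computation. It illustrates the numerical idea only; real kernels run fused on the GPU, and the function names here are made up for the example.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def _scores(q: Vector, keys: Sequence[Vector]) -> List[float]:
    scale = 1.0 / math.sqrt(len(q))
    return [scale * sum(a * b for a, b in zip(q, k)) for k in keys]


def naive_attention(q: Vector, keys: Sequence[Vector], values: Sequence[Vector]) -> List[float]:
    """Softmax over all scores at once (materializes every score)."""
    scores = _scores(q, keys)
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def tiled_attention(q: Vector, keys: Sequence[Vector], values: Sequence[Vector],
                    tile: int = 2) -> List[float]:
    """Same result, visiting keys and values one tile at a time."""
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        scores = _scores(q, keys[start:start + tile])
        new_max = max(running_max, max(scores))
        rescale = math.exp(running_max - new_max)  # 0.0 on the first tile
        denom *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, values[start:start + tile]):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]


q = [1.0, 0.5]
keys = [[0.1, 0.2], [0.9, -0.3], [0.4, 0.4], [-0.7, 1.2], [2.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0], [-1.0, 3.0]]
print(naive_attention(q, keys, values), tiled_attention(q, keys, values))
```

Only one tile of scores exists at any time in the tiled version, which is why the full attention matrix never has to be stored.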

Most production inference engines (vLLM, TensorRT-LLM, SGLang) ship with Flash Attention enabled by default.

Chunked Prefill

Chunked prefill breaks the prefill phase into smaller chunks that can be interleaved with decode tokens in the GPU's execution schedule. Without chunking, a long prompt blocks all decode operations until prefill completes — creating a latency spike for concurrent users. With chunking, the GPU alternates between prefill and decode, maintaining responsive decode latency for existing requests even as new prompts enter the queue.

Chunked prefill is particularly valuable in multi-user serving scenarios where request sequence lengths vary widely. A 4096-token request can no longer block 64-token requests from being processed.
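A toy scheduler makes the interleaving visible. In this sketch each engine step has a fixed token budget: running decode requests get one token each, and whatever budget remains is spent on a chunk of a pending prompt. The function name, the budget of 512, and the request ids are assumptions for illustration; real engines apply more elaborate policies.

```python
from typing import Dict, List


def schedule_step(decoding: List[str], prefill_remaining: Dict[str, int],
                  token_budget: int) -> Dict[str, int]:
    """Plan one engine step: every decoding request gets one token first,
    then leftover budget is spent on chunks of pending prompts."""
    plan = {req: 1 for req in decoding[:token_budget]}
    budget = token_budget - len(plan)
    for req, remaining in prefill_remaining.items():
        if budget == 0:
            break
        chunk = min(remaining, budget)
        plan[req] = chunk
        budget -= chunk
    return plan


# A 4096-token prompt arrives while three requests are mid-decode.
pending = {"long-prompt": 4096}
decoders = ["a", "b", "c"]
steps = 0
decode_tokens = 0
while pending:
    plan = schedule_step(decoders, pending, token_budget=512)
    decode_tokens += sum(plan[d] for d in decoders)
    for req in list(pending):
        pending[req] -= plan.get(req, 0)
        if pending[req] == 0:
            del pending[req]
    steps += 1
print(f"prefill finished after {steps} steps; decoders emitted {decode_tokens} tokens meanwhile")
```

The long prompt takes several steps to prefill, but the three decoding requests keep producing a token on every one of those steps instead of stalling until the prefill is done.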

Paged Attention (vLLM)

vLLM's Paged Attention treats the KV cache as a virtual memory system with pages that can be allocated and reclaimed dynamically. Traditional inference engines allocate the full KV cache for max_sequence_length upfront — wasting GPU memory when sequences are shorter than the maximum. Paged Attention allocates only what is needed, enabling:

  • Higher batch utilization: More concurrent sequences per GPU
  • Longer context support: Effective context length is no longer limited by worst-case memory allocation
  • Prefix caching efficiency: Pages can be shared across sequences with common prefixes

The practical impact on latency: Paged Attention typically enables 2-3x throughput improvement at the same latency targets, or equivalent throughput at 20-30% lower latency due to reduced memory pressure.
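The allocation scheme can be sketched as a free list of fixed-size blocks plus a per-sequence block table. This is a toy model of the idea with hypothetical names (`PagedKVAllocator`, `append_token`, `release`); it tracks block ownership only and stores no tensors.

```python
from typing import Dict, List


class PagedKVAllocator:
    """Toy block allocator: sequences receive fixed-size KV blocks on demand."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[str, List[int]] = {}
        self.lengths: Dict[str, int] = {}

    def append_token(self, seq_id: str) -> None:
        """Reserve room for one more token, taking a new block only when
        the sequence's last block is full."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; request must wait or be preempted")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def release(self, seq_id: str) -> None:
        """Return all of a finished sequence's blocks to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(17):              # 17 tokens need two 16-token blocks
    alloc.append_token("chat-1")
print(len(alloc.block_tables["chat-1"]), "blocks used,", len(alloc.free_blocks), "free")
```

A 17-token sequence holds two blocks rather than a reservation for the maximum sequence length, so the remaining blocks stay available for other requests.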

Speculative Decoding with Draft Models

Speculative decoding uses a small, fast "draft" model to generate candidate token sequences, then uses the large model to verify multiple tokens in parallel. When the draft model is correct (which it frequently is for common patterns), the verified tokens are "free" — they appear in the output stream without waiting for the full model computation.

The speedup ratio depends on the acceptance rate — how often the draft model's predictions match the large model's. For typical conversational text, acceptance rates of 70-85% are achievable with a 7B draft model verifying a 70B main model. This yields effective throughput of 1.3-2x compared to autoregressive decoding alone.

The catch: speculative decoding increases compute cost (you run both models), so it is most valuable when you are latency-constrained rather than cost-constrained. It also requires the draft model to be compatible with the main model, which in practice means sharing its tokenizer and vocabulary. This is typically achieved by using a smaller model from the same family.
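The draft-then-verify loop can be sketched with toy stand-in models. This version uses greedy decoding, where a drafted token is accepted only if it equals the token the target model would have produced; sampling-based variants use a probabilistic acceptance rule instead. The `target` and `draft` functions are invented toys, and counting one verification pass per loop iteration assumes the target scores all drafted positions in a single batched forward pass, as real engines do.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # maps a token context to the next token


def greedy_decode(model: Model, prompt: List[int], max_new_tokens: int) -> List[int]:
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(model(out))
    return out


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       max_new_tokens: int, k: int = 3) -> Tuple[List[int], int]:
    """Return the decoded tokens and the number of target verification passes."""
    out = list(prompt)
    goal = len(prompt) + max_new_tokens
    verify_passes = 0
    while len(out) < goal:
        proposal: List[int] = []
        for _ in range(k):                 # cheap autoregressive drafting
            proposal.append(draft(out + proposal))
        verify_passes += 1                 # one batched pass in a real engine
        for token in proposal:
            expected = target(out)
            out.append(expected)           # always keep the target's token
            if token != expected:
                break                      # first mismatch: drop the rest
        else:
            out.append(target(out))        # all accepted: one bonus token
    return out[:goal], verify_passes


def target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def draft(ctx: List[int]) -> int:
    # Agrees with the target except after tokens ending in 4 (toy error).
    return 0 if ctx[-1] % 5 == 4 else (ctx[-1] + 1) % 10


tokens, passes = speculative_decode(target, draft, prompt=[0], max_new_tokens=12)
print(tokens, "target passes:", passes)
```

The output is identical to plain greedy decoding with the target model, because every emitted token is one the target chose. The gain is that several tokens are committed per target pass whenever the draft guesses correctly.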

Continuous Batching

Continuous batching (also called in-flight batching or iteration-level scheduling) assembles the batch at each token generation step rather than waiting for a full batch of requests to arrive. With static batching, new requests cannot join until every request in the current batch has completed. Continuous batching allows requests to enter and exit the batch at each generation step.

The result: GPU utilization stays high because there is always a full batch running, even as individual requests complete and new ones join. Average latency drops and throughput increases compared to static batching, particularly under heterogeneous request loads with varying sequence lengths.

Continuous batching is the default in vLLM and SGLang. It is one of the highest-leverage optimizations for production throughput at a given latency target.
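A small simulation shows where the gain comes from. It counts decode steps for the same workload under both policies, assuming every request needs one step per output token and ignoring prefill; the function names and the workload are made up for the illustration.

```python
from typing import List


def static_batching_steps(output_lengths: List[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes."""
    return sum(max(output_lengths[i:i + batch_size])
               for i in range(0, len(output_lengths), batch_size))


def continuous_batching_steps(output_lengths: List[int], batch_size: int) -> int:
    """A finished request frees its slot for the next waiting request
    at the very next decode step."""
    waiting = list(output_lengths)
    running: List[int] = []  # tokens still to generate per running request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        running = [remaining - 1 for remaining in running if remaining > 1]
        steps += 1
    return steps


# One long response and three short ones, two slots on the GPU.
lengths = [100, 10, 10, 10]
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
print(f"static: {static_steps} steps, continuous: {continuous_steps} steps")
```

Under static batching the short request sharing a batch with the long one leaves its slot idle for most of the run. Under continuous batching the short requests cycle through the second slot while the long one is still generating.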

Quantization

Quantization reduces the precision of model weights from FP16 or BF16 to INT8 or INT4, cutting memory bandwidth requirements proportionally. For memory-bandwidth-bound decode operations, this translates directly to lower TPOT.

INT8 quantization (with methods like SmoothQuant) typically maintains model quality within 1-2% while providing 1.5-2x throughput improvement. INT4 weight quantization (with methods like AWQ or GPTQ, or 4-bit GGUF and bitsandbytes formats) can push this to 2-3x throughput improvement but requires careful evaluation for accuracy-sensitive applications.

Quantization does increase prefill latency slightly (compute overhead for dequantization) and can affect output quality for certain task types. Evaluate on your specific workload before deploying to production.
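The mechanics can be illustrated with the simplest scheme, symmetric absmax quantization of a weight vector, together with the arithmetic for how many bytes of weights must be read per decode step. This is a teaching sketch under simplified assumptions (one scale for the whole vector, no calibration); production methods such as AWQ or GPTQ are considerably more sophisticated.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric absmax quantization: map [-max|w|, max|w|] onto [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    return [round(w / scale) for w in weights], scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    return [v * scale for v in quantized]


def weight_bytes(num_params: float, bits: int) -> float:
    """Bytes of weights that a decode step has to stream from GPU memory."""
    return num_params * bits / 8


weights = [0.5, -1.0, 0.25, 0.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
worst_error = max(abs(a - b) for a, b in zip(weights, restored))
print(quantized, f"max error {worst_error:.4f}")

for bits in (16, 8, 4):
    print(f"70B parameters at {bits}-bit: {weight_bytes(70e9, bits) / 1e9:.0f} GB")
```

Halving the bits halves the bytes read per generated token, which is where the TPOT improvement for bandwidth-bound decode comes from. The rounding error is the price, and it is why quality has to be checked on your own workload.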

Model Routing

Model routing directs requests to the most appropriate model based on request complexity. Simple factual queries might be handled by a 7B model with 50ms TTFT, while complex reasoning tasks route to a 70B model with 200ms TTFT.

The routing logic can be rule-based (prompt length, detected intent), ML-based (a lightweight classifier trained on historical request complexity), or cost-aware (always use the cheapest model that meets a quality threshold).

Model routing is less an optimization technique than a resource efficiency strategy, but for latency monitoring it introduces a new dimension: you need to track latency broken down by model to identify routing issues (simple requests unexpectedly routed to large models, or vice versa).
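A rule-based router can be very small. The sketch below is one possible shape, with entirely hypothetical model names, keyword hints, and length threshold. It returns the reason for each decision alongside the model so that both can be attached as labels to latency metrics.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Route:
    model: str   # which model serves the request
    reason: str  # why; useful as a metric label when debugging routing


# Hypothetical hints that a prompt needs the larger model.
REASONING_HINTS: Tuple[str, ...] = ("prove", "step by step", "refactor", "analyze")


def route_request(prompt: str, small_model: str = "small-7b",
                  large_model: str = "large-70b",
                  max_small_prompt_chars: int = 2000) -> Route:
    """Send long or reasoning-heavy prompts to the large model, the rest to the small one."""
    if len(prompt) > max_small_prompt_chars:
        return Route(large_model, "long_prompt")
    lowered = prompt.lower()
    if any(hint in lowered for hint in REASONING_HINTS):
        return Route(large_model, "reasoning_hint")
    return Route(small_model, "default")


print(route_request("What is the capital of France?"))
print(route_request("Analyze this stack trace step by step."))
```

Recording `model` and `reason` on every request makes a misrouted class of traffic show up as a shift in the per-model latency breakdown rather than as an unexplained change in the aggregate.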


Monitoring Stack Setup

OpenTelemetry Instrumentation with gen_ai.* Semantic Conventions

OpenTelemetry's semantic conventions for generative AI (gen_ai.*) provide a standardized vocabulary for LLM spans and metrics. Instrument your application with the OTel SDK and auto-instrumentation:

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Resource attributes describe the service; model details go on each span.
resource = Resource.create({"service.name": "llm-api-gateway"})

provider = TracerProvider(resource=resource)
# Export spans over OTLP/gRPC (the endpoint defaults to localhost:4317).
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

# Auto-instrumentation handles LLM call interception
# Manual span creation for custom logic:
tracer = trace.get_tracer(__name__)

# Attribute keys are written as plain strings so the example does not
# depend on a specific version of the semantic-conventions package.
with tracer.start_as_current_span("llm.generate") as span:
    span.set_attribute("gen_ai.system", "openai")
    span.set_attribute("gen_ai.operation.name", "chat")
    span.set_attribute("gen_ai.request.model", "gpt-4o")
    span.set_attribute("gen_ai.usage.input_tokens", 128)
    span.set_attribute("gen_ai.usage.output_tokens", 512)
    span.set_attribute("gen_ai.response.id", "chatcmpl-abc123")

The key gen_ai.* attributes for latency monitoring:

Attribute                     Description
gen_ai.system                 The AI system (openai, anthropic, etc.)
gen_ai.request.model          Model name for the request
gen_ai.response.model         Model that actually served the request (for routing)
gen_ai.usage.input_tokens     Token count for the input (formerly gen_ai.usage.prompt_tokens)
gen_ai.usage.output_tokens    Token count for the output (formerly gen_ai.usage.completion_tokens)
gen_ai.usage.total_tokens     Sum of input + output (a custom attribute, not part of the standard conventions)
gen_ai.operation.name         The operation type (chat, text_completion, embeddings)

Prometheus Metrics

Prometheus histogram buckets for LLM latency should be tuned to your SLO targets. A chatbot targeting < 2s TTFT needs tight buckets around that threshold:

metrics:
  - name: llm_first_token_duration_seconds
    type: histogram
    description: Time to first token in seconds
    buckets: [0.1, 0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 3.0, 5.0, 10.0, 30.0]

  - name: llm_generation_duration_seconds
    type: histogram
    description: Total generation time from prefill start to last token
    buckets: [0.5, 1.0, 2.5, 5.0, 10.0, 30.0, 60.0, 120.0, 300.0]

  - name: llm_tokens_generated_total
    type: counter
    description: Total number of tokens generated

  - name: llm_request_duration_seconds
    type: histogram
    description: End-to-end request duration including queue time
    buckets: [0.5, 1.0, 2.5, 5.0, 10.0, 30.0, 60.0, 120.0, 300.0]

  - name: llm_tpot_seconds
    type: histogram
    description: Time per output token (average after first token), observed once per request
    buckets: [0.005, 0.01, 0.02, 0.03, 0.05, 0.08, 0.1, 0.2, 0.5]

Query examples for common percentile calculations:

# TTFT p95 over the last 5 minutes, aggregated across all instances
histogram_quantile(0.95,
  sum by (le) (rate(llm_first_token_duration_seconds_bucket[5m]))
)

# Aggregate token throughput (tokens per second) averaged over 1 minute
sum(rate(llm_tokens_generated_total[1m]))

# E2E latency p99
histogram_quantile(0.99,
  sum by (le) (rate(llm_request_duration_seconds_bucket[5m]))
)

# Error ratio, to plot alongside latency and spot correlated spikes
sum(rate(llm_generation_errors_total[5m]))
  / sum(rate(llm_requests_total[5m]))

Grafana Dashboard Panels

A well-designed LLM latency dashboard surfaces three layers: real-time streaming health, percentile trends, and correlated errors.

TTFT over time (time series): Shows TTFT p50, p95, p99 with annotations when SLO thresholds are crossed. Overlay with request volume to identify whether latency spikes correlate with traffic spikes.

Token throughput (gauge + time series): Current tokens/second with trend. Useful for identifying when throughput degrades — a drop from 60 tokens/s to 40 tokens/s on the same hardware signals queue buildup or memory pressure.

Error rate vs. latency correlation (stat panel): When latency spikes, does error rate also spike? A correlation suggests queue overflow, GPU ECC errors, or throttling. A latency spike without errors suggests optimization opportunities (cache misses, prefill bottlenecks).

Queue depth (time series): The number of requests waiting for an inference slot. A growing queue during low-traffic periods indicates a throughput problem; a growing queue during peak traffic is expected but should have a defined cap.


Latency SLI Blueprint

Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for LLM latency must map to user-visible experience, not internal engine metrics. Here are the targets we use as a starting point, broken down by use case.

SLI Definitions for AI Systems

SLI                       Definition                                             Measurement Point
TTFT                      Time from request receipt to first token in stream    Client-side span, OTel gen_ai
TPOT                      (Total generation - TTFT) / output token count        Server-side metric
E2E Latency               Request receipt to final token delivery               Client-side span
Generation Availability   % of requests completing without error                Error counter / total counter
Streaming Integrity       % of streams delivering complete token sequence       Client-side completion tracking

SLO Targets by Use Case

Chatbot / Conversational AI
- TTFT p50: < 1.0s
- TTFT p95: < 2.0s
- TTFT p99: < 5.0s
- TPOT: < 80ms (tokens-per-second > 12.5)
- Generation availability: > 99.5%

Coding Assistant (Claude Code, GitHub Copilot)
- TTFT p50: < 2.0s
- TTFT p95: < 5.0s
- TTFT p99: < 10.0s
- TPOT: < 60ms (tokens-per-second > 16.7)
- Generation availability: > 99.9% (developers are impatient)

Batch RAG / Document Processing
- E2E Latency p50: < 15s
- E2E Latency p95: < 30s
- E2E Latency p99: < 60s
- Throughput: > 50 tokens/second
- Generation availability: > 99.0%

Streaming Data Extraction / Real-time Analytics
- TTFT p50: < 500ms
- TTFT p95: < 1.5s
- Token throughput: > 30 tokens/second
- Generation availability: > 99.9%

Error Budget Policy

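For latency SLOs, a 30-day rolling error budget of 1% is a reasonable starting point. That means over 30 days, no more than 1% of requests may exceed the latency threshold the budget is counted against. Choose that threshold deliberately: a p95 target on its own already tolerates 5% of requests above it, so a 1% budget pairs naturally with the p99 threshold. If the budget is consumed at a faster rate (e.g., > 10% of the budget in a single day), an incident is triggered.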

Latency SLO violations are weighted by severity: the measured p99 exceeding its target by 2x is a P1; the measured p95 exceeding its target by 2x is a P2; the measured p95 exceeding its target by 1.5x for more than 24 hours is a P3.
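The budget arithmetic is easy to get wrong by a factor, so here is a worked sketch. The request volumes are invented for illustration, and the function names are hypothetical; the 1% budget and the 10%-in-a-day trigger come from the policy above.

```python
def budget_consumed(slow_requests: int, window_requests: int,
                    budget_fraction: float = 0.01) -> float:
    """Fraction of the whole window's error budget used by these slow requests."""
    allowed_slow = window_requests * budget_fraction
    return slow_requests / allowed_slow


def burn_rate(slow_requests: int, interval_requests: int,
              budget_fraction: float = 0.01) -> float:
    """How many times faster than sustainable the budget is being spent
    during the measured interval (1.0 means exactly on budget)."""
    return (slow_requests / interval_requests) / budget_fraction


# Hypothetical traffic: 3,000,000 requests expected over 30 days,
# 100,000 requests today, of which 4,000 exceeded the latency threshold.
consumed_today = budget_consumed(slow_requests=4_000, window_requests=3_000_000)
rate_today = burn_rate(slow_requests=4_000, interval_requests=100_000)
incident = consumed_today > 0.10

print(f"budget consumed today: {consumed_today:.1%}, burn rate: {rate_today:.1f}x, "
      f"incident: {incident}")
```

A burn rate above 1.0 means the budget will run out before the window ends if the pace continues. The single-day trigger of 10% corresponds to a burn rate of 3x when traffic is evenly spread over the 30 days.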


The Monitoring Stack That Makes This Operational

Building the dashboards and metrics is step one. Making them part of your operational practice is step two.

The stack that works: OpenTelemetry for instrumentation (vendor-neutral, auto-instrumentation for common LLM frameworks), Prometheus for metrics collection (native OTel support, histogram percentiles, alerting), and Grafana for visualization and alerting (Prometheus data source, pre-built dashboard templates, alerting rules).

For teams that want managed infrastructure without self-hosting overhead, Grafana Cloud provides hosted Prometheus and Grafana with native OTel support. The trade-off is cost at high ingestion volume, but the operational simplicity is significant.

For vLLM users specifically, the combination of vLLM's built-in Prometheus metrics endpoint and Grafana's dashboard templates gives you 80% of a production monitoring setup with minimal configuration. You will need to add custom spans for application-level operations (retrieval, guardrails, response formatting) to get the full request lifecycle picture.

Portkey AI provides a managed observability layer with built-in support for gen_ai.* semantic conventions, pre-built latency dashboards, and cost tracking. It is useful for teams that want turnkey LLM observability without building the OTel/Prometheus/Grafana pipeline themselves.

Baseten offers inference infrastructure with built-in latency monitoring for models deployed on their platform, including TTFT tracking and token throughput dashboards out of the box.


Conclusion

LLM latency monitoring requires metrics that map to the streaming, token-by-token nature of generation rather than the atomic request-response model of traditional APIs. TTFT tells you when the user sees the first response. TPOT and ITL tell you about steady-state generation speed and smoothness. E2E latency tells you about total job completion time.

Build your monitoring stack around OpenTelemetry's gen_ai.* semantic conventions, instrument Prometheus histograms with buckets tuned to your SLO thresholds, and visualize in Grafana with panels that correlate latency degradation with error rates and queue depth.

The optimization toolkit is mature: prefix caching, flash attention, paged attention, continuous batching, speculative decoding, and quantization each target a specific bottleneck. Measure first, apply the technique that addresses your bottleneck, measure again.

And set SLOs. A latency SLO without an error budget policy is a document. The error budget is what makes it operational — it tells you when to declare an incident, when to roll back a change, and when the system is drifting out of control before your users feel it.

Start with TTFT. Get that under 2 seconds for your primary use case. Everything else follows.

