Every LLM deployment has the same trajectory: costs start small, grow unpredictably, and by the time engineering teams notice, they're presenting uncomfortable numbers to finance. The root cause is almost always the same — no one built observability into the LLM layer from the start.

Unlike traditional software services where compute costs are relatively predictable (CPU cycles, memory allocations, network egress), LLM inference costs are tied to token counts that vary with every input. A single poorly-optimized prompt that repeats context unnecessarily can multiply costs by 10x without any change in output quality.

This guide covers the complete approach to LLM cost monitoring in 2026: the metrics that matter, the tools built specifically for AI spend visibility, and the implementation patterns that give engineering and finance teams real-time insight into where every dollar goes.

Why Standard APM Tools Fail for LLM Cost

Application Performance Monitoring (APM) tools like Datadog, New Relic, and Grafana were built to track request counts, latency percentiles, and error rates. They can monitor your LLM API calls the same way they monitor any HTTP endpoint — but they have no native understanding of tokens, model versions, or the concept of input vs. output token accounting.

When OpenAI charges you $2.50/1M tokens for GPT-4o input and $10/1M tokens for output, a generic APM dashboard that only tracks API call count and latency tells you nothing about whether your latest prompt engineering change increased your bill by 3x or cut it in half.

LLM cost monitoring requires:

  • Token-level granularity — input vs. output, per-request breakdown
  • Model and version tracking — GPT-4o vs GPT-4o-mini vs GPT-4-Turbo all have different prices
  • User and session attribution — which customer, team, or feature generated this spend
  • Budget alert thresholds tied to actual dollar amounts, not request counts
  • Cost trend analysis across time windows, model migrations, and feature releases

The Core Metrics for LLM Cost Monitoring

Before evaluating tools, establish the metric foundation — the minimum set of measurements your monitoring layer must capture:

Tokens per Request (Input and Output)
Every LLM API call consumes input tokens (the prompt + context) and generates output tokens (the completion). Track both separately — they have different per-token costs, and input tokens are often where surprise costs hide (context that grows over a session, retrieval results that aren't being truncated).
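
Most provider SDKs return token counts on the response itself, so you do not need to count tokens client-side. A minimal sketch, assuming an OpenAI-style response object (the stub here stands in for a real API response; other providers expose the same counts under different field names):

```python
from types import SimpleNamespace

# Sketch: pull input/output token counts from an OpenAI-style chat
# completion response. Field names (`usage.prompt_tokens`,
# `usage.completion_tokens`) follow the OpenAI Python SDK.
def extract_token_usage(response) -> dict:
    usage = response.usage
    return {
        "input_tokens": usage.prompt_tokens,
        "output_tokens": usage.completion_tokens,
    }

# Hypothetical response stub standing in for a real API response.
demo = SimpleNamespace(
    usage=SimpleNamespace(prompt_tokens=120, completion_tokens=30)
)
demo_usage = extract_token_usage(demo)
```

Track the two counts as separate metrics from day one — collapsing them into a single total hides exactly the input-side growth this section warns about.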

Cost per 1,000 Requests
Aggregate the actual dollar cost per 1,000 API calls, broken down by model. This is your primary FinOps metric — it tells you whether you are trending in the right direction month-over-month.
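
Computing this from per-request records is a small aggregation. A sketch — the record shape is an assumption, so adapt it to however your pipeline stores model and cost:

```python
from collections import defaultdict

# Sketch: aggregate per-request cost records into cost per 1,000
# requests, broken down by model.
def cost_per_1k_requests(records):
    totals = defaultdict(lambda: {"cost": 0.0, "count": 0})
    for rec in records:
        totals[rec["model"]]["cost"] += rec["cost_usd"]
        totals[rec["model"]]["count"] += 1
    return {
        model: 1000 * agg["cost"] / agg["count"]
        for model, agg in totals.items()
    }

records = [
    {"model": "gpt-4o", "cost_usd": 0.012},
    {"model": "gpt-4o", "cost_usd": 0.008},
    {"model": "gpt-4o-mini", "cost_usd": 0.001},
]
per_1k = cost_per_1k_requests(records)
```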

Output-to-Input Token Ratio
For question-answering tasks, a concise answer should require far fewer tokens than the prompt and retrieved context that produced it. If your output-to-input token ratio consistently runs above 0.7 on these tasks, that often indicates inefficient prompting or retrieval quality issues — the model is padding weak context rather than answering.
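
A sketch of the check, using the 0.7 heuristic above (the threshold is task-specific and worth tuning per feature):

```python
# Sketch: flag requests whose output/input token ratio exceeds a
# task-specific threshold.
def flag_high_output_ratio(input_tokens, output_tokens, threshold=0.7):
    if input_tokens == 0:
        return float("inf"), True
    ratio = output_tokens / input_tokens
    return ratio, ratio > threshold

ratio, flagged = flag_high_output_ratio(1000, 800)   # verbose output
_, ok = flag_high_output_ratio(1000, 300)            # within budget
```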

Time-to-First-Token (TTFT) vs. Inter-Token Latency (ITL)
LLM serving infrastructure costs money. TTFT is dominated by prefill (processing the input context), which scales with input token count. ITL is driven by output token generation. If you are paying for long-context prefill on requests that get abandoned mid-stream, you are burning money on invisible work.
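
Both numbers can be measured from any streaming client with a thin wrapper. A sketch with an injectable clock so it is testable — pass your SDK's chunk iterator as `stream`:

```python
import time

# Sketch: measure time-to-first-token (TTFT) and mean inter-token
# latency (ITL) over a streaming response. `clock` is injectable
# for testing; it defaults to a monotonic wall clock.
def measure_stream_latency(stream, clock=time.monotonic):
    start = clock()
    ttft = None
    timestamps = []
    for _ in stream:
        now = clock()
        if ttft is None:
            ttft = now - start  # first chunk arrival
        timestamps.append(now)
    itl = None
    if len(timestamps) > 1:
        gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
        itl = sum(gaps) / len(gaps)
    return ttft, itl

# Simulated clock: start at 0.0s, chunks at 0.5s, 0.6s, 0.7s.
vals = iter([0.0, 0.5, 0.6, 0.7])
ttft, itl = measure_stream_latency(["a", "b", "c"], clock=lambda: next(vals))
```

Abandoned streams show up here as requests with a recorded TTFT (prefill paid for) but a truncated chunk count — exactly the invisible work described above.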

Cache Hit Rate
OpenAI and Anthropic both offer prompt caching (discounted rates on repeated prompt prefixes), and gateway layers such as Helicone add semantic caching on top. A high cache hit rate (50%+) dramatically reduces effective cost per request. Monitoring cache hit rate alongside raw request counts reveals whether caching is actually working.
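
The effect of caching on spend is easy to model. A sketch, where `cached_fraction` is the assumed relative cost of a cache hit — 0 for a gateway cache that returns stored responses outright, something like 0.1 for provider prompt-cache discounts (exact discounts vary by provider):

```python
# Sketch: effective cost per request given a cache hit rate.
def effective_cost(full_cost_usd, hit_rate, cached_fraction=0.0):
    return full_cost_usd * ((1 - hit_rate) + hit_rate * cached_fraction)

halved = effective_cost(0.01, 0.5)                        # gateway cache
discounted = effective_cost(0.01, 0.5, cached_fraction=0.1)
```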

Budget Burn Rate
For any given budget (daily, weekly, monthly), track the percentage consumed and the rate of consumption. If you are burning through your monthly budget at day 10, you need to know before day 30.
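
Burn-rate math is simple enough to verify by hand. A sketch that projects the day a monthly budget runs out, given spend to date:

```python
# Sketch: budget burn rate and projected exhaustion day for a
# monthly budget.
def burn_rate(spend_to_date, budget, day_of_month):
    pct_consumed = spend_to_date / budget
    daily_rate = spend_to_date / day_of_month
    projected_exhaustion_day = budget / daily_rate if daily_rate else None
    return pct_consumed, daily_rate, projected_exhaustion_day

# $600 of a $1,000 monthly budget gone by day 10:
pct, daily, exhaust_day = burn_rate(600.0, 1000.0, 10)
```

In this example the budget runs out around day 17 — the alert needs to fire well before day 30.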

Building a Token Attribution Model

The most valuable LLM cost monitoring setup is one where you can answer: "How much did the X feature cost last month, and which users/teams/customers drove that spend?"

This requires propagating metadata through your entire LLM call chain:

from opentelemetry import trace
from opentelemetry.trace import SpanKind

# Track LLM calls with cost attribution metadata
def call_llm_with_tracing(model: str, input_tokens: int, output_tokens: int,
                           user_id: str, feature: str, request_id: str):
    # Per-token pricing (2026 rates, USD per 1M tokens)
    pricing = {
        "gpt-4o": {"input": 2.50, "output": 10.00},
        "gpt-4o-mini": {"input": 0.15, "output": 0.60},
        "gpt-4-turbo": {"input": 10.00, "output": 30.00},
        "claude-3-5-sonnet": {"input": 3.00, "output": 15.00},
        "claude-3-5-haiku": {"input": 0.80, "output": 4.00},
    }

    if model not in pricing:
        raise ValueError(f"No pricing configured for model: {model}")

    input_cost = (input_tokens / 1_000_000) * pricing[model]["input"]
    output_cost = (output_tokens / 1_000_000) * pricing[model]["output"]
    total_cost = input_cost + output_cost

    # Emit to your metrics pipeline
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("llm.api_call", kind=SpanKind.CLIENT) as span:
        span.set_attribute("llm.model", model)
        span.set_attribute("llm.input_tokens", input_tokens)
        span.set_attribute("llm.output_tokens", output_tokens)
        span.set_attribute("llm.cost_usd", total_cost)
        span.set_attribute("llm.user_id", user_id)
        span.set_attribute("llm.feature", feature)
        span.set_attribute("llm.request_id", request_id)

    return total_cost

The critical design decision is where to emit this cost data. Three options in order of implementation complexity:

1. Middleware/logging layer — Wrap your LLM client calls and emit structured logs (JSON) to your log aggregator. Simplest to implement, works with any provider. Query via Grafana + Loki or Elasticsearch.

2. OpenTelemetry with a cost metric — Add a custom llm.cost_usd span attribute to every LLM call. Send to a Prometheus-compatible backend via OTLP. Enables Grafana dashboards alongside your existing infrastructure metrics.

3. Dedicated AI cost platform — Use Helicone, Braintrust, or Portkey to proxy your LLM calls. These platforms handle token counting, cost calculation, and attribution automatically. More opinionated but significantly faster to operationalize.
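
For option 1, the wrapper can be as small as one function that emits a structured JSON log line per call. A sketch — `call_fn` stands in for your provider client, and the field names are assumptions to match to your log aggregator's schema:

```python
import json
import logging
import time

logger = logging.getLogger("llm.cost")

# Sketch: wrap any LLM client call and log one JSON event with the
# attribution metadata described above.
def logged_llm_call(call_fn, *, model, user_id, feature, **kwargs):
    start = time.monotonic()
    response = call_fn(model=model, **kwargs)
    logger.info(json.dumps({
        "event": "llm_call",
        "model": model,
        "user_id": user_id,
        "feature": feature,
        "input_tokens": response["input_tokens"],
        "output_tokens": response["output_tokens"],
        "latency_s": round(time.monotonic() - start, 3),
    }))
    return response

# Hypothetical provider stub for demonstration.
def fake_provider(model, prompt):
    return {"input_tokens": 10, "output_tokens": 5}

result = logged_llm_call(fake_provider, model="gpt-4o",
                         user_id="u1", feature="chat", prompt="hi")
```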

The Tools: LLM Cost Monitoring in 2026

Helicone

Helicone is the simplest path to LLM cost observability. You point your API calls through their proxy and get token-level logging, cost attribution, and analytics with zero instrumentation changes required for most OpenAI-compatible endpoints.

What it does well: Cost per request, per model, per user — automatic. Cache analytics showing exactly which requests hit cache and how much that saved. Retention analytics to see how your cost-per-feature changes over time. The semantic cache visualization is particularly useful — you can see exactly which prompt patterns have the highest cache hit rates.

Integrations: OpenAI, Anthropic, Azure OpenAI, AWS Bedrock, Google Vertex AI, and any OpenAI-compatible API. Grafana, Datadog, and BigQuery integrations for sending data to your existing infrastructure.

Pricing: Free tier up to 100K events/month. $100/month for 500K events. Enterprise tier with custom volume pricing.

Best for: Teams that want LLM cost visibility immediately without building instrumentation. The proxy model means you can start tracking costs on existing production traffic within an afternoon.


Portkey

Portkey positions itself as a full AI gateway with cost management, reliability, and observability as first-class features. Beyond monitoring, it provides automatic retries, fallback chains, and load balancing across multiple LLM providers.

What it does well: Cost tracking across multiple providers in a single view — if you are running GPT-4o for some features and Claude for others, Portkey gives you unified spend dashboards. Virtual keys for per-customer or per-feature cost attribution. The budget alerting system lets you set spend thresholds at the API key level.

Integrations: 100+ LLM providers including all major ones plus open-source models. LangChain and LlamaIndex integrations for tracing. Webhook-based alerting to Slack, PagerDuty, and custom endpoints.

Pricing: Free tier with 100K tracked tokens/month. $49/month for 5M tokens. Enterprise with custom volume.

Best for: Teams running multi-provider LLM infrastructure who need unified cost attribution and reliability features (retries, fallbacks) alongside monitoring.


Grafana + OpenTelemetry

For teams already running Grafana for infrastructure monitoring, adding LLM cost tracking via OpenTelemetry is the most operationally coherent approach. All your observability data lives in one place.

Implementation: Use the OpenTelemetry SDK in your LLM client wrapper (as shown in the code example above) and send cost attributes as custom span metrics to Grafana Tempo or Mimir. Configure Grafana dashboards to display:

  • Cost over time by model (using the llm.cost_usd metric)
  • Token ratio trends (input/output)
  • Top users/features by spend
  • Budget burn rate

What it does well: Full control over your data. No third-party data processing. Unified observability view across infrastructure, application, and LLM metrics. Grafana's powerful alerting system works for budget alerts too.

What it doesn't do: Automatic token counting — you need to implement token tracking in your wrapper. Cache hit tracking requires additional instrumentation.

Pricing: Grafana Cloud free tier covers 10K series. Self-hosted Grafana is free. Mimir (for long-term metrics storage) is open-source. Costs are your infrastructure only.

Best for: Teams already committed to Grafana who want a single pane of glass and are willing to build the token tracking instrumentation themselves.


Cloudflare Workers AI — Built-in Cost Analytics

If you are running models on Cloudflare Workers AI, cost monitoring is native to the platform. Cloudflare's pricing model is different from token-based providers — usage is metered in a single platform unit ("neurons") across models rather than per-model token rates — which simplifies cost tracking.

Workers AI publishes per-model request counts and latency distributions in the Cloudflare dashboard. For more detailed cost analytics, use Cloudflare's analytics API to pull data into Grafana or build custom dashboards.

What it does well: A single billing unit makes cost prediction straightforward. Zero infrastructure overhead. Network latency to the edge is minimal for most users.

What it doesn't do: Granular token-level attribution — you get request counts and latency, but not per-request token breakdowns without building that instrumentation yourself.

Best for: Applications where you are already on Cloudflare's edge network and want minimal operational overhead for running AI inference.


Implementing Budget Alerts

Alerts are where cost monitoring becomes financial controls. Without alerts, you will always discover cost overruns after the fact. Set these three alert tiers:

Warning Alert (80% of daily budget)
At 80% of your daily budget threshold, send a notification to your engineering lead or FinOps team. The goal is to give humans time to investigate before the budget is gone.

# Prometheus alerting rule for LLM cost budget.
# Assumes llm_cost_usd is a counter and daily_budget_usd is exposed
# as a label-free metric (e.g. via a recording rule).
- alert: LLMCostBudgetWarning
  expr: sum(increase(llm_cost_usd[1h])) > 0.8 * scalar(daily_budget_usd / 24)
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "LLM cost at 80% of daily budget burn rate"
    description: 'Hourly LLM spend is ${{ $value | printf "%.2f" }}, above 80% of the hourly budget allocation'

Critical Alert (100% of daily budget)
At 100%, you need an action. Options: route traffic to a cheaper model, enable stricter caching, or shut off non-critical features. The alert should go to whoever has authority to make that decision.

Anomaly Alert (3x normal burn rate)
Even below budget thresholds, a sudden spike in burn rate is worth investigating. A 3x increase in token volume without a corresponding traffic increase usually indicates a prompt regression, a runaway loop, or a newly deployed feature with inefficient prompting.
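
A Prometheus rule for this tier can follow the same shape as the warning rule above, comparing the last hour's spend to the trailing 24-hour average. A sketch, assuming the same llm_cost_usd counter:

```yaml
# Sketch: anomaly alert that fires when the last hour's LLM spend
# exceeds 3x the trailing-24h hourly average.
- alert: LLMCostAnomalousBurnRate
  expr: >
    sum(increase(llm_cost_usd[1h]))
      > 3 * (sum(increase(llm_cost_usd[24h])) / 24)
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "LLM spend running at 3x the trailing 24h average"
```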

The Per-User Attribution Pattern

For products where end users pay for LLM usage directly (AI coding assistants, document processing tools, chat interfaces), per-user cost attribution serves both FinOps and product analytics.

The implementation approach: generate a unique API key or trace ID per user session, propagate it through all LLM calls, and aggregate spend at the user level in your monitoring backend.
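
The propagation step does not require threading user_id through every function signature — a context variable set once at the request boundary is enough. A sketch, where the `sink` list stands in for your metrics backend:

```python
import contextvars

# Sketch: propagate user identity through the call chain with a
# context variable, so every LLM call can tag spend with a user
# without changing intermediate function signatures.
current_user = contextvars.ContextVar("current_user", default="anonymous")

def record_llm_cost(model, cost_usd, sink):
    sink.append({
        "model": model,
        "cost_usd": cost_usd,
        "user_id": current_user.get(),
    })

# Usage: set the user once at the request boundary.
events = []
current_user.set("user_42")
record_llm_cost("gpt-4o-mini", 0.0004, events)
```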

With Helicone's virtual keys or Portkey's key management, this becomes a configuration step rather than an engineering project. Each user or account gets a scoped API key, and the platform handles the attribution automatically.

Model Routing for Cost Optimization

Cost monitoring reveals the opportunity; model routing captures it. Once you can see exactly how much each feature and user costs, you can make intelligent routing decisions:

  • Route deterministic, short-response queries (classifications, entity extraction, simple Q&A) to smaller, cheaper models
  • Reserve expensive frontier models for tasks that actually require them
  • A/B test quality vs. cost tradeoffs before enforcing routing rules
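
A first pass at rule-based routing can be a lookup table keyed on task type. A sketch — the task labels and routing table are hypothetical and would be tuned against your own quality/cost A/B results:

```python
# Sketch: rule-based model router. Cheap tasks go to a small model;
# everything unrecognized falls through to the frontier model.
ROUTES = {
    "classification": "gpt-4o-mini",
    "extraction": "gpt-4o-mini",
    "simple_qa": "gpt-4o-mini",
}
DEFAULT_MODEL = "gpt-4o"  # frontier model for everything else

def route_model(task: str) -> str:
    return ROUTES.get(task, DEFAULT_MODEL)
```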

Tools like Portkey and Cloudflare Workers AI support rule-based routing out of the box. For more sophisticated routing (LLM-as-a-judge quality gates that decide which model to use), you will need to build that logic yourself.

What Good Looks Like: A Cost Monitoring Maturity Model

Level 0 — No visibility
You get a bill at the end of the month. You have no idea which features, teams, or users drove the costs. This is where most companies start.

Level 1 — Aggregate reporting
You have dashboards showing total spend, token counts, and cost by model. Monthly review cadence. No per-feature or per-user breakdown.

Level 2 — Feature attribution
You can answer "what did Feature X cost last month?" You have logging infrastructure propagating feature metadata through LLM calls. Engineering teams have access to their own cost dashboards.

Level 3 — Real-time alerting and anomaly detection
Budget alerts fire before overruns happen. Anomaly detection catches 3x burn rate spikes within minutes. Finance has access to current-month spend data, not last-month.

Level 4 — Automated cost control
Routing rules automatically shift traffic to cheaper models when budgets are threatened. Cache hit rates above 50%. Semantic caching deployed for repeat queries. Cost per successful outcome is the primary metric, not raw token counts.