What "tokenmaxxing" actually means
On 2026-06-06, The New Stack ran a feature story on Lanai's new Token Tuner product under the headline "Tokenmaxxing is real, expensive & it's spreading: AI budgets are exploding." The piece formalizes a pattern that finance and platform teams have been quietly complaining about since late 2025: enterprises are hemorrhaging money on chat-style LLM usage — Claude.ai, ChatGPT Team, GitHub Copilot Chat, internal agentic tools — but their FinOps tooling only shows per-model or per-user totals, not per-workflow. The unit of spend that the CFO actually cares about is the business process: customer support, sales research, legal review, code review, claims triage, knowledge-base synthesis. None of those map cleanly to a model or a user. They map to a workflow.
The framing Lanai is pushing, and that The New Stack editorial amplified, is that the next layer of LLM FinOps is per-workflow attribution. You route every call through a gateway that stamps a workflow_id label on it, aggregate cost by that label, and then make routing decisions per workflow. A customer-support workflow can run on Haiku; a sales-research workflow needs Sonnet. Without per-workflow attribution, you cannot make that call with any rigor: you are arguing with the CFO about vendor invoices instead of reading a dashboard that shows which workflow is the cost driver.
This article is the implementation guide for that framing. The four existing FinOps articles on this site (LLM FinOps strategies, LLM cost monitoring tools, AI coding agent FinOps, and multimodal cost optimization) all cover how to reduce cost within a workflow. None of them covers how to identify which workflows are the cost drivers in the first place. That is the layer below the per-feature cost table, and it is the one that enterprise budget conversations actually start at.
1. Why per-model and per-user attribution both fail for the budget conversation
The standard LLM cost breakdown that every observability tool ships with looks like this:
- By model: $18,400 on GPT-4o, $9,200 on Claude Sonnet, $4,100 on Claude Haiku, $2,800 on Gemini Flash, $1,200 on OpenAI embeddings, $640 on self-hosted Qwen-Coder.
- By user / engineer / seat: the top 20% of users generate 80% of the spend (a Pareto distribution that holds across every team I have seen instrumented).
Both tables are technically accurate and both are useless to the VP of Engineering or the CFO in a budget meeting. The conversation immediately goes sideways:
- "Why is GPT-4o our biggest line item?" — The model breakdown does not say whether GPT-4o is being used for high-value work (legal review, contract analysis) or low-value work (chat playground, internal tooling experimentation). The dollar number does not differentiate.
- "Why is Engineer A spending $2,000 a month?" — The user breakdown does not say whether Engineer A is doing critical customer-facing work or burning compute on agent loops. The dollar number does not differentiate.
The question that actually needs an answer, every single quarter, is: which business process is burning the budget, and is that process producing more business value than the cost it consumes? Per-model and per-user attribution both fail this question. Per-workflow attribution is the layer that answers it.
The architectural reason both fail is that "model" and "user" are not the dimensions a business process lives in. A customer-support workflow can be executed by twenty different engineers, on three different models, against four different prompt templates, with caching on some calls and not others. The model breakdown averages across the workflow. The user breakdown averages across the workflow. The workflow is the missing join key.
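To make the join-key point concrete, here is a minimal sketch over a handful of made-up call records. The figures, field order, and the `total_by` helper are illustrative assumptions, not data from any real bill. The same rows are grouped three ways, and only the workflow grouping produces one row per business process.

```python
from collections import defaultdict

# Hypothetical call records: (workflow_id, model, user_id, cost_usd).
CALLS = [
    ("customer_support", "gpt-4o", "u_1", 4.00),
    ("customer_support", "claude-haiku", "u_2", 0.50),
    ("customer_support", "gpt-4o", "u_3", 3.50),
    ("legal_review", "gpt-4o", "u_1", 6.00),
    ("internal_chat", "gpt-4o", "u_2", 2.00),
]


def total_by(records, index):
    """Sum cost_usd grouped by the field at position `index`."""
    totals = defaultdict(float)
    for record in records:
        totals[record[index]] += record[3]
    return dict(totals)


by_workflow = total_by(CALLS, 0)  # one row per business process
by_model = total_by(CALLS, 1)     # gpt-4o mixes three different workflows
by_user = total_by(CALLS, 2)      # u_1 mixes support and legal work
```

In this toy data the `gpt-4o` row and the `u_1` row each blend several workflows, which is why neither table can say which process drives the spend.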
2. The implementation pattern: tag every call with a workflow_id
The pattern is the same one Prometheus uses for labels: stamp every event with the dimension you want to aggregate by, and let the time-series database do the grouping. For LLM calls, the label is workflow_id (or workflow, or business_process; pick one and stick to it). The plumbing looks like this:
# Pseudo-code: emit a per-workflow cost signal on every LLM call.
# The workflow_id is the only new label the rest of the stack needs.
# Assumes `client` (an OpenAI-compatible SDK client), `metrics` (a
# StatsD-style client), and PRICING (USD per million tokens, keyed by
# model) are defined elsewhere.
def call_llm(prompt, *, model, workflow_id, user_id):
    response = client.chat.completions.create(model=model, messages=prompt)
    usage = response.usage
    cost = (
        (usage.prompt_tokens / 1_000_000) * PRICING[model]["input"]
        + (usage.completion_tokens / 1_000_000) * PRICING[model]["output"]
    )
    # user_id is left out of the metric tags to keep label cardinality low.
    tags = {"workflow_id": workflow_id, "model": model}
    # Emit a cost counter per workflow_id, per model.
    metrics.incr("llm.cost_usd", value=cost, tags=tags)
    # Emit token counters split by direction: input uses prompt_tokens,
    # output uses completion_tokens.
    metrics.incr("llm.tokens", value=usage.prompt_tokens,
                 tags={**tags, "direction": "input"})
    metrics.incr("llm.tokens", value=usage.completion_tokens,
                 tags={**tags, "direction": "output"})
    return response
The non-negotiable design rule: the workflow_id must be set by the caller, not inferred. If you try to infer workflow from the prompt, the model, the user, or the timestamp, you are doing post-hoc classification, which is brittle and expensive. The application code that initiates the LLM call is the only place that knows "this call is part of the customer-support workflow, ticket #4821." That is where the label gets set.
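One way to keep the label caller-set without threading it through every function signature is to set it once at the entry point of the business process and let nested helpers read it. The sketch below uses Python's standard `contextvars` module. The names `workflow_scope`, `current_workflow_id`, `handle_support_ticket`, and `summarize_ticket` are hypothetical, introduced only for illustration.

```python
import contextvars
from contextlib import contextmanager

# Holds the workflow label for the current request or task.
_current_workflow = contextvars.ContextVar("workflow_id", default=None)


@contextmanager
def workflow_scope(workflow_id):
    """Set the workflow label at the entry point of a business process."""
    token = _current_workflow.set(workflow_id)
    try:
        yield
    finally:
        _current_workflow.reset(token)


def current_workflow_id():
    """Return the caller-supplied label, or fail loudly if none was set."""
    workflow_id = _current_workflow.get()
    if workflow_id is None:
        raise RuntimeError("LLM call made outside a workflow_scope()")
    return workflow_id


def summarize_ticket(ticket_id):
    # Nested helpers read the label instead of guessing it.
    return {"ticket_id": ticket_id, "workflow_id": current_workflow_id()}


def handle_support_ticket(ticket_id):
    # The entry point is the code that knows which process it belongs to.
    with workflow_scope("customer_support"):
        return summarize_ticket(ticket_id)
```

Raising when no scope is active is a deliberate choice in this sketch: an unlabeled call surfaces as an error during development instead of as an "unknown" row on the cost dashboard.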
Once every call carries a workflow_id, the aggregation is a one-line query in whatever backend you use — Prometheus, BigQuery, Snowflake, ClickHouse, even a CSV in pandas:
# Total cost per workflow, last 30 days.
sum by (workflow_id) (
increase(llm_cost_usd_total[30d])
)
That single table is the artifact the CFO and VP Eng have been asking for. It answers "where is the money going?" in the unit they think in: customer support, sales research, code review, claims triage, internal tooling, marketing copy. Everything else in the per-workflow cost stack is detail.
3. Three concrete integrations: OpenLLMetry, Helicone, and Portkey
The pattern is identical across all three tools — the difference is whether the workflow_id rides as an OpenTelemetry attribute, a Helicone custom property, or a Portkey metadata field. All three already support custom-property pass-through; you do not have to wait for a feature.
OpenLLMetry (OpenTelemetry for LLMs)
OpenLLMetry is the open-source OpenTelemetry instrumentation layer maintained by Traceloop. It already supports custom span attributes on every LLM call. The workflow_id rides as a span attribute and ends up in your existing OTel pipeline — no separate vendor, no parallel backend.
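from opentelemetry import trace
# OpenLLMetry ships this as the opentelemetry-instrumentation-openai package.
from opentelemetry.instrumentation.openai import OpenAIInstrumentor

OpenAIInstrumentor().instrument()
tracer = trace.get_tracer(__name__)

# Assumes `client` and PRICING (USD per million tokens) are defined elsewhere.
def call_llm(prompt, *, model, workflow_id, user_id):
    with tracer.start_as_current_span("llm.call") as span:
        span.set_attribute("workflow_id", workflow_id)
        span.set_attribute("user_id", user_id)
        span.set_attribute("llm.model", model)
        response = client.chat.completions.create(model=model, messages=prompt)
        usage = response.usage
        # Same cost formula as in Section 2.
        total_cost = (
            (usage.prompt_tokens / 1_000_000) * PRICING[model]["input"]
            + (usage.completion_tokens / 1_000_000) * PRICING[model]["output"]
        )
        span.set_attribute("llm.cost_usd", total_cost)
        return response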
The workflow_id ends up on the span. From there it flows into your existing observability backend (Tempo, Jaeger, Honeycomb, Datadog APM) and can be exported to your metrics store. The downside: you have to write the cost calculation yourself, and you do not get a pre-built dashboard. The upside: zero vendor lock-in, full control over labels, free.
Helicone
Helicone gives you per-workflow cost dashboards out of the box. Route your OpenAI, Anthropic, or any OpenAI-compatible API call through the Helicone proxy, pass workflow_id as a custom property on every request, and the dashboard aggregates cost by that label automatically. Cache hit analytics, per-workflow budget caps, and Slack/PagerDuty alerting on burn-rate anomalies are included. Free tier covers 100K events/month, enough for a small team to start with one workflow, prove the pattern, then expand.
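As a rough sketch of what "pass workflow_id as a custom property" looks like on the wire: to my understanding of Helicone's documentation, authentication travels in a `Helicone-Auth` header and custom properties travel in `Helicone-Property-<Name>` headers on requests sent to the proxy base URL. The helper name `helicone_headers` and the property name `Workflow-Id` are my own choices, so check header names and the base URL against the current Helicone docs before relying on them.

```python
# Proxy base URL for OpenAI-compatible traffic (verify against Helicone docs).
HELICONE_OPENAI_BASE_URL = "https://oai.helicone.ai/v1"


def helicone_headers(helicone_api_key, workflow_id, **extra_properties):
    """Build per-request headers carrying auth and custom properties."""
    headers = {
        "Helicone-Auth": f"Bearer {helicone_api_key}",
        # The workflow label rides as a custom property.
        "Helicone-Property-Workflow-Id": workflow_id,
    }
    # Any further properties (ticket ID, environment, ...) use the same prefix.
    for name, value in extra_properties.items():
        headers[f"Helicone-Property-{name}"] = str(value)
    return headers


example = helicone_headers("<helicone-key>", "customer_support", Ticket=4821)
```

With the OpenAI Python SDK, the resulting dictionary can be passed as `default_headers=` when constructing a client that points at the proxy base URL, or as `extra_headers=` on an individual `chat.completions.create` call.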
Portkey
Portkey is the AI gateway layer that sits between your application and your LLM provider. It exposes a metadata field on every request that is passed through to logs, traces, and cost analytics. The workflow_id rides as a metadata key, and the per-workflow dashboard is one of the built-in views in the Portkey console.
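from portkey_ai import Portkey

client = Portkey(api_key="...")

# `prompt` is the user message string, defined elsewhere.
# Per-request metadata is attached via with_options(); confirm the exact
# call against the Portkey SDK docs for the version you run.
response = client.with_options(
    metadata={
        "workflow_id": "customer_support",
        "ticket_id": "4821",
        "user_id": "u_882",
    },
).chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)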
Portkey also gives you routing logic on the workflow_id. You can declare in config: "for workflow customer_support, prefer Claude Haiku; fall back to GPT-4o-mini; cap at $0.05 per request." That routing config is the natural follow-on to the per-workflow attribution — once you can see cost by workflow, you want to control cost by workflow, and the gateway is the right place to do it.
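The routing rule quoted above can be sketched in plain Python to show the decision a gateway makes per request. This is not Portkey's config schema. The `Route` dataclass, the `ROUTES` table, the `pick_model` helper, and the sales-research values are all hypothetical, and a real gateway would express the same idea in its own configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    preferred_model: str
    fallback_model: str
    max_cost_per_request_usd: float


# Hypothetical routing table keyed by workflow_id.
ROUTES = {
    "customer_support": Route("claude-haiku", "gpt-4o-mini", 0.05),
    "sales_research": Route("claude-sonnet", "gpt-4o", 0.50),
}


def pick_model(workflow_id, estimated_cost_usd, preferred_available=True):
    """Return the model for this call, or None if the request exceeds the cap."""
    route = ROUTES[workflow_id]  # KeyError on unknown workflows is intentional
    if estimated_cost_usd > route.max_cost_per_request_usd:
        return None
    if preferred_available:
        return route.preferred_model
    return route.fallback_model
```

Keeping the table keyed by `workflow_id` means a model change for one workflow is a one-entry edit, which is the property the attribution layer is meant to unlock.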
Portkey is the AI gateway that turns per-workflow cost attribution into per-workflow cost control. Pass workflow_id as metadata on every request, and you get a built-in per-workflow dashboard, configurable routing rules (workflow X prefers model Y at budget Z), automatic fallbacks when caps are hit, and unified cost reporting across OpenAI, Anthropic, and any OpenAI-compatible provider. Generous free tier for small teams.
4. The model-downgrade-by-workflow pattern
Once you have per-workflow attribution, the obvious follow-on optimization is to downgrade the model for workflows that do not need the frontier tier. This is the part where per-workflow attribution pays for itself in days, not months.
The pattern in practice:
| Workflow | Old model | New model | Cost change | Quality bar |
|---|---|---|---|---|
| Customer support (Tier 1 triage) | GPT-4o | Claude Haiku 4.5 | ~85% lower | Acceptable: well-bounded prompts, structured output |
| Internal chat / playground | GPT-4o | GPT-4o-mini | ~95% lower | Acceptable: exploration, not customer-facing |
| Sales research (RAG over CRM) | GPT-4o | Claude Sonnet 4.5 | ~30% lower | Acceptable: quality matters, but Haiku misses the nuance |
| Legal contract review | Claude Sonnet 4.5 | Claude Opus 4 (with cache) | ~200% higher (intentional) | Required: quality gate, low volume |
| Code review (PR diffs) | GPT-4o | Self-hosted Qwen-Coder-32B | ~70% lower | Acceptable for first-pass, escalation to frontier on flagged findings |
| Marketing copy generation | Claude Sonnet 4.5 | Claude Haiku 4.5 | ~85% lower | Acceptable: humans edit before publication |
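The percentages in a table like this come from simple arithmetic on per-token prices and a workflow's typical token mix. The sketch below shows that arithmetic with placeholder prices and token counts. They are not real list prices, and the helper names are hypothetical.

```python
def cost_per_call(input_tokens, output_tokens, price_in, price_out):
    """Cost in USD for one call; prices are USD per million tokens."""
    return (input_tokens / 1_000_000) * price_in + (output_tokens / 1_000_000) * price_out


def percent_change(old_cost, new_cost):
    """Signed percentage change from old to new (negative means cheaper)."""
    return (new_cost - old_cost) / old_cost * 100


# Placeholder prices (USD per million tokens) and a placeholder token mix.
old = cost_per_call(2_000, 500, price_in=2.50, price_out=10.00)
new = cost_per_call(2_000, 500, price_in=0.50, price_out=2.00)
change = percent_change(old, new)
```

With these placeholder numbers the change works out to roughly 80% cheaper per call. The real figure for a workflow depends on its actual input/output ratio, so it is worth computing from the per-workflow token counters and not from list prices alone.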
Two structural details that make this pattern work:
- The downgrade is per-workflow, not per-prompt. A single workflow should run on a single model tier. The moment you start A/B-ing models within a workflow, your quality story becomes a statistical argument and your cost story becomes a debugging session. Pick a tier per workflow, monitor quality with a few hundred labeled examples, and ship the change.
- Quality bars are workflow-specific, not model-specific. A 95% accuracy on customer-support triage is excellent. A 95% accuracy on legal review is unacceptable. The quality bar lives in the workflow definition, not in the model documentation. Track it the same way you track cost: with the workflow_id label, on a per-workflow dashboard.
For teams that have not yet done per-workflow attribution, the model downgrade is effectively impossible. You can downgrade globally (which is what most teams do first, when the bill gets attention), but that hurts the workflows that genuinely need the frontier tier. Per-workflow attribution unlocks the granular optimization.
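The "quality bar lives in the workflow definition" rule can be sketched as a small gate that scores a candidate model's outputs on a labeled set and compares the result with that workflow's own threshold. The thresholds, workflow names, and function names below are hypothetical, and exact-match accuracy stands in for whatever metric a workflow actually uses.

```python
# Hypothetical per-workflow quality bars (minimum accuracy on a labeled set).
QUALITY_BARS = {
    "customer_support_triage": 0.95,
    "legal_review": 0.99,
}


def accuracy(predictions, labels):
    """Fraction of predictions that exactly match the labeled examples."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def downgrade_passes(workflow_id, predictions, labels):
    """True if the candidate model meets this workflow's own quality bar."""
    return accuracy(predictions, labels) >= QUALITY_BARS[workflow_id]
```

The same accuracy score can pass one workflow and fail another, which is the point of keeping the bar in the workflow definition and out of any model-level benchmark.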
5. Langfuse for per-trace cost and the self-hosted path
Helicone and Portkey are the gateway layer — they own the request path. Langfuse is the trace layer — it records every LLM call as a span and lets you query cost per trace, per session, per user. The workflow_id rides as a tag on the trace, and Langfuse's analytics surface aggregates it for you. For teams that already have a gateway choice and want a dedicated cost-and-trace layer on top, Langfuse is the strongest open-source option in 2026.
The implementation is a few lines of the Langfuse SDK. The key call attaches the label to the current observation, for example langfuse_context.update_current_observation(metadata={"workflow_id": ...}) in the v2 Python SDK's decorator API (the entry point differs between SDK versions, so check the docs for the one you run). Once that runs, every cost, latency, and quality metric in the Langfuse UI is filterable by workflow. You get the per-workflow dashboard as a built-in view, and you can export the underlying data to your warehouse for the join with billing data.
The self-hosted path is the reason Langfuse keeps showing up in 2026 stack discussions. It is open source (the core is MIT-licensed), it can be self-hosted with Docker, and it stores everything in databases you run yourself. For teams with compliance or data-residency requirements (the same teams who are deploying on-prem or VPC-hosted models), Langfuse offers a path that does not require sending LLM call metadata to a third-party vendor.
Langfuse is the open-source LLM observability layer with the strongest per-trace cost attribution in 2026. Pass workflow_id as a metadata tag on every observation, and the analytics surface aggregates cost, latency, and quality by that label automatically. MIT-licensed core, self-hostable with Docker, backed by databases you control. For teams that need per-workflow cost visibility without sending metadata to a third-party gateway, Langfuse is the most flexible foundation in 2026.
6. Wiring per-workflow cost into Grafana
For teams already running Grafana for infrastructure observability, the destination for the per-workflow cost data is the same dashboard layer. The integration has three parts: a metrics backend that understands high-cardinality labels, a query that aggregates by workflow_id, and a dashboard that puts the table next to the existing infrastructure cost panels.
- Metrics backend. Prometheus handles this fine for moderate cardinality (a few hundred to a few thousand distinct workflows). Above that, the high-cardinality workhorses (VictoriaMetrics, Mimir, or ClickHouse with a Grafana data source) scale further. The workflow_id rides as a Prometheus label, with the usual cardinality caveat: every distinct value is a new time series, so do not generate new workflow_id values per request (a UUID-per-call is the wrong pattern; a stable workflow identifier is the right one).
- Query. The aggregation is the same one-liner from Section 2, scoped to whatever time window the budget conversation is in:
# Per-workflow cost, last 30 days, sorted.
topk(20, sum by (workflow_id) (increase(llm_cost_usd_total[30d])))
- Dashboard. A bar chart sorted by cost, a per-workflow time series showing burn rate, and a single-stat panel showing the share of total LLM spend held by the top three workflows. That is the budget-conversation view. The other panels (latency, error rate, token ratios per workflow) are for the engineering team, not for the CFO.
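The cardinality caveat in the metrics-backend item can be enforced in code before a label ever reaches the time-series database. The sketch below assumes an allowlist of known workflows and a helper named `metric_label`, both hypothetical. It rejects UUID-shaped values outright and folds anything unrecognized into a single `unknown` bucket so the series count stays bounded.

```python
import re

# Hypothetical allowlist; in practice it would come from the workflow inventory.
KNOWN_WORKFLOWS = {"customer_support", "sales_research", "code_review"}

# Matches the canonical 8-4-4-4-12 hexadecimal UUID layout.
_UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


def metric_label(workflow_id):
    """Map a raw workflow_id to a bounded label value for the metrics backend."""
    if _UUID_RE.match(workflow_id):
        raise ValueError("per-request UUIDs must not be used as workflow_id")
    return workflow_id if workflow_id in KNOWN_WORKFLOWS else "unknown"
```

Per-request identifiers such as ticket or trace IDs still have a home in logs and span attributes, where high cardinality is cheap, while the metric label stays limited to the stable workflow name.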
The dashboard side of this is a one-day build for a Grafana-literate SRE. The harder part is making sure the workflow_id is set consistently across every LLM call site in the codebase. That is a one-month code-quality and instrumentation push, and it is the reason this article leads with the pattern instead of the tooling.
Grafana is the de facto dashboard layer for per-workflow AI cost monitoring. Send OpenTelemetry cost attributes from your LLM gateway (Helicone, Portkey, or self-hosted OpenLLMetry) into Grafana, and you get per-workflow cost dashboards next to your existing infrastructure metrics. Alert on per-workflow burn-rate anomalies the same way you alert on any other infra metric. Free tier covers small teams; self-hosted is free and open-source.
7. A 30-day rollout for per-workflow attribution
If you are starting from zero, the order matters. Compress the work into a single sprint if your tooling is already in place, or plan it as a 30-day project if you are still choosing between gateway options.
Week 1 — Inventory the workflows. The first step is not code. It is a one-page document: a list of the distinct business processes that initiate LLM calls in your organization. Sales research. Customer support. Code review. Internal chat. Marketing copy. Legal review. Knowledge-base synthesis. For each workflow, name an owner, name a quality bar, name a current model. The document is the source of truth for the workflow_id values that the code will use.
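The one-page inventory can also live next to the code as a small, validated registry, so the workflow_id values used at call sites are checked against the same source of truth. The `Workflow` dataclass, the sample entries, and `validate_inventory` below are hypothetical, and the owners, quality bars, and models are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workflow:
    workflow_id: str    # stable identifier used as the label in code
    owner: str          # person or team accountable for the spend
    quality_bar: str    # how "good enough" is judged for this workflow
    current_model: str  # model the workflow runs on today


# Placeholder entries for illustration only.
INVENTORY = [
    Workflow("customer_support", "support-platform",
             "triage accuracy on labeled tickets", "gpt-4o"),
    Workflow("legal_review", "legal-ops",
             "attorney sign-off on sampled contracts", "claude-sonnet"),
]


def validate_inventory(workflows):
    """Reject duplicate IDs and blank fields; return the set of valid IDs."""
    ids = [w.workflow_id for w in workflows]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate workflow_id in inventory")
    for w in workflows:
        if not all([w.workflow_id, w.owner, w.quality_bar, w.current_model]):
            raise ValueError(f"incomplete entry: {w.workflow_id!r}")
    return set(ids)
```

The returned set is a natural input for whatever check guards the call sites in Week 2, so a typo in a workflow_id fails in CI and does not create a new row on the dashboard.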
Week 2 — Instrument the calls. Add the workflow_id label to every LLM call site. The label is set in the application code, not inferred, not guessed. If the code is in three languages, do all three. If the code is spread across 200 microservices, prioritize the 20 microservices that account for 80% of the LLM spend — the long tail can be batched.
Week 3 — Stand up the gateway or the proxy. Route the instrumented calls through Helicone, Portkey, OpenLLMetry, or your existing observability stack. The workflow_id rides as a label, a metadata field, or a span attribute depending on the tool. By the end of the week, you should have a working per-workflow cost query returning real numbers from real traffic.
Week 4 — First model downgrade. Pick one workflow where the quality bar is well-bounded (structured output, narrow task, no edge cases) and downgrade the model. The cost delta should be visible in the per-workflow dashboard within 24 hours. If the quality bar is still met, the optimization is in production. If not, you have a one-line revert and a labeled evaluation set for the next attempt.
After 30 days, you will have the attribution layer that the budget conversation has been asking for, the first workflow downgrade in production, and the data foundation to make the next 5-10 downgrades with confidence. The CFO meeting in Q3 will be a different meeting than the one in Q2.
8. The attribution gap that The New Stack editorial surfaced
The reason the Lanai / The New Stack "tokenmaxxing" framing matters is not that the problem is new; enterprise FinOps teams have been complaining about LLM cost since GPT-4 launched. It is that the conversation has been stuck on the wrong layer. Per-model and per-user attribution have been the only options in the standard observability tools, so the budget conversation has been stuck on questions that those layers can answer: "are we using the right model for this feature?" (yes) and "are we over-licensing seats?" (sometimes). Neither question is the one the CFO is asking.
The question the CFO is asking is "which business process is generating the spend, and is the value of that process worth the cost?" That question requires per-workflow attribution. The implementation is not hard — the workflow_id label is a few lines of code per call site, and the gateway or proxy layer that aggregates it is already a solved category in 2026 (Helicone, Portkey, Langfuse, OpenLLMetry). The reason it has not been done at most companies is that nobody has put the workflow dimension on the standard cost dashboards. The Lanai Token Tuner product is the first commercial attempt to put it there. The open-source and gateway-based alternatives are the faster path for most teams.
The companion pieces on this site cover the per-engineer, per-model, per-feature, and per-modal attribution layers. This article covers the layer below those — the per-business-process layer that actually maps to the org chart. Pairing the two gives the complete attribution stack: per-workflow (which business process is the cost driver) → per-feature (which product surface is the cost driver within the workflow) → per-model and per-user (which model and which user generated the cost within the feature). The full stack is what an enterprise budget conversation needs in 2026.
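The drill-down order of that stack can be sketched as a nested rollup: cost is keyed first by workflow, then by feature, then by model, so each level can be opened from the one above it. The records, feature names, and helper functions here are made up for illustration.

```python
from collections import defaultdict

# Hypothetical records: (workflow_id, feature, model, cost_usd).
RECORDS = [
    ("customer_support", "ticket_summary", "gpt-4o", 5.0),
    ("customer_support", "ticket_summary", "claude-haiku", 1.0),
    ("customer_support", "reply_draft", "gpt-4o", 2.0),
    ("sales_research", "account_brief", "claude-sonnet", 4.0),
]


def rollup(records):
    """Nest cost as workflow -> feature -> model for drill-down."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))
    for workflow_id, feature, model, cost in records:
        tree[workflow_id][feature][model] += cost
    return tree


def workflow_total(tree, workflow_id):
    """Total cost of one workflow across all of its features and models."""
    return sum(sum(models.values()) for models in tree[workflow_id].values())


tree = rollup(RECORDS)
```

The budget conversation reads only the top level (`workflow_total`), while the engineering follow-up walks down into features and models within the workflow that was flagged.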
This piece builds on the FinOps wedge established in our LLM FinOps 2026 guide, the tooling deep-dive in LLM Cost Monitoring Tools, the per-engineer and per-PR attribution in AI Coding Agent FinOps, and the per-modal cost pattern in Multimodal LLM Cost Optimization.