I spent most of 2025 rebuilding reliability infrastructure for AI services, and the hardest part wasn't the monitoring or the runbooks — it was convincing stakeholders that AI services need completely different SLA thinking than REST APIs. A 99.9% uptime clause sounds familiar until you realize your LLM is returning confident hallucinations 2% of the time and your vector DB is silently degrading retrieval quality without throwing a single 500 error.
This guide is what I wish I'd had when I started. It's the framework I now use with every AI infrastructure engagement: how to define SLOs that actually matter for AI services, how to calculate composite SLA math for multi-component pipelines, and the contract clauses that will save you from the model deprecation nightmare your legal team hasn't thought about yet.
Why Traditional SLA Thinking Breaks for AI
Classical SLA contracts were built for deterministic systems. Your web server either returns a 200 or it doesn't. Your database either responds within 200ms or it times out. Binary outcomes, measurable with standard tooling, enforceable through off-the-shelf uptime monitors.
AI services introduce four dimensions that break this model fundamentally.
Non-deterministic output. The same input to an LLM can produce different tokens on different calls. Your API returns 200 OK while the model produces a hallucinated answer. Traditional SLA monitoring catches the 200 but completely misses the failure mode that actually matters to your users.
Token-based cost as a reliability variable. When you're paying per token, a prompt injection attack or a runaway retry loop can turn a functioning service into a five-figure invoice in under an hour. Cost anomaly detection is as critical as latency monitoring — and it has no equivalent in classical SRE.
Multi-component latency chains. A RAG pipeline touches the LLM, the embedding service, the vector database, and the retrieval layer, each with its own latency distribution. The end-to-end p99 isn't the sum of the per-component p99s; it's a percentile of the joint distribution across all of them. Get this wrong in your contract and you'll be paying credits on a service that's technically meeting its stated targets.
Behavioral drift without infrastructure failure. A model can degrade silently — retrieval recall dropping from 87% to 71% over six weeks — while every infrastructure metric looks healthy. Classical SLOs have no concept of this. Your contract needs to.
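To make the multi-component latency point concrete, here is a minimal simulation sketch. The four log-normal distributions and their parameters are invented for illustration (they are not measurements from any real pipeline), and the stages are assumed to run sequentially and independently. It compares the sum of the per-stage p99s with the p99 of the end-to-end latency.

```python
import random
import statistics

# Hypothetical (mu, sigma) log-normal parameters per stage, in log-seconds.
# Illustrative assumptions only, not measured values.
STAGES = {
    "llm": (0.0, 0.5),
    "embedding": (-1.5, 0.6),
    "vector_db": (-2.0, 0.7),
    "retrieval": (-1.0, 0.6),
}


def p99(samples):
    """99th percentile: the last of the 99 cut points returned for n=100."""
    return statistics.quantiles(samples, n=100)[98]


def simulate(n_requests=20_000, seed=7):
    """Return (sum of per-stage p99s, p99 of end-to-end latency)."""
    rng = random.Random(seed)
    per_stage = {
        name: [rng.lognormvariate(mu, sigma) for _ in range(n_requests)]
        for name, (mu, sigma) in STAGES.items()
    }
    # End-to-end latency of request i is the sum of its stage latencies.
    end_to_end = [
        sum(samples[i] for samples in per_stage.values())
        for i in range(n_requests)
    ]
    sum_of_p99s = sum(p99(samples) for samples in per_stage.values())
    return sum_of_p99s, p99(end_to_end)


if __name__ == "__main__":
    sum_of_p99s, e2e_p99 = simulate()
    print(f"sum of stage p99s: {sum_of_p99s:.2f}s, end-to-end p99: {e2e_p99:.2f}s")
```

Under these assumptions the end-to-end p99 lands below the sum of the stage p99s, because the stages rarely hit their worst case on the same request. Correlated slowdowns (a shared network path, for example) would narrow that gap, so a contractual target is better derived from measured end-to-end latency than from adding up component targets.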
Concrete SLO Targets for AI Services
After running AI infrastructure at scale, here's the SLO set I now consider table stakes for production AI services. These aren't vendor-recommended numbers — they're targets I've seen work in real production environments serving millions of requests per day.
LLM API SLOs
Time to First Token (TTFT) p99 under 5 seconds. For synchronous chat APIs, this is the user-visible latency that drives engagement. We benchmarked TTFT p99 across GPT-4o, Claude 3.5 Sonnet, and Gemini 2.0 Flash through a LiteLLM gateway. On warm requests (model already loaded), GPT-4o hit 3.2s p99, Claude 3.5 Sonnet hit 2.8s, and Gemini 2.0 Flash hit 1.9s. For the SLA, we set 5s p99 as the contractual ceiling with 4s as the internal SLO target — giving us headroom before we breach the contract.
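A minimal sketch of how the two thresholds above could be checked against a batch of TTFT samples. The function names and the nearest-rank percentile method are my assumptions for illustration; the 4s and 5s values are the targets stated above.

```python
import math

CONTRACT_CEILING_S = 5.0  # contractual p99 ceiling
INTERNAL_SLO_S = 4.0      # internal p99 target, leaves headroom


def percentile(samples, pct):
    """Nearest-rank percentile; pct is an integer such as 99."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def ttft_status(ttft_seconds):
    """Classify a window of TTFT samples against both thresholds."""
    observed = percentile(ttft_seconds, 99)
    if observed > CONTRACT_CEILING_S:
        return "contract_breach"
    if observed > INTERNAL_SLO_S:
        return "slo_breach"
    return "ok"
```

Keeping the internal check separate from the contractual one means the on-call engineer is alerted while there is still room to act before credits are owed.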
Streaming availability: 99.5%. Non-streaming fallback must be available when streaming fails. We count a streaming failure as any request where no tokens are received within 10 seconds of connection establishment. Our fallback to synchronous response added 800ms of median overhead, an acceptable trade-off for reliability.
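The failure definition above translates into a small availability calculation. This is a sketch: the `StreamAttempt` record and its field name are hypothetical, standing in for whatever your gateway logs per request.

```python
from dataclasses import dataclass
from typing import Optional

FIRST_TOKEN_TIMEOUT_S = 10.0
STREAMING_SLO = 0.995


@dataclass
class StreamAttempt:
    # Seconds from connection establishment to first token; None if no token arrived.
    first_token_delay_s: Optional[float]


def is_streaming_failure(attempt, timeout_s=FIRST_TOKEN_TIMEOUT_S):
    """A failure is any attempt with no token inside the timeout."""
    delay = attempt.first_token_delay_s
    return delay is None or delay > timeout_s


def streaming_availability(attempts):
    """Fraction of attempts that received a first token in time."""
    if not attempts:
        raise ValueError("no attempts in window")
    failures = sum(1 for a in attempts if is_streaming_failure(a))
    return 1.0 - failures / len(attempts)


def meets_streaming_slo(attempts, slo=STREAMING_SLO):
    return streaming_availability(attempts) >= slo
```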
Token throughput: minimum 50 tokens/second for requests longer than 512 tokens. This prevents SLA gaming where vendors return fast but truncated responses.
Cost Anomaly Detection SLA
This is the SLA dimension that most teams overlook until they've been burned. Our contract specifies: cost anomaly alert within 15 minutes of spend exceeding 120% of the rolling 7-day daily average. That's a 20% spike trigger.
In practice, this requires instrumentation that most vendors don't provide out of the box. We built a lightweight cost monitoring sidecar that tracks token consumption per API key, per model variant, in 60-second windows. When the moving average exceeds the threshold, it pages the on-call engineer and automatically rate-limits new requests from the affected key. We've caught three runaway retry loops before they became four-figure incidents.
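Here is a rough sketch of the threshold logic, not the sidecar itself. It rests on one interpretive assumption: the 7-day daily average is converted to an expected spend per 60-second window, and a short moving average of recent windows is compared against 120% of that. The class name, the `smoothing_windows` parameter, and that conversion are all mine.

```python
from collections import deque

WINDOWS_PER_DAY = 24 * 60  # number of 60-second windows in a day


class CostAnomalyDetector:
    def __init__(self, daily_spend_history, threshold_ratio=1.2, smoothing_windows=15):
        history = list(daily_spend_history)[-7:]  # rolling 7-day window
        if not history:
            raise ValueError("need at least one day of spend history")
        daily_average = sum(history) / len(history)
        # Expected spend per 60-second window, scaled by the 120% trigger.
        self.threshold_per_window = threshold_ratio * daily_average / WINDOWS_PER_DAY
        self.recent = deque(maxlen=smoothing_windows)

    def observe(self, window_spend):
        """Record one 60-second window; return True if the moving average is anomalous."""
        self.recent.append(window_spend)
        moving_average = sum(self.recent) / len(self.recent)
        return moving_average > self.threshold_per_window
```

In a real deployment you would keep one detector per API key and model variant, and a `True` result would trigger the page and the rate limit. A flat per-window baseline ignores daily traffic cycles, so an hour-of-day baseline may produce fewer false alarms.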
Model Behavior SLOs
These require automated evaluation infrastructure before you can contractualize them:
- Task Completion Rate (TCR): 95% — percentage of requests producing a correct, non-harmful response on our golden dataset. Measured hourly via automated eval pipeline.
- Hallucination Rate ceiling: 3% — responses flagged by our NER-based factual consistency checker against retrieved context.
- Refusal Rate band: 0.5% floor, 5% ceiling. Too few refusals suggests the model is answering requests it should decline (safety risk); too many suggests a behavior regression (quality risk).
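The three behavior SLOs above can be evaluated together once the eval pipeline has produced per-window counts. The sketch below assumes those counts already exist; the `EvalWindow` record and its field names are hypothetical.

```python
from dataclasses import dataclass

TCR_FLOOR = 0.95
HALLUCINATION_CEILING = 0.03
REFUSAL_FLOOR = 0.005
REFUSAL_CEILING = 0.05


@dataclass
class EvalWindow:
    total: int         # requests evaluated in the window
    correct: int       # correct, non-harmful responses
    hallucinated: int  # responses flagged by the consistency checker
    refused: int       # responses where the model declined


def behavior_violations(window):
    """Return the names of the behavior SLOs this window violates."""
    if window.total <= 0:
        raise ValueError("empty evaluation window")
    violations = []
    if window.correct / window.total < TCR_FLOOR:
        violations.append("task_completion_rate")
    if window.hallucinated / window.total > HALLUCINATION_CEILING:
        violations.append("hallucination_rate")
    refusal_rate = window.refused / window.total
    if not (REFUSAL_FLOOR <= refusal_rate <= REFUSAL_CEILING):
        violations.append("refusal_rate")
    return violations
```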
SLA Math: Uptime Budgets and Composite Calculation
Once you have individual SLOs, you need to calculate what they mean in aggregate — and how they compose across multi-component pipelines.
Uptime Budget Basics
A 99.9% availability SLA sounds strict, but let's do the math. Over a 30-day month (43,200 minutes), 99.9% allows 43.2 minutes of downtime. That's not much for a distributed system with multiple dependencies. A 99.95% SLA gives you 21.6 minutes — even tighter.
The more useful framing is the error budget. If your SLO is 99.9% over a 30-day window, your error budget is 0.1% of 43,200 minutes = 43.2 minutes. When you've burned through 50% of that budget, you should be in incident mode, not normal operations. When you've burned 100% of the budget on a contractual target, you're in breach of contract.
Here's a quick reference table I use when setting internal targets:
| SLA Target | Monthly Downtime (30-day month) | Annual Downtime (365-day year) |
|---|---|---|
| 99% | 7h 12m | 3 days 15h |
| 99.9% | 43m 12s | 8h 45m |
| 99.95% | 21m 36s | 4h 22m |
| 99.99% | 4m 19s | 52m 34s |
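The error budget arithmetic above is easy to sketch in code. The function names and status labels are my own; the 50% and 100% thresholds follow the text.

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_status(slo, downtime_minutes, window_days=30):
    """Classify how much of the error budget has been burned."""
    burned = downtime_minutes / error_budget_minutes(slo, window_days)
    if burned >= 1.0:
        return "budget_exhausted"
    if burned >= 0.5:
        return "incident_mode"
    return "normal"


if __name__ == "__main__":
    for slo in (0.99, 0.999, 0.9995, 0.9999):
        print(f"{slo:.2%}: {error_budget_minutes(slo):.1f} min per 30 days")
```

For example, 25 minutes of downtime against a 99.9% SLO is more than half of the 43.2-minute budget, so `budget_status(0.999, 25)` returns `"incident_mode"`.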
Composite Service SLA Calculation
Here's where AI pipelines get tricky. A RAG system has at minimum three independent components: the LLM API (say, 99.9% SLA), the vector database (99.95% SLA), and the retrieval layer (99.5% SLA). The composite SLA is the product of their availabilities — assuming independent failures:
Composite = 0.999 × 0.9995 × 0.995 = 0.9935 (99.35%)
That's notably lower than any individual component SLA, and it will be a surprise to stakeholders who approved 99.9% for each service independently. I always present composite calculation upfront in the SLA negotiation to prevent scope creep mid-contract.
The formula generalizes to any number of components in series:
    def composite_sla(sla_list):
        """
        Calculate composite SLA for services in series.
        Each SLA is a float between 0 and 1 (e.g., 0.999 for 99.9%).
        """
        result = 1.0
        for sla in sla_list:
            result *= sla
        return result

    # Example: RAG pipeline
    llm_sla = 0.999         # 99.9%
    vector_db_sla = 0.9995  # 99.95%
    retrieval_sla = 0.995   # 99.5%
    composite = composite_sla([llm_sla, vector_db_sla, retrieval_sla])
    print(f"{composite:.4f}")  # Output: 0.9935

For parallel components (say, multiple model providers where any one can serve the request), the formula inverts. Two providers each at 99.9% give you 1 - (0.001 × 0.001) = 99.9999%. This is the mathematical basis for our multi-provider routing setup: a single provider outage doesn't breach the composite SLA.
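The parallel case can be sketched the same way as the series case. The helper names below are mine, and the calculation assumes provider failures are independent, which shared dependencies (DNS, a common cloud region, your own gateway) can undermine in practice.

```python
import math


def series_availability(availabilities):
    """All components must be up: multiply the availabilities."""
    return math.prod(availabilities)


def parallel_availability(availabilities):
    """Any one component suffices: multiply the failure probabilities."""
    return 1.0 - math.prod(1.0 - a for a in availabilities)


if __name__ == "__main__":
    # Two redundant LLM providers at 99.9% each, in front of the same
    # vector database (99.95%) and retrieval layer (99.5%).
    llm_tier = parallel_availability([0.999, 0.999])
    pipeline = series_availability([llm_tier, 0.9995, 0.995])
    print(f"LLM tier: {llm_tier:.6f}, pipeline: {pipeline:.4f}")
```

Redundancy at the LLM tier lifts the pipeline from roughly 99.35% to roughly 99.45%. After that, the retrieval layer at 99.5% is the component that limits the composite.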
Real-World Contract Clauses
Beyond the standard availability percentages, AI service contracts need clauses that address the specific risks of operating probabilistic, versioned infrastructure. These are the four that have saved us the most grief.
Model Deprecation Notices
Model providers deprecate versions with as little as 30 days' notice. Our contract specifies: minimum 90-day deprecation notice for any model version currently serving production traffic; minimum 180-day notice for models with no published replacement timeline. We learned this the hard way when GPT-4-0613 was deprecated with 6 weeks' notice and we had not yet qualified a replacement.
The clause also specifies that during the deprecation window, the provider must continue serving the old model at the contracted SLA. No sneaky degradation while you're migrating.
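If you track vendor deprecation announcements programmatically, the notice periods above reduce to simple date arithmetic. This is a sketch with a hypothetical function name; the 90-day and 180-day values come from the clause described above.

```python
from datetime import date

NOTICE_WITH_REPLACEMENT_DAYS = 90
NOTICE_WITHOUT_REPLACEMENT_DAYS = 180


def notice_shortfall_days(announced, retired, has_replacement_timeline):
    """Days by which a deprecation notice falls short of the contract (0 if compliant)."""
    required = (
        NOTICE_WITH_REPLACEMENT_DAYS
        if has_replacement_timeline
        else NOTICE_WITHOUT_REPLACEMENT_DAYS
    )
    given = (retired - announced).days
    return max(0, required - given)
```

A six-week notice, for instance `notice_shortfall_days(date(2025, 1, 1), date(2025, 2, 12), True)`, comes back 48 days short of the 90-day minimum.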
Version Support Windows
When a new model minor version drops — say, Claude 3.5 Sonnet v1.20250101 to v1.20250401 — you need a support window for testing and qualification. Our contracts specify: new model versions made available at least 14 days before the previous version enters deprecation; any breaking changes to API contract require 30-day notice and a compatibility shim period.
This prevents the situation where a provider drops a new API version with changed response schemas and your RAG pipeline starts returning garbage because no one planned for the migration.
Latency Credit Schedules
Standard SLA credits for downtime are well-understood. Latency credits are murkier and need precise definition. Our contracts specify:
- TTFT p99 exceeds SLO target by up to 25%: 5% service credit
- TTFT p99 exceeds SLO target by 25-50%: 15% service credit
- TTFT p99 exceeds SLO target by more than 50%: 25% service credit and a root cause analysis within 5 business days
The key is the measurement methodology. We require that latency be measured at the 50th percentile request volume point, excluding warm-up requests and requests that trigger rate limits. Both parties need to agree on the measurement endpoint; we use a dedicated latency probe endpoint that returns a 20-token response.
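The credit tiers above can be expressed as a small lookup. One assumption is mine: the tiers as written overlap at exactly 25% and 50%, and this sketch assigns those boundary values to the lower tier. A real contract should state that explicitly.

```python
def latency_credit(observed_p99_s, slo_target_s):
    """Return (credit_fraction, rca_required) for a measured TTFT p99."""
    if slo_target_s <= 0:
        raise ValueError("SLO target must be positive")
    if observed_p99_s <= slo_target_s:
        return (0.0, False)
    overshoot = (observed_p99_s - slo_target_s) / slo_target_s
    if overshoot <= 0.25:   # exceeds target by up to 25%
        return (0.05, False)
    if overshoot <= 0.50:   # exceeds target by 25-50%
        return (0.15, False)
    return (0.25, True)     # more than 50%: credit plus root cause analysis
```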
Behavioral Regression Clauses
This is the clause that separates AI-aware contracts from boilerplate vendor agreements: if automated evaluation scores on the contracted golden dataset drop more than 2 percentage points below baseline, the vendor must deliver a root cause analysis within 10 business days and either restore performance or provide a credit schedule for affected traffic.
Without this clause, a vendor can push a model update that quietly tanks your task completion rate and you'll have no contractual recourse — because all your infrastructure metrics look fine.
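Detecting the trigger condition is a comparison against stored baseline scores. In this sketch the metric names are placeholders and scores are expressed in percentage points; the 2-point tolerance is the one from the clause above.

```python
def regressed_metrics(baseline, current, tolerance_points=2.0):
    """Return the metrics whose score fell more than tolerance_points below baseline."""
    return sorted(
        name
        for name, baseline_score in baseline.items()
        if baseline_score - current[name] > tolerance_points
    )


if __name__ == "__main__":
    baseline = {"task_completion": 95.0, "faithfulness": 91.0}  # hypothetical scores
    current = {"task_completion": 92.5, "faithfulness": 90.0}
    print(regressed_metrics(baseline, current))
```

Both sides need to agree in advance on which evaluation run establishes the baseline. Otherwise the baseline itself becomes the subject of the dispute.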
SLOs for RAG Pipelines
RAG introduces reliability dimensions that don't exist in traditional LLM serving. The retrieval component has its own quality surface — separate from, but coupled to, the model's generation quality.
Retrieval Recall Targets
Recall measures whether your retrieval system returns the relevant documents. We target 85% recall@10 on our evaluation dataset — meaning the relevant document appears in the top 10 results 85% of the time. Below 80%, user satisfaction scores in our A/B tests dropped sharply.
Monitor this with an eval dataset of 500 query-document pairs, resampled quarterly from production traffic and held fixed between refreshes. Calculate recall@10 weekly and alert if it drops below the 80% floor.
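A minimal sketch of the recall@10 calculation, assuming one relevant document per query. The `retrieve` argument is a hypothetical callable that returns document IDs in ranked order; substitute your own retrieval client.

```python
RECALL_TARGET = 0.85
RECALL_FLOOR = 0.80


def recall_at_k(eval_pairs, retrieve, k=10):
    """Fraction of (query, relevant_doc_id) pairs whose document is in the top k."""
    if not eval_pairs:
        raise ValueError("empty eval dataset")
    hits = sum(
        1 for query, relevant_doc_id in eval_pairs
        if relevant_doc_id in retrieve(query)[:k]
    )
    return hits / len(eval_pairs)


def recall_alert(recall, floor=RECALL_FLOOR):
    """True when recall has dropped below the alerting floor."""
    return recall < floor
```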
Context Utilization Rate
Context utilization measures how much of the retrieved context the model actually uses in its response. Low utilization (below 60%) typically means either the retrieved documents are irrelevant to the query or your chunking strategy is misaligned with the model's context window.
Track this by comparing retrieved context tokens against the model's actual output references — we use an attention attribution proxy that flags when the model generates tokens that don't correlate with high-attention spans over the provided context. Target: 70%+ context utilization on factual QA workloads.
Answer Faithfulness
Faithfulness measures whether the model's answer is supported by the retrieved context — the anti-hallucination metric. We target 90%+ faithfulness on our evaluation set.
Measurement requires a factual consistency checker — we use an NER-based extraction pipeline that identifies claimed entities in the response and cross-references them against the retrieved documents. If more than 10% of entities in a response are unsupported, that response is flagged as unfaithful.
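The flagging rule above can be sketched without committing to a particular NER library. Entity extraction is assumed to happen upstream, and the case-insensitive substring match used here is a deliberately crude stand-in for real cross-referencing. All names are hypothetical.

```python
MAX_UNSUPPORTED_FRACTION = 0.10
FAITHFULNESS_TARGET = 0.90


def is_faithful(response_entities, context_text, max_unsupported=MAX_UNSUPPORTED_FRACTION):
    """A response is unfaithful if too many of its entities are absent from the context."""
    entities = {entity.lower() for entity in response_entities}
    if not entities:
        return True  # nothing claimed, nothing to contradict
    context = context_text.lower()
    unsupported = [entity for entity in entities if entity not in context]
    return len(unsupported) / len(entities) <= max_unsupported


def faithfulness_rate(samples):
    """samples: iterable of (response_entities, context_text) pairs."""
    samples = list(samples)
    if not samples:
        raise ValueError("empty evaluation set")
    return sum(is_faithful(entities, context) for entities, context in samples) / len(samples)
```

Substring matching misses aliases and paraphrases ("NYC" versus "New York City"), so a production checker would need entity normalization before this comparison means much.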
This is harder to contractualize with a vendor since it depends on your retrieval quality, your chunking strategy, and your prompt design — but the same faithfulness SLO applies to any hosted RAG API you consume.
Putting It Together
The framework I use for every AI service agreement starts with three questions: What can we measure automatically? What can we contractualize credibly? What requires a human-in-the-loop judgment call?
For the measurable parts — availability, latency, cost anomalies, retrieval recall — define precise thresholds, measurement methodologies, and credit schedules before you sign. For the behavioral parts — faithfulness, hallucination rates — invest in the evaluation infrastructure first, then contractualize the results. You can't hold a vendor to a standard you can't measure.
Showing stakeholders the composite SLA math is non-negotiable. A 99.9% availability promise for each component sounds reasonable until you calculate the composite for a five-component pipeline and realize your actual reliability commitment is about 99.5%. Better to set expectations correctly upfront than explain to a customer why their RAG pipeline breached SLA even though every individual service was within spec.
AI reliability is still the wild west — the tools, the metrics, and the contractual norms are all evolving fast. The teams that get this right are the ones treating AI SLOs as a first-class engineering problem, not a legal afterthought. Build the measurement infrastructure, define the SLOs, run the composite math, and negotiate the clauses before you need them. That's the only way to operate AI services with the confidence that comes from knowing exactly what you're promising — and exactly how you'll know if you break it.