Every software incident follows a pattern: something breaks, an alert fires, an engineer investigates, a fix is applied, a postmortem is written. The playbook is well established. LLM incidents break that playbook in ways that trip up even experienced SREs. The failure modes are different. The investigation approach is different. The fix is rarely a one-line code change. And the postmortem needs a different structure entirely.
This article is a practical guide to AI incident postmortems built from three real production failures: a legal RAG system that started hallucinating case citations, a medical AI triage assistant that stopped refusing high-risk queries, and a customer support chatbot that began escalating everything to human agents for no discernible reason. Each case study covers what went wrong, how the team diagnosed it, what they fixed, and the framework they use now for every AI incident.
Why AI Incidents Require a Different Postmortem
Traditional software postmortems ask "what broke?" and "how do we fix it?" AI incidents ask additional questions that traditional systems do not pose: Was the model working correctly even if the output was wrong? Has the model's behavior changed even if nothing in the system changed? Is this a one-off hallucination or a systematic regression?
The non-deterministic nature of LLM outputs means that the same input can produce different outputs at different times, depending on model version, context window state, temperature, and randomness in the sampling process. A traditional incident where the same input always produces the same failure is straightforward to debug. An AI incident where the same input produces a correct answer most of the time but an incorrect answer some of the time requires a fundamentally different diagnostic approach.
The AI postmortem framework we use answers four questions, in order:
- What was the output, and what should it have been? Establish the gap between observed and expected behavior.
- What changed — in the model, the data, the pipeline, or the traffic? Identify the proximate cause.
- Is this an isolated incident or a systematic regression? Determine the scope of the fix.
- What is the detection gap that allowed this to reach production users? Prevent the next one.
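One lightweight way to make the framework operational is to capture every incident in the same structured shape, so no question gets skipped. This is a sketch; the record type and field names are our own, not part of any standard:

```python
from dataclasses import dataclass, field

@dataclass
class AIPostmortem:
    """One record per AI incident, answering the four questions in order."""
    actual_output: str      # Q1: what the system produced
    expected_output: str    # Q1: what it should have produced
    proximate_cause: str    # Q2: what changed (model, data, pipeline, traffic)
    systematic: bool        # Q3: isolated incident or systematic regression
    detection_gap: str      # Q4: the missing signal that let this reach users
    followups: list[str] = field(default_factory=list)

# Example: the legal RAG incident from Case Study 1, summarized as a record.
pm = AIPostmortem(
    actual_output="fabricated case citations with plausible names and docket numbers",
    expected_output="a real citation, or an explicit 'no relevant case law found'",
    proximate_cause="vector index out of sync with a corpus updated six weeks earlier",
    systematic=True,
    detection_gap="no retrieval relevance metric, alert, or ground-truth eval",
)
```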
Case Study 1: The Legal RAG Citation Hallucination
What happened
A law firm's RAG-based research assistant began returning fabricated case citations in its responses. The citations looked legitimate — proper case names, court names, docket numbers — but legal research showed they did not exist. Partners were preparing arguments based on non-existent precedents. The system had been working correctly for eight months before the failure began.
Diagnosis
The investigation started with the standard question: what changed? But in this case, nothing had changed in the application code, the retrieval pipeline, or the model deployment in eight months. The failure was invisible in traditional monitoring — no error codes, no latency spikes, no request failures. The model was generating confident, well-formulated responses that happened to be wrong.
The team applied the four-question framework:
1. What was the output vs. what should it have been? The system returned case citations that did not exist in the legal database. The correct behavior was either returning a real citation or explicitly stating that no relevant case law was found.
2. What changed? Nothing in the application layer. But the legal document corpus had been updated with 14,000 new court filings six weeks earlier. The embedding model had not been retrained on the new documents, meaning queries were being matched against a vector index that did not include the updated corpus. The model, unable to find relevant context in retrieval, began generating plausible-sounding but fabricated citations.
3. Is this isolated or systematic? Systematic. The embedding model was out of sync with the document corpus. Every query that depended on the updated corpus was at risk of hallucinated citations.
4. What was the detection gap? The system had no retrieval quality monitoring. There was no metric tracking the relevance scores of returned chunks, no alert for when average relevance dropped below a threshold, and no evaluation pipeline that checked output accuracy against ground truth.
The fix
Immediate: the embedding pipeline was updated to re-index the entire document corpus, and retrieval quality monitoring was added: a metric tracking the average cosine similarity between query embeddings and returned chunks, with an alert when the rolling average dropped below 0.72 (the team's empirically determined threshold for reliable citation generation).
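A minimal sketch of that rolling-average alert might look like this. The 0.72 threshold is the team's value from above; the window size here is an illustrative assumption:

```python
from collections import deque

class RetrievalQualityMonitor:
    """Rolling-average relevance alert. The 0.72 threshold comes from the
    team's tuning; the query window size is an illustrative assumption."""

    def __init__(self, threshold: float = 0.72, window: int = 500):
        self.threshold = threshold
        self.scores = deque(maxlen=window)

    def record(self, avg_chunk_similarity: float) -> bool:
        """Record one query's average chunk cosine similarity.
        Returns True when the rolling average has dropped below threshold."""
        self.scores.append(avg_chunk_similarity)
        return sum(self.scores) / len(self.scores) < self.threshold

monitor = RetrievalQualityMonitor(window=3)
healthy = monitor.record(0.88)    # rolling avg 0.88, no alert
healthy = monitor.record(0.81)    # rolling avg 0.845, no alert
degraded = monitor.record(0.40)   # rolling avg ~0.70, alert fires
```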
Long-term: a weekly evaluation job was added that runs a set of 50 legal questions with known correct citations through the RAG pipeline and measures citation accuracy. Any drop below 95% accuracy triggers an alert and disables citation generation until the pipeline is re-evaluated.
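The evaluation gate reduces to a small function. In this sketch, `pipeline` and the golden set are stand-ins, not the firm's actual system:

```python
def run_citation_eval(pipeline, golden_set, accuracy_floor=0.95):
    """Run question/expected-citation pairs through the pipeline and gate on
    accuracy. Returns (accuracy, passed); a failing gate should alert and
    disable citation generation, per the fix above."""
    correct = sum(1 for question, expected in golden_set
                  if pipeline(question) == expected)
    accuracy = correct / len(golden_set)
    return accuracy, accuracy >= accuracy_floor

# Toy stand-in: a pipeline backed by a fixed lookup table.
answers = {"q1": "Smith v. Jones", "q2": "Doe v. Roe"}
golden = [("q1", "Smith v. Jones"), ("q2", "Doe v. Roe")]
accuracy, passed = run_citation_eval(lambda q: answers.get(q, ""), golden)
```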
Case Study 2: The Medical AI Triage Refusal Failure
What happened
A medical AI triage assistant that normally refused high-risk queries (suggesting the patient seek immediate care) began providing detailed medical advice for those same high-risk presentations. The system's refusal rate for chest pain queries dropped from 100% to under 5% over a 72-hour period. No code changes had been deployed. The model version was unchanged.
Diagnosis
The team initially suspected a prompt injection attack — an adversarial input designed to override the model's safety instructions. The investigation found no evidence of adversarial inputs in the traffic logs during the failure window. The actual cause was more mundane and more alarming: a routine model version update had been applied to the base model 96 hours before the failure window, and the new version had different refusal behavior characteristics that the system's evaluation pipeline had not caught.
Applying the four-question framework:
1. What was the output vs. what should it have been? The model was providing medical advice for high-risk presentations instead of refusing. The correct behavior was refusal with a suggestion to seek immediate care.
2. What changed? A model version update had been applied to the base inference endpoint. The new version had subtly different refusal behavior — it still refused the most egregious cases but would engage with high-risk queries framed in specific ways. The evaluation suite used to validate model updates had not included adversarial framing tests, so this behavior change slipped through.
3. Is this isolated or systematic? Systematic for queries in the high-risk category that used specific framing patterns. Estimated 0.3% of the query volume was affected, but the risk was severe.
4. What was the detection gap? The evaluation pipeline ran on a weekly cadence and used a static test set. There was no production-time monitoring of refusal rate by risk category, no statistical alert for unexpected changes in refusal behavior, and no canary evaluation on live traffic before full deployment.
The fix
Immediate: the model version was rolled back to the previous version pending a full evaluation. A retrospective analysis identified 847 queries during the failure window that should have received refusals but did not — all were manually reviewed. No adverse patient outcomes were found in this specific case, but the risk was unacceptable.
Long-term: the evaluation pipeline was overhauled. Risk-category refusal rates became a production metric with a statistical process control alert: if the refusal rate for high-risk categories moves more than two standard deviations from the rolling baseline, an alert fires and the model version is automatically held pending review. A mandatory canary evaluation period (24 hours with 5% traffic shadow) was added before any model version update goes to full production.
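The two-standard-deviation hold described above is a textbook statistical process control check. A hedged sketch, assuming a rolling window of recent daily refusal rates is already being collected:

```python
import statistics

def refusal_rate_alert(baseline_rates, current_rate, n_sigma=2.0):
    """Hold the model version when the high-risk refusal rate drifts more than
    n_sigma standard deviations from the rolling baseline. How the baseline
    window is maintained is omitted here and is an assumption."""
    mean = statistics.fmean(baseline_rates)
    sigma = statistics.stdev(baseline_rates)
    return abs(current_rate - mean) > n_sigma * sigma

# A week of stable daily refusal rates for high-risk queries, then the drop.
baseline = [0.99, 0.98, 1.00, 0.99, 0.98, 1.00, 0.99]
hold_model = refusal_rate_alert(baseline, 0.05)  # the 100%-to-under-5% collapse
```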
Case Study 3: The Customer Support Escalation Storm
What happened
A customer support AI chatbot began escalating 100% of queries to human agents — including simple queries like "what is my order status?" The system had been working correctly for three months before this incident. No changes to the application code, the routing logic, or the model configuration.
Diagnosis
This was the most confusing incident to diagnose because the failure mode was the opposite of what the team expected: the model was escalating too much rather than too little. The escalation rate had gone from an expected 8-12% to 100% within a 6-hour window.
Applying the four-question framework:
1. What was the output vs. what should it have been? The model was routing every query to human agents, regardless of complexity. The correct behavior was routing simple queries to the automated system and escalating only complex or sensitive queries.
2. What changed? Traffic analysis revealed a subtle shift in the query distribution: a competitor had published a blog post 48 hours earlier that was driving traffic to the company's website. The new traffic had a different query pattern — more general informational queries with no account context — that the routing classifier was interpreting as higher complexity than the queries it had been trained on. The classifier had not been retrained for this query distribution, so it was defaulting to escalation for everything it couldn't confidently classify as simple.
3. Is this isolated or systematic? Systematic for the new traffic segment. The classifier was operating correctly for the original query distribution but was failing for the new distribution introduced by the competitor's blog post.
4. What was the detection gap? There was no monitoring of escalation rate by query type, no alerting for sudden shifts in escalation rate, and no traffic pattern monitoring that would have caught the shift in query distribution before it cascaded into full escalation.
The fix
Immediate: the routing classifier's confidence threshold was temporarily lowered to force more queries through the automated path while the team investigated. The query distribution shift was identified, and a targeted retraining of the classifier on the new query patterns was completed within 48 hours.
Long-term: escalation rate was broken into sub-metrics by query category, with independent alerting for each category. A traffic pattern monitor was added that tracks the distribution of incoming queries and alerts on statistically significant shifts that might indicate a new traffic source requiring classifier retraining.
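One common way to implement such a traffic pattern monitor is the population stability index (PSI) between a baseline query-category distribution and the current one. The categories and the 0.2 cutoff below are illustrative, not from the incident:

```python
import math

def population_stability_index(expected, observed, eps=1e-6):
    """PSI between two query-category distributions (each sums to 1). A common
    rule of thumb treats PSI > 0.2 as a significant shift; both the categories
    and that cutoff are illustrative here."""
    return sum((o - e) * math.log((o + eps) / (e + eps))
               for e, o in zip(expected, observed))

normal_traffic = [0.6, 0.3, 0.1]  # order-status, billing, general-info
blog_traffic   = [0.2, 0.1, 0.7]  # after the blog post: general-info dominated
psi = population_stability_index(normal_traffic, blog_traffic)
shifted = psi > 0.2               # would fire the retraining alert
```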
The Four-Question AI Incident Framework
Use this framework for every AI incident. It is designed to work for non-deterministic systems where the failure is in the output quality rather than in the system availability.
Question 1: What was the output and what should it have been?
This sounds straightforward but requires precision for AI incidents. "The answer was wrong" is not sufficient. The team needs to establish exactly what output the system produced versus what output the system should have produced. For a RAG system: did it return a fabricated citation, miss relevant context, or provide an answer that was technically correct but incomplete? For a classifier: did it assign the wrong category, the wrong confidence score, or both? The specificity matters because it determines the diagnostic path.
Capture three things: the actual output, the expected output, and the confidence level of the output. An LLM-generated answer with 95% confidence that is wrong is a more serious signal than an answer with 40% confidence that is wrong. High-confidence errors point to model behavior regressions. Low-confidence errors often point to retrieval or context assembly problems.
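That triage heuristic can be captured in a few lines; the confidence cutoffs here are illustrative assumptions, not values from the case studies:

```python
def classify_error(confidence, high=0.8, low=0.5):
    """Triage a wrong answer by the confidence attached to it. High-confidence
    errors point toward model behavior; low-confidence errors point toward
    retrieval or context assembly. Cutoffs are illustrative."""
    if confidence >= high:
        return "suspect model behavior regression"
    if confidence <= low:
        return "suspect retrieval or context assembly"
    return "ambiguous: inspect both"
```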
Question 2: What changed?
For traditional software, this is usually a code change, a deployment, or a configuration update. For AI systems, the list of possible changes is longer and includes things that traditional monitoring does not track:
- Model version update (even if the update was supposed to be behavior-neutral)
- Embedding model retraining (causes retrieval pipeline changes)
- Document corpus or knowledge base updates (causes retrieval context changes)
- Query distribution shift (changes the kinds of inputs the system sees)
- Context window state changes (affects stateful LLM calls)
- Temperature or sampling parameter changes
- Retrieval pipeline changes (chunking strategy, embedding model, vector database index)
- Upstream API behavior changes (when using external LLM providers)
Start with the most common causes: model version changes and retrieval pipeline changes. Work outward from there. Traffic analysis for query distribution shifts should be a standard part of any AI incident investigation.
Question 3: Is this isolated or systematic?
For AI incidents, an isolated incident is rare. Hallucinations that appear to be isolated are often symptoms of a retrieval or model behavior issue that is affecting a larger percentage of queries at lower confidence levels. The investigation should always include a systematic scan: run a representative sample of recent queries through the system and check the output quality against expected behavior. If the error rate in the sample is significantly above baseline, the incident is systematic, not isolated.
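The systematic scan reduces to a one-sided proportion test: is the sampled error rate significantly above baseline? A sketch using the normal approximation, with an illustrative critical value:

```python
import math

def is_systematic(sample_errors, sample_size, baseline_rate, z_crit=2.33):
    """One-sided z-test under the normal approximation; z_crit ~= 2.33 is
    roughly the 99% one-sided critical value. Sample sizes and thresholds
    should be set per application risk profile."""
    p = sample_errors / sample_size
    se = math.sqrt(baseline_rate * (1 - baseline_rate) / sample_size)
    if se == 0:
        return p > baseline_rate
    return (p - baseline_rate) / se > z_crit

# 40 bad outputs in a 1,000-query sample against a 1% baseline: systematic.
systematic = is_systematic(40, 1000, 0.01)
```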
The threshold for "systematic" depends on the risk profile of the application. For a medical AI system, a 0.1% error rate is systematic and requires immediate intervention. For a creative writing assistant, a 5% hallucination rate might be within acceptable bounds. Define these thresholds in your SLOs before incidents happen.
Question 4: What was the detection gap?
Every AI incident that reaches production users reveals a gap in the monitoring or evaluation stack. The answer to this question is what prevents the next incident. The detection gap is usually one of three categories:
Quality signal missing: The system had no metric for the quality dimension that failed. For the legal RAG system, there was no retrieval relevance score metric. For the medical AI, there was no risk-category refusal rate metric. You cannot alert on what you do not measure.
Evaluation lag: The evaluation pipeline ran on a weekly or monthly cadence, but the failure happened between evaluation runs and was only caught when users reported it. Production-time quality monitoring closes this gap.
Statistical process control missing: The system was monitored for absolute thresholds (error rate above X%) but not for statistical changes in behavior (error rate has shifted Y standard deviations from baseline). Statistical process control catches gradual regressions that stay below absolute alert thresholds but represent meaningful behavior changes.
The AI Incident Runbook Template
Use this runbook for every AI incident response. It is designed to complement your existing incident response process, not replace it.
Phase 1: Triage (0-15 minutes)
- Confirm the failure is real: run the specific query that was reported through the system and verify the output
- Establish scope: run a sample of recent queries and measure error rate vs. baseline
- Check for model version changes in the past 7 days
- Check for retrieval pipeline changes in the past 7 days
- Check for traffic distribution shifts in the past 48 hours
- Identify the affected output category and the risk severity
Phase 2: Scope (15-60 minutes)
- If systematic: consider disabling the affected capability (citation generation, medical advice, escalation) pending investigation
- Pull the full query log for the affected time window and characterize the failure rate by query type
- Check the retrieval pipeline's relevance metrics for the affected query types
- If model version change is suspected: initiate rollback evaluation
- If retrieval issue is suspected: check embedding index freshness and chunk relevance scores
Phase 3: Mitigate (1-4 hours)
- Apply the fix: rollback model version, refresh retrieval index, retrain classifier, or adjust routing thresholds
- For high-risk systems: manually review affected queries from the failure window
- Establish a monitoring window: watch quality metrics for 48 hours post-fix
Phase 4: Post-incident (24-72 hours)
- Complete the four-question postmortem
- Identify the detection gap
- Define the new monitoring metric or threshold that would have caught this
- Schedule the monitoring work in the next sprint
- Update the evaluation test suite to include this failure pattern
What Your AI Incident Response Stack Needs
Based on the patterns from these three case studies, the monitoring stack for AI incident response needs four capabilities that are not part of traditional application monitoring:
Production quality sampling: A continuous sample of production queries routed through an evaluation pipeline that checks output quality against expected behavior. The sample rate can be low (1-5%) for high-volume systems but must be statistically representative of the query distribution. Store prompt/response pairs for the sample with evaluation scores attached.
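A deterministic hash-based sampler is one simple way to get a stable 1-5% production sample; the 2% rate and the hashing scheme below are assumptions:

```python
import hashlib

def should_sample(query_id: str, rate: float = 0.02) -> bool:
    """Deterministic sampling: hash the query id into 10,000 buckets so the
    same query is always in or out of the sample, across retries and services.
    The 2% default rate and the hashing scheme are assumptions."""
    bucket = int(hashlib.sha256(query_id.encode()).hexdigest(), 16) % 10_000
    return bucket < rate * 10_000
```

Sampled queries then flow to the evaluation pipeline with their prompt/response pairs and evaluation scores attached.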
Risk-category metrics: Separate monitoring for different risk tiers. Medical advice queries, legal citation queries, and customer data queries each need their own quality metrics and alerting thresholds. A single "hallucination rate" metric across all query types obscures the risk concentrations.
Behavioral change detection: Statistical process control on quality metrics — not just absolute thresholds. An error rate that is stable at 0.5% and then shifts to 0.8% is a meaningful signal even if it stays below any absolute alert threshold you would set. Use control charts on rolling windows of quality metrics and alert on statistically significant shifts.
Retrieval quality monitoring: If your application uses RAG, the retrieval pipeline's quality directly determines output quality. Monitor chunk relevance scores, context utilization rates, and embedding drift. A retrieval quality degradation almost always precedes an output quality degradation; use it as a leading indicator.
Arize AI provides production-grade evaluation with automatic drift detection, statistical process control alerts, and prompt/response tracing that makes AI incident investigation fast instead of painful.
Conclusion
AI incidents are harder to diagnose than traditional software failures because the failure is in the output rather than in the system behavior. A model that is working correctly can produce wrong answers, and a model that is producing wrong answers may show no errors in traditional monitoring systems. The only way to catch these failures is to measure output quality directly, track quality metrics over time with statistical process control, and run evaluation pipelines that catch behavior changes before they reach production users.
The four-question postmortem framework — what was the output, what changed, is it systematic, what was the detection gap — forces the analysis to cover the full surface area of an AI incident. Traditional software postmortems focus on code and configuration. AI postmortems must also cover model behavior, retrieval quality, and query distribution — the dimensions that do not appear in traditional incident reports.
Use the runbook template for every incident. Update it after each postmortem. The detection gap you identify in every incident is the most valuable output of the postmortem — it is the work that prevents the next one.
For related reading on monitoring AI quality in production, see our guides on LLM hallucination monitoring, RAG observability, and SRE best practices for AI systems.