When an LLM-based system fails, the failure is almost never a classic exception or timeout. It is a wrong answer that looks right. A refusal that should have been a pass. A confident hallucination that a user acted on. These incidents are harder to investigate, harder to resolve, and harder to prevent — because the system did not break. It worked exactly as designed, producing output that was subtly, dangerously wrong.

Traditional incident postmortems ask two questions: what broke? and how do we fix it? AI incidents require four. This template gives you the structure to answer all four — and the actionable runbook to prevent recurrence.

Why Standard Postmortems Break for AI Systems

Traditional postmortems assume deterministic failure: the same input always produces the same error. Your monitoring system sees a spike in error rate, you find the bad deploy, you roll back. The causal chain is recoverable from logs.

LLM failures are often non-deterministic. A prompt that produces a harmful output at 2 PM may produce a correct answer at 2:05. The same input — same user question, same retrieved context, same temperature setting — produces different responses because of the probabilistic nature of autoregressive generation. A model that passed all pre-deployment tests begins degrading in production because the distribution of real-world inputs differs from your test set. Or a model update changes the refusal behavior for seemingly unrelated query categories.

These failure modes cannot be diagnosed with log analysis alone. You need the four-question AI incident framework.

The Four-Question AI Incident Framework

Before filling out the template below, answer these four questions in order. They form the diagnostic backbone of every AI incident investigation.

Question 1: What was the output, and what should it have been?

Establish the exact gap between observed and expected behavior. Be specific. "The model hallucinated" is not a precise description. "The model returned a case citation for Smith v. Jones (1999) which does not exist in our legal corpus, and three attorneys used this citation in court filings" is specific.

For classification systems: what was the false positive rate on high-risk categories during the incident window? For generative systems: what was the nature of the harmful content — factual hallucination, safety refusal failure, prompt injection, or output format corruption? Separate the symptom from the cause.
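
To make the classification case concrete, here is a minimal sketch that computes false-positive and false-negative rates for a high-risk label over the incident window. The record fields (ts, predicted, actual) are assumptions, not a prescribed schema:

```python
from datetime import datetime

def incident_error_rates(records, window_start, window_end, high_risk="high"):
    """False-positive and false-negative rates for one label inside the incident window."""
    in_window = [r for r in records if window_start <= r["ts"] <= window_end]
    fp = sum(1 for r in in_window if r["predicted"] == high_risk and r["actual"] != high_risk)
    fn = sum(1 for r in in_window if r["predicted"] != high_risk and r["actual"] == high_risk)
    negatives = sum(1 for r in in_window if r["actual"] != high_risk)
    positives = sum(1 for r in in_window if r["actual"] == high_risk)
    return {
        "false_positive_rate": fp / negatives if negatives else 0.0,
        "false_negative_rate": fn / positives if positives else 0.0,
        "n_in_window": len(in_window),
    }

# Illustrative log entries; in practice these come from your prediction store.
logs = [
    {"ts": datetime(2026, 3, 15, 14, 23), "predicted": "low", "actual": "high"},
    {"ts": datetime(2026, 3, 15, 15, 2), "predicted": "low", "actual": "low"},
]
print(incident_error_rates(logs, datetime(2026, 3, 15, 14, 0), datetime(2026, 3, 15, 18, 0)))
```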

Question 2: What changed — in the model, the data, the pipeline, or the traffic?

Identify the proximate cause. For AI systems, the possible change vectors are wider than in traditional infrastructure (a config-diff sketch follows the list):

  • Model change: new model version, fine-tune deployed, system prompt modified, temperature or sampling parameters changed, base model weights updated
  • Data/pipeline change: retrieval index updated, embedding model changed, document ingestion pipeline modified, feature store schema altered, training data distribution shift
  • Infrastructure change: inference engine version updated, GPU/CPU hardware change, network routing change, latency increase in upstream service
  • Traffic change: new user cohort with different query patterns, adversarial input pattern, sudden traffic spike exposing batch size limits
  • Environment change: external API version change, third-party API response format change, time-based context shift (seasonality, news cycle)
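
A quick way to work through this list is to diff the deployment snapshot from before the incident against the current one. A minimal sketch, where the snapshot keys (model_version, safety_threshold, and so on) are hypothetical examples rather than a required schema:

```python
def diff_snapshots(before: dict, after: dict) -> dict:
    """Return every key whose value changed between two flat config snapshots."""
    keys = set(before) | set(after)
    return {k: (before.get(k), after.get(k)) for k in keys if before.get(k) != after.get(k)}

# Hypothetical deployment snapshots; keys are examples, not a required schema.
before = {"model_version": "v2.4.0", "system_prompt_sha": "a1b2c3", "temperature": 0.2,
          "retrieval_index": "2026-03-01", "safety_threshold": 0.7}
after = {"model_version": "v2.4.1", "system_prompt_sha": "a1b2c3", "temperature": 0.2,
         "retrieval_index": "2026-03-01", "safety_threshold": 0.3}

for key, (old, new) in diff_snapshots(before, after).items():
    print(f"{key}: {old} -> {new}")  # e.g. safety_threshold: 0.7 -> 0.3
```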

Question 3: Was the model working correctly given its inputs?

This is the question that trips up engineers trained on traditional systems. The model may have produced a wrong answer even though it was functioning correctly given the information it had. The failure may be in the retrieval pipeline — the model received corrupted or irrelevant context — not in the model itself.

To answer this question, you need to reconstruct the exact context the model received at the time of the incident: the retrieved documents, the system prompt state, the conversation history, the temperature setting. Then determine whether the model's output was a reasonable response to that input, even if the input itself was wrong.
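
This only works if the context bundle was captured at inference time, keyed by request ID, rather than rebuilt after the fact. A minimal sketch, assuming a JSON-lines log and hypothetical helper names (log_inference_context, reconstruct):

```python
import hashlib
import json
from datetime import datetime, timezone

def log_inference_context(path, request_id, system_prompt, retrieved_docs,
                          history, params, output):
    """Append one complete inference context to a JSON-lines log."""
    record = {
        "request_id": request_id,
        "ts": datetime.now(timezone.utc).isoformat(),
        "system_prompt_sha": hashlib.sha256(system_prompt.encode()).hexdigest(),
        "system_prompt": system_prompt,
        "retrieved_docs": retrieved_docs,  # ids and text of everything the model saw
        "history": history,                # prior conversation turns
        "params": params,                  # temperature, top_p, model version, ...
        "output": output,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

def reconstruct(path, request_id):
    """Fetch the exact context a past request received, for replay or review."""
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            if record["request_id"] == request_id:
                return record
    return None
```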

Question 4: Is this a one-off or a systematic regression?

Determine whether this incident represents an isolated event or an emerging pattern. Check: has this specific input category (or semantically similar inputs) produced the same failure before? Has the model's behavior on this output category changed over the past 7 or 30 days? Are other users reporting the same issue?

A single hallucination on a novel query may be acceptable noise. A systematic degradation in refusal behavior for a specific risk category is an emergency.
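
One way to make that call less subjective is to compare the recent failure rate for the affected category against a longer baseline. A minimal sketch, assuming a simple record format and an arbitrary 3x alert ratio:

```python
from datetime import datetime, timedelta

def failure_rate(records, category, since, until):
    """Share of requests in a category marked as failures in [since, until)."""
    rows = [r for r in records if r["category"] == category and since <= r["ts"] < until]
    if not rows:
        return 0.0
    return sum(1 for r in rows if r["failed"]) / len(rows)

def looks_systematic(records, category, now, ratio=3.0):
    """Flag a category when its 7-day failure rate is well above its earlier baseline window."""
    recent = failure_rate(records, category, now - timedelta(days=7), now)
    baseline = failure_rate(records, category, now - timedelta(days=30), now - timedelta(days=7))
    return recent > ratio * max(baseline, 0.01), recent, baseline
```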

The AI Incident Postmortem Template

Copy the section below into your incident management tool (Linear, Jira, Notion) and fill in each field.


INCIDENT TITLE: [Brief description of what failed, who was affected, and when]

Example: Medical AI triage assistant returned incorrect low-risk classifications for 12 high-risk patients over a 3-hour window on 2026-03-15

Date of Incident: [YYYY-MM-DD HH:MM to HH:MM UTC]

Date Postmortem Written: [YYYY-MM-DD]

Severity: [SEV-1 / SEV-2 / SEV-3]

  • SEV-1: Direct user harm, regulatory exposure, or data breach
  • SEV-2: Significant user impact, work stoppage, or service degradation
  • SEV-3: Minor user impact, cosmetic issue, or near-miss

Status: [Open / Mitigated / Resolved]


Section 1: Incident Summary (3-5 sentences)

[Who was affected, what happened, what was the immediate impact, and what was done to stop the bleeding. Focus on facts, not speculation.]

Section 2: Timeline

All times in UTC. Include timezone if different.

  • [HH:MM] — [Event description]
  • [HH:MM] — [Alert fired / user report received]
  • [HH:MM] — [Engineer notified]
  • [HH:MM] — [Investigation started]
  • [HH:MM] — [Root cause identified]
  • [HH:MM] — [Mitigation applied]
  • [HH:MM] — [Incident resolved]

Section 3: Root Cause Analysis — Four Questions

Q1: What was the output vs. what should it have been?

[Describe the specific output produced and the expected correct output. Include exact examples where available.]

Q2: What changed (model, data, pipeline, traffic)?

[List all changes in the 14 days prior to the incident across: model version, system prompt, retrieval pipeline, infrastructure, and traffic patterns. Note which change is the likely cause.]

Q3: Was the model working correctly given its inputs?

[Reconstruct the exact model input context. Evaluate whether the model's output was a reasonable response to that input. If the input was wrong, trace the upstream data pipeline failure.]

Q4: Is this a one-off or a systematic regression?

[Check prior incidents, monitoring dashboards, and user reports. State whether this is an isolated event or part of a pattern. Include specific metrics.]

Section 4: Detection and Alerting

How was the incident detected? [User report / automated alert / internal testing]

Time from failure to detection: [X hours Y minutes]

Was the detection time acceptable? [Yes/No — if no, explain why the monitoring gap existed]

Alert thresholds that should have caught this: [List metrics and their thresholds]

Were those alerts firing? [Yes/No — if no, why not]

Section 5: Mitigation

Immediate mitigation (what stopped the bleeding):

  • [Step taken to prevent ongoing harm]
  • [Step taken to restore correct service]

Duration of service degradation: [X hours Y minutes]

Section 6: Long-Term Remediation

Action items:

Action | Owner | Due Date | Priority
[Describe action] | [Assignee] | [YYYY-MM-DD] | [P1/P2/P3]

Section 7: Monitoring and Prevention — Production Runbook

Use this checklist to prevent recurrence. Implement the relevant checks based on the incident type:

For Hallucination/Factual Error Incidents

  • [ ] Ground-truth validation on high-stakes outputs (set up automated factual consistency checks against trusted knowledge base)
  • [ ] Confidence score thresholds — route outputs below threshold to human review
  • [ ] Output sampling in production — periodically sample and manually review outputs for factual accuracy
  • [ ] Structured output enforcement — use JSON mode or constrained decoding to reduce free-form hallucination surface
  • [ ] RAG citation verification — validate that every citation the model emits exists in the source corpus before the answer is returned (see the sketch below)
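
A minimal sketch of that citation check, assuming a bracketed [doc:id] citation format and an in-memory set of corpus document IDs; adapt both to your own citation convention and document store:

```python
import re

# Example citation convention: the model tags sources as "[doc:<id>]".
CITATION_PATTERN = re.compile(r"\[doc:([A-Za-z0-9_-]+)\]")

def verify_citations(answer, corpus_ids):
    """Return (ok, missing): ok is False if any cited id is absent from the corpus."""
    cited = CITATION_PATTERN.findall(answer)
    missing = [c for c in cited if c not in corpus_ids]
    return len(missing) == 0, missing

ok, missing = verify_citations(
    "The holding is summarised in [doc:smith-v-jones-1999].",
    corpus_ids={"case-1042", "case-2210"},
)
if not ok:
    print("Blocked: citations not found in corpus:", missing)  # route to human review
```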

For Retrieval/Context Failure Incidents

  • [ ] Retrieval quality monitoring — track retrieval precision/recall on a synthetic evaluation set (see the sketch after this checklist)
  • [ ] Document freshness alerts — notify when knowledge base hasn't been updated in X days
  • [ ] Chunking boundary checks — ensure retrieved chunks don't span irrelevant section breaks
  • [ ] Embedding drift detection — compare embedding space for known queries before/after index updates
  • [ ] Context completeness checks — verify that retrieved context actually answers the user's query before passing to model
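
A minimal sketch of the retrieval-quality check, assuming a small synthetic evaluation set that maps queries to the document IDs that should be retrieved; the retrieve argument stands in for your own search call:

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Precision and recall of the top-k retrieved document ids against the expected set."""
    top_k = list(retrieved_ids)[:k]
    hits = len(set(top_k) & set(relevant_ids))
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

# Synthetic evaluation set: query -> ids of documents that should be retrieved.
EVAL_SET = {
    "chest pain radiating to left arm": {"triage-protocol-cardiac", "symptom-guide-12"},
}

def run_retrieval_eval(retrieve):
    """retrieve(query) should return an ordered list of document ids from your retriever."""
    for query, relevant in EVAL_SET.items():
        p, r = precision_recall_at_k(retrieve(query), relevant)
        print(f"{query!r}: precision@5={p:.2f} recall@5={r:.2f}")
```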

For Refusal/Output Safety Incidents

  • [ ] Safety output monitoring — track refusal rates by risk category; alert on sudden drops (see the sketch after this checklist)
  • [ ] Adversarial input probing — regularly test known jailbreak patterns against deployed model
  • [ ] Red team schedule — run structured red team exercises quarterly and after any model change
  • [ ] System prompt integrity checks — validate system prompt hasn't drifted from the known-good baseline
  • [ ] Output filtering regression tests — run regression suite against known-harmful outputs after any model or pipeline change
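
A minimal sketch of the refusal-rate check, the kind of monitor that would have caught the threshold change in the example incident below. The per-category baselines and the 0.5x alert ratio are assumptions to tune against your own traffic:

```python
from collections import defaultdict

# Historical per-category refusal rates; assumed values, tune to your own traffic.
BASELINE_REFUSAL_RATE = {"cardiac": 0.12, "medication": 0.08}

def refusal_rates(events):
    """events: iterable of {'category': str, 'refused': bool} for the current window."""
    totals, refusals = defaultdict(int), defaultdict(int)
    for e in events:
        totals[e["category"]] += 1
        refusals[e["category"]] += e["refused"]
    return {c: refusals[c] / totals[c] for c in totals}

def refusal_drop_alerts(events, min_ratio=0.5):
    """Alert when a category's refusal rate falls below half of its baseline."""
    alerts = []
    for category, rate in refusal_rates(events).items():
        baseline = BASELINE_REFUSAL_RATE.get(category)
        if baseline and rate < min_ratio * baseline:
            alerts.append(f"{category}: refusal rate {rate:.1%} vs baseline {baseline:.1%}")
    return alerts
```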

For Model Degradation/Drift Incidents

  • [ ] Production evaluation pipeline — continuously evaluate model on a golden dataset in production traffic
  • [ ] Canary analysis — split traffic and compare model versions; alert on statistically significant quality differences
  • [ ] Behavioral drift metrics — track token distribution, response length, and refusal rate distributions over time (see the sketch after this checklist)
  • [ ] Data distribution monitoring — track query type distribution and alert on shifts from baseline
  • [ ] Shadow mode evaluation — run new model in shadow mode before full traffic switch
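
A minimal sketch of one behavioral drift metric: a population-stability-index style comparison of response-length distributions between a baseline window and the current window. The bucket edges and the 0.2 alert threshold are common rules of thumb, not values mandated by any particular tool:

```python
import math

def psi(baseline, current, edges):
    """Population-stability-index style score for two samples bucketed by the given edges."""
    def bucket_fracs(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v > e for e in edges)] += 1
        total = max(len(values), 1)
        return [max(c / total, 1e-6) for c in counts]  # floor to avoid log(0)
    b, c = bucket_fracs(baseline), bucket_fracs(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))

baseline_lengths = [180, 210, 195, 240, 205, 188]   # tokens per response, last stable week
current_lengths = [60, 75, 58, 90, 66, 72]          # responses suddenly much shorter
if psi(baseline_lengths, current_lengths, edges=[100, 200, 300]) > 0.2:
    print("Behavioral drift alert: response length distribution shifted")
```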

For Infrastructure/Performance Incidents

  • [ ] Throughput SLOs — alert when tokens/second drops below baseline by more than 20%
  • [ ] GPU memory monitoring — alert on KV cache hit rate degradation (cache thrashing)
  • [ ] TTFT/TPOT tracking — alert when time-to-first-token or time-per-output-token exceeds the p99 baseline (see the sketch after this checklist)
  • [ ] Inference engine health checks — restart or failover when error rate exceeds 1%
  • [ ] Batch size monitoring — alert on unexpected batch size changes that indicate scheduler issues
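
A minimal sketch of the TTFT check, assuming a fixed baseline p99 and a 10% tolerance; in practice both would come from your metrics store:

```python
import statistics

BASELINE_TTFT_P99_MS = 450.0  # assumed baseline; in practice, pull from your metrics store

def p99(samples):
    """99th percentile of a list of latency samples (needs at least a handful of points)."""
    return statistics.quantiles(samples, n=100)[98]

def ttft_alert(recent_ttft_ms, tolerance=1.10):
    """True when the current p99 time-to-first-token is more than 10% above baseline."""
    return p99(recent_ttft_ms) > tolerance * BASELINE_TTFT_P99_MS
```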

Real Example: Medical AI Triage Assistant Incident

Here is a filled-out postmortem for a real-category incident: a medical AI triage assistant that began classifying high-risk patients as low-risk after an undocumented threshold change.


INCIDENT TITLE: Medical AI triage assistant returned "low-risk" classification for 12 patients with high-risk cardiac symptoms over 3-hour window

Date of Incident: 2026-03-15 14:23 to 17:41 UTC

Severity: SEV-1

Status: Resolved


Section 1: Incident Summary

Between 14:23 and 17:41 UTC on March 15, the AI triage assistant misclassified 12 patients as low-risk when they presented with symptoms consistent with acute cardiac events (chest pain with radiating symptoms, shortness of breath, diaphoresis). The error was caused by a model update deployed at 14:00 UTC that changed the safety scoring threshold for the high-risk category from 0.7 to 0.3 without corresponding documentation in the deployment record. The incorrect classifications led to delayed nurse assessments for affected patients. No adverse patient outcomes occurred — a charge nurse caught the pattern 23 minutes after the first misclassification.

Section 2: Timeline

  • 13:55 — Model update v2.4.1 deployed to production (change: "adjusted safety thresholds")
  • 14:00 — Deployment complete, traffic switched to new model
  • 14:23 — First misclassified patient encounter logged
  • 14:46 — Charge nurse flags abnormal number of "low-risk" classifications for chest pain presentations
  • 14:52 — On-call engineer paged
  • 15:10 — Root cause identified: safety threshold parameter changed from 0.7 to 0.3
  • 15:18 — Rollback to v2.4.0 initiated
  • 15:31 — v2.4.0 confirmed live, traffic restored
  • 17:41 — All 12 affected patient records reviewed, none required escalation beyond nursing assessment

Section 3: Root Cause Analysis — Four Questions

Q1: Output vs. Expected: The model returned a risk classification of "low-risk, routine follow-up" for 12 patients whose symptoms (chest pain radiating to left arm, diaphoresis, reported shortness of breath) should have been classified as "high-risk, immediate nurse assessment required" per the triage protocol. The expected output is a specific structured JSON with risk_level: "high" and recommended_action: "immediate_nurse_assessment".

Q2: What changed: The model deployment v2.4.1 included a configuration change to the safety scoring threshold, lowering it from 0.7 to 0.3. This change was intended to reduce false refusals (the model was over-refusing on ambiguous cases) but was not captured in the deployment changelog, not reviewed by the medical safety team, and not validated against the triage evaluation dataset before deployment. The change was introduced in a config file that bypassed the standard ML deployment review gate.

Q3: Was the model working correctly given its inputs: Yes. The model received correct symptom descriptions in the structured input. Its output was a correct low-risk classification given its internal threshold of 0.3 — the failure was in the threshold configuration, not in the model inference itself.

Q4: One-off or systematic: Systematic for the 3-hour deployment window. All 12 high-risk misclassifications occurred after the threshold change was deployed. Pre-deployment false refusal rate was 12%. During the incident window, false refusal rate dropped to 0% — which should have been an immediate signal that the safety behavior had changed, not just improved.

Section 4: Detection and Alerting

Detection: Charge nurse manual review (user report)

Time to detection: 23 minutes

Acceptable? No — a 23-minute detection delay is not acceptable for medical triage. Automated detection should have caught this within 2 minutes.

Alerts that should have caught this: Safety refusal rate monitoring (expected: 10-15%, observed: 0%), high-risk classification rate (expected: 8-12%, observed: 0%)

Were alerts firing? No — refusal rate monitoring was not implemented in production.

Section 5: Mitigation

Immediate: Rolled back model to v2.4.0. Disabled automated triage classifications pending safety review. All 12 affected patients were reviewed by a charge nurse; no escalations were required beyond standard nursing assessment.

Degradation duration: 3 hours 18 minutes

Section 6: Long-Term Remediation

Action | Owner | Due Date | Priority
Implement safety refusal rate monitoring in production with 5-minute alert | ML Ops | 2026-03-22 | P1
Require medical safety team sign-off on all model config changes | ML Lead | 2026-03-18 | P1
Add model config parameters to the ML deployment review gate | Engineering | 2026-03-25 | P2
Implement shadow mode for medical triage model updates (run new version in parallel before full switch) | ML Platform | 2026-04-01 | P2

Tools for Automated AI Incident Detection

The fastest way to close the detection gap is automated monitoring. The following guides cover the observability techniques that catch the failure patterns described in this template:

For structured output validation: LLM Observability Guide covers how to implement output schema validation at scale — enforce that critical fields like risk classification match expected values.

For retrieval quality monitoring: RAG Observability covers retrieval precision/recall tracking, context completeness scoring, and citation verification pipelines.

For production model evaluation: LLM Evaluation Frameworks covers continuous production evaluation with golden datasets, behavioral drift detection, and regression testing against known failure patterns.


Conclusion

AI incidents are different from traditional software incidents in ways that matter for investigation and prevention. The four-question framework — output vs. expected, what changed, was the model working given its inputs, one-off or systematic — gives you a consistent diagnostic structure that handles the non-deterministic nature of LLM failures.

The template in this article is copy-paste ready. Fill it out after every AI incident, even the minor ones. The discipline of structured postmortems compounds — the more incidents you document, the more patterns you'll see across incidents, and the better your prevention runbooks will become.

The medical triage incident above was caught by a human after 23 minutes. With automated safety refusal rate monitoring, the same incident would have been detected in under 5 minutes. Invest in production evaluation infrastructure before you need it — not after your first SEV-1.