A true root cause, eliminated, makes the error impossible — if it only reduces frequency, keep digging
Test root cause validity by asking whether eliminating the identified cause would make the error impossible rather than merely less frequent—if only frequency decreases, continue deeper analysis.
Why This Is a Rule
Most root cause analyses stop too early because they reach a cause that's plausible and actionable — but isn't actually the root. The test is simple and binary: if you eliminated this cause entirely, would the error become impossible, or merely less frequent? If impossible → you've found the root cause. If less frequent → the cause you've identified is a contributing factor, not the root. The actual root is deeper.
"Why did the deploy fail?" → "Because the test suite didn't catch the regression." If you eliminated test suite gaps entirely (perfect tests), would deploys never fail? No — deploys could still fail from infrastructure issues, configuration drift, or timing problems. The test gap is a contributing factor. Continue digging: why was this specific regression not covered? Why was the coverage gap not detected? Each "why" reaches deeper until you find the cause whose elimination would make this specific error category impossible.
The distinction between "impossible" and "less frequent" is the difference between a root fix and a band-aid. Band-aids reduce error rate; root fixes eliminate the error category. Both have value, but confusing them means you invest root-fix effort in band-aid outcomes.
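One way to make the elimination test concrete is to treat an error as a boolean function of its possible causes, then check every scenario in which the candidate cause is absent. The sketch below is illustrative only: the factor names, `deploy_fails`, and the second model `gated_failure` are hypothetical toy models, not a description of any real system.

```python
from itertools import product


def passes_elimination_test(error_occurs, factors, candidate):
    """Return True if the error is impossible whenever `candidate` is absent."""
    others = [f for f in factors if f != candidate]
    # Enumerate every combination of the remaining factors.
    for values in product([False, True], repeat=len(others)):
        scenario = dict(zip(others, values))
        scenario[candidate] = False  # the candidate cause is fully eliminated
        if error_occurs(scenario):
            return False  # error still possible: contributing factor only
    return True


# Hypothetical model of the deploy example: any one factor can fail a deploy.
DEPLOY_FACTORS = ["test_gap", "infra_issue", "config_drift", "timing_problem"]


def deploy_fails(s):
    return (s["test_gap"] or s["infra_issue"]
            or s["config_drift"] or s["timing_problem"])


# Hypothetical second model: every failure path runs through one factor.
GATED_FACTORS = ["unreviewed_change", "test_gap", "config_drift"]


def gated_failure(s):
    return s["unreviewed_change"] and (s["test_gap"] or s["config_drift"])


print(passes_elimination_test(deploy_fails, DEPLOY_FACTORS, "test_gap"))
# prints False: deploys can still fail for other reasons
print(passes_elimination_test(gated_failure, GATED_FACTORS, "unreviewed_change"))
# prints True: no failure path survives without this factor
```

In the first model no single factor passes, which matches the "keep digging" outcome. In the second, one factor gates every failure path, so eliminating it removes the whole error category.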
When This Fires
- During any root cause analysis (Five Whys, fishbone diagram, fault tree)
- When you've identified a candidate root cause and need to verify it's actually root
- When previous "root cause fixes" didn't eliminate the error — they probably weren't actually root
- Complements the Five Whys stopping criterion (Stop Five Whys at structural preventability, not at question five — actionability determines the stopping point) by supplying the verification test
Common Failure Mode
Stopping at the first actionable cause: "We missed the deadline because the requirements changed late." If you prevented late requirement changes, would deadlines never be missed? No — delays could still come from estimation errors, resource conflicts, or technical complexity. Late requirements are a contributing factor, not the root. The root might be: "We committed to deadlines before requirements were stable" — and fixing that makes this specific deadline-miss pattern impossible.
The Protocol
(1) After identifying a candidate root cause, apply the elimination test: "If I completely eliminated this cause, would this type of error become impossible?"
(2) If yes → this is the root cause. Design a structural fix that eliminates it (Recurring errors with the same root cause need structural fixes, not more effort — process changes beat discipline every time).
(3) If only frequency decreases → this is a contributing factor. Useful to address, but continue analysis to find the deeper cause. Ask "why" again about the contributing factor.
(4) You may find multiple root causes for a single error (When 'why' has multiple independent answers, branch the analysis into a tree — multi-causal problems aren't linear chains). Each should pass the elimination test independently for its causal branch.
(5) Don't be discouraged if the root cause is systemic and hard to fix. The point of root cause analysis is to find the true cause, not the convenient one.
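The protocol can be sketched as a walk over a tree of candidate causes. This is a minimal illustration, assuming the analyst has already recorded an elimination-test answer for each cause. The `Cause` class, its field names, and `analyze` are hypothetical names introduced here, and the example tree reuses the deadline case from this note.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cause:
    description: str
    # Analyst's answer to: "If eliminated, would the error become impossible?"
    makes_error_impossible: bool
    # Answers to the next "why" (several entries mean the analysis branches).
    deeper: List["Cause"] = field(default_factory=list)


def analyze(candidate):
    """Sort causes into roots, contributing factors, and branches still open."""
    roots, contributing, open_branches = [], [], []
    stack = [candidate]
    while stack:
        cause = stack.pop()
        if cause.makes_error_impossible:
            roots.append(cause.description)  # step 2: root found, stop here
            continue
        contributing.append(cause.description)  # step 3: keep digging
        if cause.deeper:
            stack.extend(reversed(cause.deeper))  # step 4: follow each branch
        else:
            open_branches.append(cause.description)  # no deeper "why" yet
    return roots, contributing, open_branches


deadline_miss = Cause(
    "Requirements changed late",
    makes_error_impossible=False,
    deeper=[
        Cause("Committed to deadlines before requirements were stable",
              makes_error_impossible=True),
    ],
)

roots, contributing, open_branches = analyze(deadline_miss)
```

Here `roots` holds the early-commitment cause, `contributing` holds the late requirement change, and `open_branches` is empty. A non-empty `open_branches` would signal a contributing factor that still needs another "why".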