PromptZone - Leading AI Community for Prompt Engineering and AI Enthusiasts

Cover image for Why AI Teams Get No Credit for Prevented Failures
Sofia Fischer
Sofia Fischer

Posted on

Why AI Teams Get No Credit for Prevented Failures

The 2001 paper "Nobody Ever Gets Credit for Fixing Problems That Never Happened" by Nelson Repenning and John Sterman surfaced again on Hacker News with 525 points and 170 comments. It models how organizations create self-reinforcing cycles that starve preventive work.

The core mechanism is a capability trap. Pressure to ship new features reduces time spent on maintenance and testing. Quality drops, firefighting increases, and even less time remains for prevention. The loop locks in.

How the Dynamics Model Works

The authors built a system dynamics simulation with two stocks: Work-in-Process and Organizational Capability. Capability grows only through deliberate investment in process improvement. When managers allocate all resources to throughput, capability erodes.

Simulations showed that a 10% cut in preventive effort produced a 25-40% rise in rework within six months. Recovery required sustained investment above the original baseline for 12-18 months.

Why AI Teams Get No Credit for Prevented Failures

Evidence from the Paper

The model was calibrated on data from a semiconductor equipment manufacturer and a chemical plant. Both sites showed identical patterns: measured productivity rose while actual throughput fell once hidden defects accumulated.

HN commenters noted the same pattern in modern AI deployments where teams deprioritize evaluation harnesses and red-teaming to meet release dates.

How to Apply It in AI Workflows

Run a simple stock-and-flow audit on your current sprint allocation. Track hours spent on:

  • New feature development
  • Post-incident fixes
  • Proactive testing and monitoring

If proactive hours fall below 20% of total engineering time for more than two quarters, the trap is active.

Tradeoffs of Prioritizing Prevention

  • Gains: fewer production incidents and lower long-term cost
  • Costs: slower visible feature velocity for 3-6 months
  • Risk: managers measured on short-term output may still cut the budget

Comparison with Common AI Practices

Approach Prevention Hours Incident Rate Recovery Time
Firefighting culture <10% High 4-8 weeks
Repenning-Sterman allocation 25-30% Low 1-2 weeks
Typical MLOps team 15% Medium 3 weeks

Who Benefits Most

Teams shipping production LLMs or autonomous agents should adopt the framework. Research groups publishing only novel architectures can ignore it. Organizations with quarterly OKRs tied solely to new capabilities will fight the required investment.

Bottom line: The paper supplies a quantitative model showing why AI reliability work remains chronically underfunded unless measurement systems change.

The same dynamics that slowed factory improvement in 2001 now limit safe deployment of increasingly capable models.

Top comments (0)