Every feedback loop has a shelf life
You built a feedback loop. Maybe it was deliberate — a weekly review, a dashboard metric, a retrospective cadence. Maybe it emerged organically — you started checking a number, and now you steer by it. Either way, it worked. You could see the signal, adjust your behavior, and watch the outcome improve.
Then, gradually, it stopped working. Not with a dramatic failure. Not with an error message. The metric kept moving in the right direction while the thing you actually cared about moved in the wrong one. The dashboard stayed green. The reality turned red. And you didn't notice, because you were watching the dashboard.
This is what feedback loop decay looks like. It is not a theoretical risk. It is the default trajectory of every measurement system that is not actively maintained. Your feedback loops are degrading right now. The question is whether you have a practice for catching the decay before it costs you.
Goodhart's law: the measure that eats itself
In 1975, British economist Charles Goodhart observed a pattern in monetary policy: statistical regularities collapse once you use them for control. His original formulation was dense and academic. In 1997, anthropologist Marilyn Strathern distilled it into the version that stuck: "When a measure becomes a target, it ceases to be a good measure."
This is not a suggestion that metrics are useless. It is a structural claim about what happens to any measurement the moment you attach consequences to it. The act of optimizing for a metric changes the relationship between the metric and the thing it was supposed to represent.
Consider the mechanism. A metric starts as a proxy — an observable signal that correlates with an outcome you care about. Lines of code correlate with engineering output. Test scores correlate with student learning. Body count correlates with military progress. The proxy works precisely because nobody is optimizing for it directly. It is a side effect of the real activity.
The moment you make the proxy a target — tie bonuses to it, report it to stakeholders, use it to rank people — rational actors begin optimizing for the proxy instead of the outcome. Engineers write verbose code. Teachers teach to the test. Soldiers inflate kill reports. The correlation between proxy and outcome breaks, but the proxy keeps going up, so everyone thinks things are working.
Goodhart's law is not about bad people gaming metrics. It is about the structural impossibility of using a proxy as both a measurement and a target simultaneously. The act of targeting destroys the measurement.
Campbell's law: corruption at scale
Social psychologist Donald T. Campbell formalized the institutional version of this pattern in 1979: "The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor."
Campbell was writing about social programs, but his insight applies everywhere consequences attach to numbers. The key addition to Goodhart is the word corrupt. Goodhart describes a measurement going stale. Campbell describes a measurement actively poisoning the system it monitors.
The distinction matters. A stale metric gives you no information. A corrupt metric gives you wrong information that you trust. The second failure mode is categorically worse, because you continue steering by a signal that is now pointing you toward damage.
Campbell himself used education as his case study, and his prediction proved precise. When U.S. schools began tying funding and teacher evaluations to standardized test scores, the scores went up. Learning did not. A 2011 investigation in Atlanta revealed that 178 teachers and principals across 44 schools had systematically altered student answer sheets. The metric improved. The children couldn't read.
The body count and the cross-sell: two case studies in metric corruption
Robert McNamara, U.S. Secretary of Defense from 1961 to 1968, reduced the Vietnam War to a quantitative optimization problem. Unable to measure guerrilla morale, political sentiment, or territorial control in any clean way, he defaulted to the metric he could count: enemy dead. The "body count" became the feedback loop for an entire war.
The result was textbook Campbell's law. Field officers, under pressure to show progress, inflated reports. A post-war survey found that sixty-one percent of officers believed body counts were "often inflated." Worse, the metric incentivized indiscriminate killing — any dead body could be counted as an enemy combatant. The United States won every statistical engagement and lost the war, because the feedback loop had decoupled from the strategic reality it was supposed to track. Daniel Yankelovich later described the pattern as a four-step fallacy: measure what you can, disregard what you can't, presume the unmeasurable is unimportant, then declare it nonexistent.
Fifty years later, Wells Fargo ran the same experiment in retail banking. Leadership tracked "cross-sell ratio" — the number of financial products per customer — as the north-star metric for its Community Bank division. Employees were ranked, compensated, and fired based on this number. Between 2002 and 2016, employees opened approximately 3.5 million unauthorized accounts in customers' names. The cross-sell metric looked exceptional. Customer trust was being systematically destroyed. Wells Fargo paid over $3 billion in settlements and regulatory fines — not because the metric broke, but because it worked exactly as incentivized.
In both cases, the feedback loop was functioning perfectly at the mechanical level. Signals went in, behavior adjusted, numbers moved. The problem was that the signals had detached from the outcomes that mattered. And nobody scheduled a hygiene check.
The AI parallel: benchmark overfitting and eval contamination
If you work with AI systems, you are watching Goodhart's law operate at industrial scale in real time.
Large language models are evaluated against public benchmarks — standardized test suites that measure reasoning, coding ability, factual knowledge, and language understanding. These benchmarks started as useful proxies for model capability. Then labs began optimizing directly for benchmark performance.
The result is predictable. Models that dominate leaderboards routinely underperform in production on tasks the benchmarks were supposed to represent. A model scoring 99% on a contaminated benchmark may struggle with the actual workflow it was deployed for. The benchmarks, designed to measure capability, now measure benchmark performance — a different thing entirely.
The contamination problem is structural, not incidental. Benchmark questions leak into training data. Models memorize answers rather than learning the underlying capability. A 2024 survey on benchmark data contamination documented the scope: test sets from widely used benchmarks appearing verbatim in training corpora, inflating scores without improving the capabilities those scores claim to represent.
The AI community's response illustrates what feedback loop hygiene looks like at the field level. LiveBench releases new questions monthly, drawing on recently published sources to limit memorization. NPHardEval refreshes its data points on a monthly cycle. These are not just better benchmarks — they are benchmarks with built-in hygiene practices, designed to resist the decay that Goodhart's law predicts.
In production ML systems, the same pattern plays out as model drift. A model trained on historical data performs well at deployment. Over months, the real-world data distribution shifts — user behavior changes, market conditions evolve, new patterns emerge that the training data never contained. Without monitoring, models left unchanged for six months or more see error rates climb dramatically. The feedback loop between model output and real-world accuracy degrades silently until something breaks visibly enough to trigger an investigation.
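That silent degradation can be caught with a routine distribution check. The sketch below uses the Population Stability Index, a common drift statistic; the 0.25 threshold is a conventional rule of thumb, and the data here is synthetic:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference sample
    (e.g. training-time data) and a live sample of the same feature."""
    # Bin edges come from the reference distribution's percentiles
    edges = np.percentile(expected, np.linspace(0, 100, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range values
    e_frac = np.histogram(expected, edges)[0] / len(expected)
    a_frac = np.histogram(actual, edges)[0] / len(actual)
    # Clip to avoid log(0) in sparse bins
    e_frac = np.clip(e_frac, 1e-6, None)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 10_000)   # feature distribution at deployment
live  = rng.normal(0.6, 1.2, 10_000)   # shifted production traffic
score = psi(train, live)
# Rule of thumb: PSI > 0.25 signals significant drift
print(f"PSI = {score:.2f}", "-> drift" if score > 0.25 else "-> stable")
```

Run on a schedule against live feature or prediction distributions, the same check turns silent drift into an alert instead of a post-mortem finding.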
The lesson from AI is the same lesson from Vietnam and Wells Fargo: the decay is silent, the consequences are not.
Why feedback loops degrade: four mechanisms
Understanding why loops go stale helps you build maintenance into them from the start.
1. Proxy drift. The relationship between your metric and your outcome was never a law of physics — it was a correlation that held under specific conditions. When conditions change, the correlation weakens. Your "customer satisfaction score" correlated with retention when your product was simple. After you added enterprise features, the users answering the survey are no longer representative of the users who churn. The proxy drifted. The number didn't tell you.
2. Optimization pressure. This is Goodhart's mechanism. The harder you push on a metric, the more the system reorganizes to produce that metric through paths that bypass the outcome. Every shortcut, every workaround, every dark pattern is a rational response to optimization pressure applied to a proxy.
3. Environmental shift. Your feedback loop was calibrated for an environment that no longer exists. The leading indicator that predicted sales in 2023 may be noise in 2026 because the market structure changed. The habit tracker that motivated you when you were building a new routine becomes an empty ritual once the habit is automatic. The loop is still running. The environment moved on.
4. Attention decay. You stop looking at the metric with fresh eyes. It becomes wallpaper. You glance at it, confirm it's in the expected range, and move on. You no longer ask "what is this actually telling me?" You just check whether the number is green. This is the most common failure mode and the hardest to detect, because it feels like everything is working.
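Proxy drift, at least, is measurable before it hurts you: a rolling correlation between the metric and the outcome it stands for shows the decoupling as it happens. A minimal sketch on synthetic data (the window size and the drift breakpoint are illustrative assumptions):

```python
import numpy as np

# Synthetic illustration: a metric that tracks the outcome for the
# first 200 observations, then silently decouples from it.
rng = np.random.default_rng(1)
n = 400
outcome = rng.normal(size=n)
coupled = outcome + rng.normal(scale=0.3, size=n)   # proxy while it works
independent = rng.normal(size=n)                    # proxy after drift
metric = np.where(np.arange(n) < 200, coupled, independent)

def rolling_corr(x, y, window=50):
    """Correlation between metric and outcome over sliding windows."""
    return np.array([np.corrcoef(x[i:i + window], y[i:i + window])[0, 1]
                     for i in range(len(x) - window + 1)])

corr = rolling_corr(metric, outcome)
print(f"while coupled: r ~ {corr[:100].mean():.2f}; "
      f"after drift: r ~ {corr[-100:].mean():.2f}")
```

The catch, of course, is that you need an independent measurement of the outcome to correlate against — which is exactly what the audit questions below force you to keep collecting.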
The hygiene practice: scheduled audits
Feedback loop hygiene is not a philosophy. It is a recurring practice with concrete steps. Here is the protocol.
Monthly: the loop audit. Set a recurring fifteen-minute block. List every metric, signal, or indicator you currently act on — professionally and personally. For each one, answer three questions:
- What behavior does this metric actually incentivize? Not what you want it to incentivize. What does it actually reward? If you're measuring "tickets closed per sprint," it rewards closing easy tickets and splitting work into smaller units. That may or may not align with shipping quality software.
- Is the proxy still correlated with the outcome? When you first chose this metric, it tracked something real. Does it still? What changed in your environment, your goals, or your system since you last verified this?
- If someone were gaming this metric, what would they do? If your answer looks uncomfortably similar to what you're currently doing, the loop has been captured. You are optimizing for the proxy, not the outcome.
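The three questions are easy to operationalize. As a sketch, one record per metric makes the audit repeatable and its verdicts comparable month to month; the field names and verdict rules here are illustrative assumptions, not a standard:

```python
from dataclasses import dataclass, field
from datetime import date

# Hypothetical record for the monthly loop audit.
@dataclass
class LoopAudit:
    metric: str
    incentivizes: str          # what behavior the metric actually rewards
    still_correlated: bool     # does the proxy still track the outcome?
    gaming_looks_like: str     # what someone gaming this metric would do
    we_do_that: bool           # does our behavior resemble the gaming?
    audited_on: date = field(default_factory=date.today)

    def verdict(self) -> str:
        if self.we_do_that:
            return "CAPTURED: optimizing the proxy, not the outcome"
        if not self.still_correlated:
            return "STALE: proxy drifted; retire or recalibrate"
        return "HEALTHY: keep, re-audit next month"

audit = LoopAudit(
    metric="tickets closed per sprint",
    incentivizes="closing easy tickets; splitting work into smaller units",
    still_correlated=True,
    gaming_looks_like="file trivial tickets and close them immediately",
    we_do_that=False,
)
print(audit.metric, "->", audit.verdict())
```

Fifteen minutes a month is enough to fill these in; the value is in being forced to write down the gaming scenario and compare it to your own behavior.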
Quarterly: the replacement review. Some metrics need to be retired, not repaired. Every quarter, ask: which of my current feedback loops should be replaced entirely? A metric that served you in the first six months of a project may be actively misleading in month eighteen. Retiring a metric is not failure. It is maintenance.
On every major change: the recalibration trigger. When your goals shift, your team changes, your market evolves, or your system architecture is restructured, every existing feedback loop is suspect. Treat major changes as mandatory recalibration events. Do not assume your old metrics still apply in the new context.
The meta-loop: feedback on your feedback
The deepest form of feedback loop hygiene is recursive. You need a feedback loop that monitors the health of your other feedback loops. This sounds abstract until you build it.
In practice, it means tracking decisions you made based on metrics and checking whether those decisions produced the outcomes you expected. If you optimized for a metric and the underlying reality improved, the loop is healthy. If you optimized for a metric and nothing changed — or things got worse — the loop is decoupled.
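One way to make this concrete, sketched with hypothetical field names: log each metric-driven decision alongside its expected outcome, then record later whether the outcome actually moved. Metrics that keep producing decisions without producing outcomes are your decoupled loops:

```python
# Minimal decision log for the meta-loop. Each entry records the
# metric that drove a decision, what we expected, and (filled in
# later) whether the underlying outcome actually improved.
decisions = [
    {"metric": "cross-sell ratio", "action": "push product bundles",
     "expected": "higher customer lifetime value", "outcome_improved": False},
    {"metric": "p95 latency", "action": "add regional cache",
     "expected": "fewer timeout complaints", "outcome_improved": True},
]

def decoupled_loops(log):
    """Metrics whose decisions did not move the underlying outcome —
    candidates for the quarterly replacement review."""
    return sorted({d["metric"] for d in log if not d["outcome_improved"]})

print(decoupled_loops(decisions))  # metrics to re-examine this quarter
```

The log does not need tooling; a spreadsheet works. What matters is that the "outcome_improved" column gets filled in honestly, weeks after the decision, by someone looking at reality rather than the dashboard.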
This is the difference between a team that uses metrics and a team that is used by metrics. The first group treats measurement as a tool subject to calibration. The second group treats measurement as truth, and truth doesn't need maintenance.
Your feedback loops are not infrastructure you build once. They are living systems that require the same ongoing attention you give to the processes they monitor. The loops that feel most reliable — the ones you've stopped questioning — are the ones most likely to have drifted.
Check them. On a schedule. Before the dashboard goes green while reality goes red.