You ran an optimization sprint. Now prove it worked.
In L-0576, you learned to dedicate focused time periods to improving a specific agent. Optimization sprints give you the structure and the discipline to make targeted changes. But structure and discipline are not enough. You can sprint with perfect discipline in a direction that makes things worse, and without measurement, you will never know.
The primitive is stark: without a baseline measurement, you cannot know whether your optimization actually improved anything. This is not a suggestion about best practices. It is a logical constraint. Improvement is a comparison between two states — before and after. If you did not record the before state, the comparison cannot exist. You are left with intuition, which is unreliable, and narrative, which is self-serving. The feeling that something got better is not evidence that it did.
This lesson is about establishing the discipline of measurement around every optimization attempt — capturing the before so the after has meaning.
Lord Kelvin's principle and why it persists
In 1883, William Thomson — later Lord Kelvin — delivered a lecture on electrical measurement in which he made a claim that has echoed through every quantitative discipline since: "When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind."
The statement is often shortened to "if you cannot measure it, you cannot improve it." The shortened version is actually more relevant to optimization than Kelvin's original, because it captures the specific relationship between measurement and directional change. You can tinker without measuring. You can modify, adjust, refactor, rewrite. But you cannot know whether those changes constitute improvement without a quantitative reference point.
This is not pedantry. It is the difference between engineering and guessing. Engineering produces predictable, repeatable improvements because it operates on measured quantities. Guessing occasionally produces improvements, but just as often produces regressions disguised as progress. In the context of agent optimization — where systems are complex, outputs are variable, and your intuitions about quality are shaped by recency bias and confirmation bias — the gap between measuring and guessing is enormous.
The before-after design: simplest and most dangerous
The simplest experimental design in research methodology is the one-group pretest-posttest design: measure a system, intervene, measure again, compare. Donald Campbell, in his landmark 1969 paper "Reforms as Experiments," identified this design as the most common and the most vulnerable to false conclusions.
Campbell catalogued the threats.

History: between your before measurement and your after measurement, something else changed. You optimized your agent's prompt, but the API provider also updated their model weights that week. Your improvement might be theirs, not yours.

Maturation: the system you are measuring evolves on its own over time. User behavior shifts. Input distributions change. The improvement you observe might reflect a natural trend that would have occurred without your intervention.

Regression to the mean: if you benchmarked on a day when performance was unusually bad, a subsequent measurement will likely look better regardless of what you did, because extreme values naturally regress toward the average.

Instrumentation: if your measurement method changed between the before and the after — different test cases, different evaluation criteria, different scoring rubric — then the comparison is invalid even if the numbers look conclusive.
Campbell's framework was designed for social science, but the threats apply directly to agent optimization. You changed three things simultaneously — the prompt, the model, and the temperature setting — and performance improved. Which change caused the improvement? You do not know, because you did not isolate variables (a principle covered in L-0567). You measured with 50 test cases before and 200 test cases after. The sample sizes are different, the test distributions may be different, and the comparison is compromised. You measured accuracy before but switched to measuring F1 score after because someone told you F1 is better. You are now comparing two different metrics and calling it improvement.
The before-after design is powerful precisely because it is simple. But its simplicity is deceptive. It works only when the measurement protocol is held constant across both conditions. The same test inputs. The same evaluation criteria. The same scoring method. The same environmental conditions, as far as you can control them. Change the system, hold the measurement constant. This is the entire methodology.
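The hold-the-measurement-constant rule can be sketched in a few lines. Everything here is hypothetical — the test cases, the `exact_match` scorer, and the two toy routing agents are stand-ins for your own system — but the shape is the point: the test set and the scorer are shared between both runs, and only the agent differs.

```python
def run_benchmark(agent, test_cases, score):
    """Run a fixed test set through an agent and return the mean score.

    `test_cases` and `score` must be identical for the before and
    after runs -- only `agent` may differ between the two calls.
    """
    results = [score(case, agent(case["input"])) for case in test_cases]
    return sum(results) / len(results)

# The measurement: held constant across both conditions.
TEST_CASES = [
    {"input": "refund request", "expected": "billing"},
    {"input": "login failure", "expected": "auth"},
]

def exact_match(case, output):
    return 1.0 if output == case["expected"] else 0.0

# The system: the only thing that changes.
def baseline_agent(text):
    return "billing"  # hypothetical "before": routes everything to billing

def optimized_agent(text):
    return "billing" if "refund" in text else "auth"  # hypothetical "after"

before = run_benchmark(baseline_agent, TEST_CASES, exact_match)
after = run_benchmark(optimized_agent, TEST_CASES, exact_match)
print(f"before={before:.2f} after={after:.2f} delta={after - before:+.2f}")
```

If a real change touched `TEST_CASES` or `exact_match` between the two calls, the delta would be meaningless no matter how good it looked.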
What to measure: the benchmark selection problem
Choosing what to benchmark is itself an optimization problem, and getting it wrong undermines everything that follows.
The temptation is to measure what is easy. Latency is trivially measurable — you timestamp the request and the response, subtract, done. Throughput is easy — count the outputs per unit time. Token count is easy. Cost per request is easy. These are all legitimate metrics, but they share a dangerous property: they are all proxies for what you actually care about, which is whether the system produces good results for the people who depend on it.
The metrics that matter most — output quality, correctness, helpfulness, appropriateness of tone, safety of recommendations — are the hardest to measure. They require rubrics, human evaluation, or carefully designed automated scoring. They are expensive. They are slow. And they are the only metrics that tell you whether your optimization made the system genuinely better rather than merely faster or cheaper.
The solution is not to avoid easy metrics. It is to never benchmark only easy metrics. A complete benchmark suite for an agent optimization should include at least three categories:
Efficiency metrics capture how the system uses resources: latency, cost, token usage, API calls per task. These are cheap to measure and important for operational viability.
Accuracy metrics capture whether the system produces correct outputs: routing accuracy, factual correctness, code compilation rate, answer match rate against known-good responses. These require a ground-truth dataset — a set of inputs with known correct outputs — which takes effort to create but can be reused across many optimization cycles.
Quality metrics capture whether correct outputs are also good outputs: helpfulness ratings, coherence scores, completeness assessments, tone appropriateness. These often require human evaluation or carefully calibrated LLM-as-judge pipelines, and they are where most benchmarking efforts fall short.
The discipline is to define all three categories before you start optimizing, measure all three in your baseline, and measure all three after every change. An optimization that improves accuracy while destroying quality is not an improvement. An optimization that reduces latency while degrading accuracy is a tradeoff that should be made consciously, not discovered accidentally three weeks later.
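One way to enforce that discipline in code is to make a benchmark result carry one metric from each category and refuse to call a change an improvement unless no category regressed. The field names and threshold here are illustrative assumptions, not a standard; a minimal sketch:

```python
from dataclasses import dataclass

@dataclass
class BenchmarkResult:
    """One benchmark run, with a metric from each category."""
    latency_ms: float   # efficiency
    accuracy: float     # accuracy, 0..1
    quality: float      # quality, e.g. mean rubric score, 0..1

def is_improvement(before: BenchmarkResult, after: BenchmarkResult,
                   tolerance: float = 0.0) -> bool:
    """Accept a change only if no category regressed beyond tolerance.

    A faster system that answers worse, or a more accurate system whose
    answers read badly, is a tradeoff to be decided consciously,
    not a win to be accepted silently.
    """
    return (after.latency_ms <= before.latency_ms * (1 + tolerance)
            and after.accuracy >= before.accuracy - tolerance
            and after.quality >= before.quality - tolerance)

before = BenchmarkResult(latency_ms=820, accuracy=0.78, quality=0.71)
after = BenchmarkResult(latency_ms=540, accuracy=0.84, quality=0.65)
print(is_improvement(before, after))  # quality regressed, so this fails
```

The example deliberately shows the trap from the paragraph above: latency and accuracy both improved, but the quality drop disqualifies the change until it is examined.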
Goodhart's law: when benchmarks become targets
Charles Goodhart, a British economist advising the Bank of England, observed in 1975 that "any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes." The principle was later restated by Marilyn Strathern in its more famous form: "When a measure becomes a target, it ceases to be a good measure."
Goodhart's law is the shadow side of benchmarking. The moment you define a metric and optimize toward it, you create an incentive to game the metric rather than improve the underlying system. This is not a theoretical risk. It is an empirically documented pattern across every domain where metrics drive decisions.
In software engineering, optimizing for code coverage percentage leads teams to write trivial tests that execute code paths without verifying behavior. The metric improves. The actual reliability of the test suite does not. In machine learning, optimizing for benchmark accuracy on datasets like MMLU or HumanEval leads to models that perform well on those specific benchmarks while failing on structurally similar but novel tasks — a phenomenon called benchmark saturation. In healthcare, optimizing for reduced hospital length-of-stay leads to premature discharges and increased emergency readmissions. The metric looks better. The patients do worse.
The defense against Goodhart's law in agent optimization is methodological diversity. Do not optimize toward a single metric. Maintain a suite of metrics that capture different dimensions of performance. Include metrics that are hard to game — qualitative human evaluations, novel test cases that the optimization process has never seen, real-world outcome tracking that is disconnected from the test environment. When your easy metrics improve but your hard metrics stagnate or degrade, Goodhart's law is likely in effect.
Campbell's law — an amplification of Goodhart's principle formulated by Donald Campbell himself — adds that "the more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor." In the context of agent optimization, this means that the benchmarks you rely on most heavily are the ones most likely to mislead you. Rotate your test cases. Update your evaluation criteria. Introduce novel scenarios. The benchmark must evolve alongside the system it measures, or it will cease to measure anything meaningful.
Continuous benchmarking: from event to infrastructure
The most common benchmarking mistake is treating measurement as a one-time event. You benchmark before an optimization sprint, benchmark after, compare the numbers, and move on. This captures a single before-and-after snapshot, which is better than nothing, but it misses everything that happens afterward.
Systems degrade. Models get updated by their providers. Input distributions shift as user behavior changes. Dependencies break silently. A benchmark result from two weeks ago may no longer reflect the system's current performance, and you will not know unless you measure again.
The modern practice in software engineering is continuous benchmarking — integrating performance measurement into the development pipeline so that every change is automatically evaluated against the baseline. Meta reported catching 92% of performance regressions by moving testing earlier in their CI/CD pipeline. Red Hat has built continuous performance testing frameworks that detect regressions within hours instead of weeks. The principle is the same at every scale: make measurement continuous so that degradation is detected early, when it is cheap to fix, rather than late, when it has already affected users.
For individual agent optimization, continuous benchmarking does not require sophisticated infrastructure. It requires a habit: every time you change the system, run the same test suite and record the results. A spreadsheet with dates, version identifiers, and metric values is a continuous benchmarking system. It is primitive, but it captures the essential information: what changed, when, and what effect it had. This is also the foundation for the optimization logs you will build in L-0578.
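The spreadsheet habit translates directly into a few lines of code. The file path, field names, and version label below are all hypothetical — the only commitment is that every run appends one row with a date, a version identifier, and the same metric columns every time:

```python
import csv
import datetime
import os

LOG_PATH = "benchmark_log.csv"  # hypothetical location for the running log
FIELDS = ["date", "version", "latency_ms", "accuracy", "quality", "notes"]

def record_run(version, latency_ms, accuracy, quality, notes=""):
    """Append one benchmark run to the log, creating the file on first use."""
    new_file = not os.path.exists(LOG_PATH)
    with open(LOG_PATH, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow({
            "date": datetime.date.today().isoformat(),
            "version": version,
            "latency_ms": latency_ms,
            "accuracy": accuracy,
            "quality": quality,
            "notes": notes,
        })

record_run("v1.3-prompt-rewrite", 540, 0.84, 0.71, "shortened system prompt")
```

That single append-only file is a continuous benchmarking system in the sense the paragraph describes: primitive, but sufficient to answer "what changed, when, and what effect did it have."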
The LLM benchmark problem: a cautionary parallel
The current state of large language model evaluation offers a stark illustration of what happens when benchmarking methodology fails to keep pace with the systems it measures.
LLM benchmarks face three compounding problems. First, data contamination: as models are trained on ever-expanding web-scale datasets, the probability that benchmark questions appear in the training data increases. A model that has memorized the answers to MMLU questions is not demonstrating reasoning ability — it is demonstrating recall. The benchmark number goes up. The capability it was supposed to measure has not changed.
Second, benchmark saturation: when the best models approach the upper bounds of a benchmark's scoring range, the benchmark loses its ability to differentiate. If five models all score above 90% on a benchmark designed with a ceiling of 100%, the benchmark cannot tell you which model is better for your use case. It can only tell you that all five are above the threshold where the measurement stops being useful.
Third, format-capability mismatch: the dominance of multiple-choice question formats in LLM benchmarks means that high benchmark scores may reflect skill at pattern-matching answer choices rather than genuine understanding or generation capability. A model that excels at selecting the correct option from four choices may struggle when asked to generate an original answer to the same question.
The lesson for agent optimization is not that benchmarks are useless. It is that benchmarks must be treated as instruments that require calibration, maintenance, and periodic replacement. A benchmark that was valid for your system six months ago may have been rendered meaningless by changes in the model, the input distribution, or the evaluation landscape. The question is never "what does the benchmark say?" The question is always "does this benchmark still measure what I need it to measure?"
The protocol: benchmark before every optimization
Translating this from principle to practice requires a consistent protocol. Here is a minimal viable benchmarking process for any optimization cycle.
Step 1: Define your metrics before you touch the system. Choose at minimum one efficiency metric, one accuracy metric, and one quality metric. Write them down. Specify exactly how each will be calculated. A metric you cannot precisely define is a metric you cannot consistently measure.
Step 2: Build or select your test dataset. This is a fixed set of inputs that you will run through the system both before and after optimization. The dataset should be representative of real usage — not cherry-picked easy cases, not adversarial edge cases, but a sample that reflects the distribution of inputs your system actually encounters. Keep this dataset stable across optimization cycles. If you need to update it, do so deliberately and re-run previous versions for comparability.
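One lightweight way to keep the dataset honestly stable is to record a content fingerprint of it with every benchmark run; if the fingerprint differs between a before and an after run, the instrument changed and the comparison is void. A minimal sketch, assuming the test cases are JSON-serializable dicts:

```python
import hashlib
import json

def dataset_fingerprint(test_cases):
    """Content hash of the test set, to be logged alongside every run.

    Serialization with sorted keys makes the hash depend only on the
    dataset's content, not on dict ordering or object identity.
    """
    canonical = json.dumps(test_cases, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()[:12]

cases = [{"input": "refund request", "expected": "billing"}]
fp_before = dataset_fingerprint(cases)
fp_after = dataset_fingerprint(list(cases))  # same content, same fingerprint
print(fp_before == fp_after)
```

When you deliberately update the dataset, the fingerprint changes with it, which is exactly the audit trail you want: every result in your log is tied to the specific test set that produced it.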
Step 3: Run the baseline. Execute the current system against the test dataset and record all metrics. Include the date, the system version or configuration, and any environmental factors that might affect results (model version, API endpoint, hardware). This is your before measurement.
Step 4: Make your change. One change at a time, when possible. If you must make multiple changes simultaneously, document all of them so you can at least correlate the combined change with the combined effect, even if you cannot attribute the effect to individual modifications.
Step 5: Run the benchmark again. Same test dataset. Same metrics. Same evaluation protocol. Same environmental conditions, as far as you can control them. Record everything.
Step 6: Compare and interpret. Did the metrics improve? By how much? Did any metrics degrade? Is the improvement statistically meaningful given the variability in your system, or could it be noise? If you are uncertain, run the benchmark multiple times and look at the variance.
Step 7: Record the results. Date, before metrics, after metrics, what changed, and your interpretation. This record is the raw material for the optimization logs in L-0578.
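Steps 3 through 6 can be sketched together in a small harness. The agents, scorer, and the "twice the observed spread" noise threshold are all illustrative assumptions — the threshold is a rough rule of thumb, not a significance test — but the structure mirrors the protocol: repeated runs on a fixed test set, then an interpretation that weighs the delta against the noise:

```python
import statistics

def benchmark(agent, test_cases, score, runs=5):
    """Steps 3 and 5: run the fixed test set several times.

    Returns (mean, stdev) across runs, so the before/after delta can
    be judged against the system's natural run-to-run variability.
    """
    means = []
    for _ in range(runs):
        scores = [score(c, agent(c["input"])) for c in test_cases]
        means.append(sum(scores) / len(scores))
    return statistics.mean(means), statistics.stdev(means)

def interpret(before, after):
    """Step 6: is the delta larger than the run-to-run noise?"""
    (mean_before, sd_before), (mean_after, sd_after) = before, after
    delta = mean_after - mean_before
    noise = max(sd_before, sd_after)
    if abs(delta) <= 2 * noise:  # rough heuristic, not a formal test
        return f"delta {delta:+.3f} within noise ({noise:.3f}); inconclusive"
    verdict = "improvement" if delta > 0 else "regression"
    return f"delta {delta:+.3f} exceeds noise ({noise:.3f}); {verdict}"

# Hypothetical fixture: a two-case test set and exact-match scoring.
cases = [{"input": "a", "expected": "x"}, {"input": "b", "expected": "y"}]

def score(case, output):
    return 1.0 if output == case["expected"] else 0.0

before = benchmark(lambda t: "x", cases, score)                  # old system
after = benchmark(lambda t: {"a": "x", "b": "y"}[t], cases, score)  # new system
print(interpret(before, after))
```

The toy agents here are deterministic, so the noise floor is zero; with a real agent the repeated runs are what make the stdev, and therefore the interpretation, meaningful.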
This protocol adds perhaps thirty minutes to an optimization cycle. That thirty minutes is the difference between knowing you improved the system and hoping you did. Over dozens of optimization cycles, the accumulated records tell you not just whether individual changes worked, but which types of changes tend to work, how large the improvements typically are, and where your optimization efforts have the highest return — intelligence that makes every future optimization sprint from L-0576 more effective.
From measurement to knowledge
Benchmarking before and after is not bureaucratic overhead imposed on creative work. It is the mechanism by which tinkering becomes engineering. Without measurement, every optimization is an anecdote. With measurement, every optimization is a data point. Accumulate enough data points and patterns emerge: which approaches reliably improve performance, which introduce hidden tradeoffs, which dimensions of your system are most responsive to intervention and which are stubbornly resistant.
The optimization sprint gave you dedicated time. The benchmark gives you dedicated truth. What remains is to preserve that truth in a form that compounds — a structured record of what you changed, what you measured, and what happened. That is the work of L-0578.
Sources:
- Thomson, W. (Lord Kelvin) (1889). "Electrical Units of Measurement." Popular Lectures and Addresses, Vol. 1. MacMillan and Company. Lecture delivered May 3, 1883.
- Campbell, D. T. (1969). "Reforms as Experiments." American Psychologist, 24(4), 409-429.
- Goodhart, C. A. E. (1975). "Problems of Monetary Management: The U.K. Experience." Papers in Monetary Economics, Reserve Bank of Australia. Restated by Strathern, M. (1997). "'Improving Ratings': Audit in the British University System." European Review, 5(3), 305-321.
- Campbell, D. T. (1979). "Assessing the Impact of Planned Social Change." Evaluation and Program Planning, 2(1), 67-90. Origin of Campbell's Law.
- Sedai (2025). "Software Performance Optimization: The Expert Guide for 2025." Comprehensive framework for baseline establishment and continuous benchmarking in software systems.
- Evidently AI (2025). "30 LLM Evaluation Benchmarks and How They Work." Survey of benchmark saturation, data contamination, and evaluation methodology limitations in large language model assessment.
- Red Hat Developers (2025). "How Red Hat Has Redefined Continuous Performance Testing." Case study on continuous benchmarking integration in CI/CD pipelines for regression detection.