Your agent runs perfectly. It just doesn't work.
You built a weekly review habit. Every Sunday at 4 PM, you sit down with your notebook, review the past seven days, and plan the next seven. You have not missed a single week in three months. By any reliability metric, this agent is performing flawlessly.
But here is the question you have not asked: does the weekly review actually change what happens the following week? When you plan to prioritize deep work on Monday, do you actually do deep work on Monday? When you identify a relationship you've been neglecting, do you reach out? When you notice a recurring time sink, do you eliminate it?
If your answer is "sometimes" or "I'm not sure," then you have a reliable agent with an unknown effectiveness rating. You are monitoring whether the machine turns on, not whether it produces the thing it was built to produce. And that distinction -- between running and working, between output and outcome, between efficiency and effectiveness -- is one of the most consequential measurement errors in both organizational science and personal cognitive systems.
The Drucker distinction: doing things right versus doing the right things
Peter Drucker, in The Effective Executive (1967), drew a line that management science has been refining ever since: "Efficiency is doing things right. Effectiveness is doing the right things." He was not being cute. He was identifying a category error that destroys organizations and individuals alike.
Efficiency asks: given this task, are we executing it with minimal waste? Effectiveness asks: should we be executing this task at all, and is it producing the result we need? Drucker went further in Management: Tasks, Responsibilities, Practices (1974): "Effectiveness is the foundation of success -- efficiency is a minimum condition for survival after success has been achieved. Efficiency is concerned with doing things right. Effectiveness is doing the right things."
This maps directly onto agent monitoring. L-0545 taught you to track reliability -- does your agent fire when it should and stay silent when it shouldn't? That is an efficiency-class metric. It tells you the machine is doing things right. But effectiveness asks the prior question: when the machine fires, does the intended outcome occur in the world?
A morning journaling agent that fires every day (reliable) but never surfaces an insight you act on (ineffective) is an efficient failure. A conflict-resolution protocol that activates every time you sense interpersonal tension (reliable) but leaves the actual conflict unresolved 70% of the time (ineffective) is a well-oiled machine pointed at the wrong target. Reliability is necessary but not sufficient. Effectiveness is what you actually care about.
Output metrics versus outcome metrics
The organizational measurement literature draws a sharp distinction between outputs and outcomes. An output is what a process produces -- the deliverable, the artifact, the thing that rolls off the assembly line. An outcome is the change in the world that the output was supposed to create.
A hospital's output is surgeries performed. Its outcome is patients who recover. A school's output is classes taught. Its outcome is students who learn. A nonprofit's output is meals served. Its outcome is food insecurity reduced. In every case, the output can be high while the outcome is low. You can perform more surgeries and have worse patient outcomes. You can teach more classes and have declining test scores. You can serve more meals while the underlying problem grows.
The same structure applies to your cognitive agents. Consider a decision-making heuristic -- say, the pre-mortem technique, where before committing to a plan, you imagine it has already failed and work backward to identify causes. The output of this agent is a pre-mortem analysis. The outcome is better decisions -- specifically, decisions where the risks you identified in the pre-mortem were either mitigated or didn't materialize. If you run pre-mortems religiously (high output) but your decision quality doesn't improve (low outcome), the agent is producing artifacts without producing results.
Jeff Gothelf, who writes extensively on outcome-driven product development, captures the trap concisely: teams celebrate shipping features (outputs) when they should be measuring whether user behavior changed (outcomes). The feature is not the point. The behavior change is the point. Your agent firing is not the point. The intended state change in your thinking, behavior, or environment is the point.
How to define an effectiveness metric for any agent
An effectiveness metric answers one question: when this agent fires, does the intended outcome actually occur?
Defining this metric requires three things:
1. A specific intended outcome. Not "think more clearly" but "identify the highest-leverage task for the day." Not "handle conflict better" but "reach explicit agreement on next steps within the conversation." The outcome must be concrete enough that you can observe whether it happened. If your intended outcome is vague, your effectiveness metric will be meaningless -- you will always be able to claim partial success.
2. An observable indicator. You need evidence that the outcome occurred or didn't. For a planning agent, the indicator might be: did I complete the task I identified as highest-priority? For a conflict-resolution agent: did both parties confirm agreement on next steps? For an information-processing agent: did I correctly identify the key claim in the source material, as verified by a later check? The indicator must exist outside your subjective feeling about whether the agent "worked." Feelings are unreliable here because of the same metacognitive blind spots L-0006 documented.
3. A time horizon. Effectiveness often has a delay. Your morning planning agent's effectiveness cannot be measured at 7:15 AM -- it can only be measured at end of day, when you check whether the plan survived contact with reality. A weekly review's effectiveness might only be visible after the following week. Define when you'll check.
With these three elements, the metric is simple arithmetic: how many times did the intended outcome occur, divided by how many times the agent fired? That ratio is your effectiveness rate.
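That arithmetic can be sketched in a few lines of Python. The `Firing` record and its field names are illustrative, not a prescribed format; the only requirement is that each firing carries the observable indicator checked at the defined horizon:

```python
from dataclasses import dataclass

@dataclass
class Firing:
    """One activation of an agent, checked at the defined time horizon."""
    fired_at: str           # e.g. "2024-03-04"
    outcome_occurred: bool  # the observable indicator, checked after the horizon

def effectiveness_rate(firings: list[Firing]) -> float:
    """Intended outcomes divided by total firings."""
    if not firings:
        return 0.0
    return sum(f.outcome_occurred for f in firings) / len(firings)

# Four weekly-review firings, one of which produced no follow-through.
log = [Firing("2024-03-04", True), Firing("2024-03-11", False),
       Firing("2024-03-18", True), Firing("2024-03-25", True)]
print(effectiveness_rate(log))  # 0.75
```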
The precision and recall lens
Machine learning offers a useful framework for thinking about effectiveness metrics with more granularity. In classification systems, two metrics capture different failure modes:
Precision measures: of all the times the system said "positive," how many were actually positive? Applied to agents: of all the times your agent fires and you act on its output, how often does that action produce the intended result? Low precision means your agent generates a lot of false signals -- it triggers action, but the action doesn't work.
Recall measures: of all the actual positive cases, how many did the system catch? Applied to agents: of all the situations where the agent's intended outcome was needed, how many did the agent actually deliver? Low recall means your agent misses opportunities -- situations arise where its outcome would have been valuable, but it either didn't fire or fired ineffectively.
A conflict-resolution agent with high precision but low recall successfully resolves conflicts when it fires, but fires in too few of the situations that need it. An agent with high recall but low precision fires in every relevant situation but resolves the conflict only 30% of the time.
The balance between precision and recall depends on context. For high-stakes agents (financial decisions, career choices), you want high precision -- when the agent fires and you act, it needs to work. For broad-coverage agents (daily prioritization, information filtering), you might tolerate lower precision in exchange for high recall -- you'd rather catch every relevant situation and accept some misfires than miss critical ones.
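Applied to an agent, the two rates reduce to two ratios over your firing log. A sketch, with counts invented to match the conflict-resolution example above (high precision, low recall):

```python
def agent_precision(successes: int, firings: int) -> float:
    """Of all the times the agent fired and you acted, how many worked?"""
    return successes / firings if firings else 0.0

def agent_recall(successes: int, opportunities: int) -> float:
    """Of all the situations that needed the outcome, how many got it?"""
    return successes / opportunities if opportunities else 0.0

# Conflict-resolution agent: 10 tense situations arose, the agent
# fired in 4 of them, and 3 of those firings resolved the conflict.
print(agent_precision(successes=3, firings=4))      # 0.75 -> high precision
print(agent_recall(successes=3, opportunities=10))  # 0.3  -> low recall
```

Counting `opportunities` is the hard part in practice: it requires noticing, after the fact, the situations where the agent should have fired but didn't, which is exactly what reliability tracking surfaces.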
Leading indicators predict; lagging indicators confirm
Effectiveness metrics themselves come in two temporal flavors, and confusing them creates measurement problems.
Lagging indicators tell you what already happened. Your weekly review's effectiveness rate over the past month is a lagging indicator. It confirms whether the agent has been working. It is accurate but not actionable in real time -- by the time you have the data, the outcomes have already occurred.
Leading indicators predict what will happen. The quality of your morning plan (specificity, feasibility, alignment with weekly goals) might be a leading indicator of whether the plan will survive the day. The depth of your pre-mortem analysis (number of risks identified, specificity of mitigations) might predict decision quality. Leading indicators are less accurate but more actionable -- they give you a signal before the outcome is locked in.
You need both. Lagging indicators tell you whether to keep, modify, or retire an agent. Leading indicators tell you, in the moment, whether this particular firing is likely to produce the intended outcome -- giving you a chance to intervene if the signal looks weak.
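Operationally, the two indicator types differ in when they can be computed. A toy sketch, where the leading-indicator heuristic (at most three priorities, each timeboxed) is an assumption chosen for illustration, not a validated rule:

```python
def lagging_rate(outcomes: list[bool], window: int = 4) -> float:
    """Lagging indicator: effectiveness over the most recent firings.
    Accurate, but only available after the outcomes are known."""
    recent = outcomes[-window:]
    return sum(recent) / len(recent) if recent else 0.0

def leading_flag(n_priorities: int, each_timeboxed: bool) -> bool:
    """Toy leading indicator (assumed heuristic): a plan with more than
    three priorities, or without timeboxes, is unlikely to survive the day."""
    return n_priorities <= 3 and each_timeboxed

history = [True, False, True, True, True]  # did each day's plan hold?
print(lagging_rate(history))  # 0.75 over the last four firings
print(leading_flag(3, True))  # True: this morning's plan looks viable
```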
Goal attainment scaling: measuring degrees of effectiveness
Not all outcomes are binary. Your planning agent doesn't simply "work" or "not work." Some days the plan holds perfectly. Some days you complete two of three priorities. Some days the plan collapses entirely but you salvage one important action. Binary effectiveness metrics lose this nuance.
Kiresuk and Sherman (1968) developed Goal Attainment Scaling (GAS) to solve exactly this problem in clinical settings. Their method defines five levels for each goal, ranging from "much less than expected" (-2) to "much more than expected" (+2), with "expected outcome" at zero. Each agent firing gets scored on this scale, and the scores can be aggregated into a standardized metric that captures both the direction and degree of effectiveness.
Applied to your agents, the scale might look like this for a morning planning agent:
- -2: Plan abandoned entirely, no priority tasks attempted
- -1: One of three priorities completed, significant drift to reactive work
- 0: Two of three priorities completed (expected outcome)
- +1: All three priorities completed
- +2: All three completed plus unexpected high-value opportunity captured
This graded approach gives you a richer signal than a binary "worked / didn't work." Over time, you can track whether your agent's GAS scores are trending upward (the agent is improving), flat (stable effectiveness), or downward (degrading -- intervention needed). It also prevents the discouragement that comes from binary scoring, where a day that was mostly successful gets the same "failure" mark as a day that completely collapsed.
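A minimal sketch of tracking these scores over time. Note two simplifications: this uses a plain average rather than Kiresuk and Sherman's standardized T-score, and the 0.25 trend threshold is an arbitrary choice for illustration:

```python
def gas_average(scores: list[int]) -> float:
    """Mean goal-attainment score on the -2..+2 scale; 0 = expected outcome."""
    assert all(-2 <= s <= 2 for s in scores)
    return sum(scores) / len(scores)

def trend(scores: list[int]) -> str:
    """Compare the first and second halves of the history (simple heuristic)."""
    half = len(scores) // 2
    early, late = scores[:half], scores[half:]
    delta = sum(late) / len(late) - sum(early) / len(early)
    if delta > 0.25:
        return "improving"
    if delta < -0.25:
        return "degrading"
    return "stable"

scores = [-1, 0, 0, 1, 0, 1, 1, 2]  # eight firings of the planning agent
print(gas_average(scores))  # 0.5
print(trend(scores))        # improving
```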
Goodhart's law: when effectiveness metrics become targets
There is a trap embedded in all measurement, and it applies to self-measurement with the same force it applies to organizational measurement.
Charles Goodhart, a British economist, articulated the underlying insight in 1975; the popular phrasing of what is now known as Goodhart's Law -- "When a measure becomes a target, it ceases to be a good measure" -- is anthropologist Marilyn Strathern's later distillation. Donald Campbell stated the same insight more forcefully: "The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor."
In organizations, this manifests as hospitals reducing wait times by redefining when "waiting" starts, schools improving test scores by teaching to the test rather than teaching the subject, and developers closing more tickets by splitting one task into five.
In personal cognitive systems, the same dynamic operates. If your effectiveness metric for a weekly review is "percentage of planned tasks completed," you will unconsciously start planning easier tasks. If your metric for a conflict-resolution agent is "percentage of conflicts where agreement is reached," you will start avoiding the hardest conflicts or accepting shallow agreements that don't actually resolve anything. The metric gets better. The actual effectiveness gets worse.
The defense against Goodhart's Law is to monitor multiple dimensions of effectiveness simultaneously, making it difficult to game one without revealing the distortion in another. Track task completion rate alongside task difficulty. Track conflict resolution rate alongside relationship quality over time. Track decision accuracy alongside the stakes of decisions you're willing to make. No single metric is safe. A constellation of metrics is harder to corrupt.
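One way to operationalize the constellation idea: pair each headline metric with companion metrics and flag any companion that degrades while the headline improves. A sketch with invented numbers and an assumed 15% tolerance:

```python
def goodhart_flags(current: dict[str, float], baseline: dict[str, float],
                   tolerance: float = 0.15) -> list[str]:
    """Return the names of metrics that dropped more than `tolerance`
    (as a fraction of baseline) -- a sign a headline metric may be gamed."""
    flags = []
    for name, base in baseline.items():
        if base and (base - current.get(name, 0.0)) / base > tolerance:
            flags.append(name)
    return flags

# Task completion rose -- but only because planned tasks got easier.
baseline = {"completion_rate": 0.70, "avg_difficulty": 3.4}  # difficulty self-rated 1-5
current  = {"completion_rate": 0.85, "avg_difficulty": 2.1}
print(goodhart_flags(current, baseline))  # ['avg_difficulty']
```

The flag does not prove gaming; it marks where to look. Here the improved completion rate is exactly the distortion the paired difficulty metric reveals.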
The effectiveness audit
Here is a concrete protocol for auditing the effectiveness of an agent you're already monitoring for reliability.
Step 1: State the intended outcome. Write one sentence: "When this agent fires, the intended result is ___." Be specific enough that an outside observer could verify whether it happened.
Step 2: Define the observable indicator. What evidence, external to your subjective sense, would confirm the outcome? A completed task, a documented decision, a behavioral change, a measured improvement?
Step 3: Set the time horizon. When, after firing, should you check for the outcome? Same day? End of week? After the next relevant situation arises?
Step 4: Score the last five firings. Go back through your recent monitoring data and score each instance. Use the five-point GAS scale if binary feels too coarse.
Step 5: Calculate and record. Effectiveness rate = instances where outcome occurred / total firings. GAS average = sum of scores / number of firings.
Step 6: Compare to reliability. Your reliability score from L-0545 tells you how often the agent fires when it should. Your effectiveness score tells you how often that firing produces the intended outcome. The gap between these two numbers is the precise measure of wasted activation -- your agent runs but does not work.
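Steps 4 through 6 reduce to a few lines of arithmetic. A sketch with invented scores; the reliability figure stands in for whatever your L-0545 tracking produced:

```python
# Last five firings: (outcome_occurred, GAS score on the -2..+2 scale)
firings = [(True, 1), (False, -1), (True, 0), (False, -2), (True, 0)]

effectiveness = sum(ok for ok, _ in firings) / len(firings)
gas_avg = sum(g for _, g in firings) / len(firings)
reliability = 0.90  # assumed: from your existing reliability tracking

print(f"effectiveness rate:    {effectiveness:.2f}")              # 0.60
print(f"GAS average:           {gas_avg:+.2f}")                   # -0.40
print(f"wasted activation gap: {reliability - effectiveness:.2f}")  # 0.30
```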
Most people who complete this audit discover their effectiveness scores are 20-40 percentage points below their reliability scores. The agents fire, but the outcomes don't follow. This is not a reason for discouragement. It is a measurement result that makes improvement possible. You cannot optimize what you have not measured, and until now, you were measuring the wrong thing.
From effectiveness to efficiency
Effectiveness asks whether the right outcome occurs. It does not ask how much that outcome costs in time, energy, or opportunity. An agent that takes three hours to produce a result that could be produced in thirty minutes is fully effective and wildly inefficient.
That is the territory of the next lesson. L-0547 introduces time-to-fire metrics -- measuring how quickly each agent responds to its trigger. Where effectiveness ensures the agent produces the intended outcome, efficiency metrics ensure it does so without unnecessary cost. You need both: an agent that produces the right result slowly is effective but may not be sustainable, and an agent that fires instantly but produces the wrong result is efficient at wasting your time.
Effectiveness comes first in the measurement sequence because efficiency without effectiveness is precisely what Drucker warned against: doing the wrong thing with great skill. Establish that your agent produces the intended outcome. Then, and only then, optimize the speed and cost of getting there.