Your agents are already changing. The question is whether you can see it.
You have processes you rely on. A way you make hiring decisions. A protocol for how you handle conflict. A sequence you run every Sunday evening to set up the week. A heuristic you apply when evaluating whether to say yes to a new commitment.
These are agents — executable patterns that take input, apply logic, and produce output. And they are not static. The way you handle conflict in 2026 is not the way you handled it in 2023. Your weekly review today looks different than it did six months ago. Something changed. Probably multiple things changed, at multiple points, for multiple reasons.
But can you say exactly what changed? Can you point to the moment the update happened? Can you compare the current version to the previous one and explain which performs better, and why? Can you roll back to an earlier version if the latest change made things worse?
If you cannot, your agents are evolving in the dark. You are running unversioned cognitive software — shipping updates with no changelog, no rollback path, and no way to learn from the transition itself.
In L-0586, you learned the difference between evolving an agent incrementally and replacing it entirely. This lesson gives you the mechanism that makes both operations trackable: explicit versioning. You label your agent versions, preserve them, and maintain the diffs between them — so that every change becomes a data point you can inspect rather than a silent overwrite you cannot recover.
What version control actually solved
In April 2005, Linus Torvalds built the first version of Git in roughly ten days. He did it because the Linux kernel — one of the largest collaborative software projects in history — had lost access to BitKeeper, the proprietary tool it had been using for version control. Thousands of developers were contributing code simultaneously, and without version control, they had no way to track what changed, who changed it, when it happened, or how to undo it.
Torvalds' design philosophy was radical for the time: every developer should have a complete copy of the entire project history. Not just the latest version — every version, every branch, every merge. His reasoning was that you cannot make intelligent decisions about the future of a system if you cannot inspect its past. "Being truly distributed means forks are non-issues," Torvalds explained. Anyone can take a copy, experiment, and merge the results back — or not. The version history preserves every path, including the ones that were abandoned.
Before Git and its predecessors, software development suffered from what Tom Preston-Werner — GitHub co-founder — later called "dependency hell." When interconnected components could not communicate whether an update was safe or breaking, the entire system became fragile. Preston-Werner's response was the Semantic Versioning specification (SemVer), now used by virtually every open-source project on earth. The scheme encodes the nature of a change into the version number itself:
- PATCH (1.0.0 to 1.0.1): A small fix. The system works the same way; you corrected an error.
- MINOR (1.0.0 to 1.1.0): A new capability added, but nothing existing was broken. Old behavior still works.
- MAJOR (1.0.0 to 2.0.0): A breaking change. The system behaves fundamentally differently. Consumers must adapt.
The insight behind SemVer is that not all changes are equal, and treating them as equal causes downstream failures. A patch to your morning routine (adjusting your alarm by fifteen minutes) has different implications than a major rewrite (shifting from reactive email-first mornings to proactive deep-work-first mornings). Without a version scheme that captures this distinction, you cannot assess the magnitude of what changed or predict the downstream effects.
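The bump rules above can be sketched in a few lines. This is a minimal illustration of the MAJOR.MINOR.PATCH scheme, not a SemVer library; the `bump` function name is mine:

```python
def bump(version: str, change: str) -> str:
    """Increment a MAJOR.MINOR.PATCH version for a given change type."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":   # breaking change: consumers must adapt
        return f"{major + 1}.0.0"
    if change == "minor":   # new capability, old behavior still works
        return f"{major}.{minor + 1}.0"
    if change == "patch":   # small fix, same behavior
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")

print(bump("1.0.0", "patch"))  # 1.0.1
print(bump("1.0.0", "minor"))  # 1.1.0
print(bump("1.4.2", "major"))  # 2.0.0
```

Note that a minor or major bump resets the lower-order numbers to zero, which is exactly what makes the magnitude of a change readable at a glance.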
Machine learning discovered the same lesson the hard way
Software engineers built version control for code. Then machine learning engineers discovered they needed version control for everything else, too — models, datasets, hyperparameters, evaluation metrics, and the pipelines connecting all of them.
MLflow, originally developed at Databricks, created a model registry that maintains the complete lifecycle of every model version. Each version records the exact parameters, training data, and performance metrics that produced it. When a new version is deployed, the old version is not deleted — it is archived, available for comparison and rollback. As of MLflow 3.0, the registry extends to generative AI applications and AI agents, connecting models to exact code versions, prompt configurations, and evaluation runs.
DVC (Data Version Control) solved the adjacent problem: versioning the data and pipelines that produce models. DVC works alongside Git, storing lightweight metafiles in the repository that point to the actual large datasets stored elsewhere. Every version of your training data, every pipeline configuration, every intermediate artifact gets a traceable snapshot. You can check out any previous state of the entire system — code, data, model, and results — and reproduce it exactly.
Weights & Biases took this further, creating an experiment tracking platform that captures everything needed to compare and reproduce model iterations: git commits, hyperparameters, model weights, performance metrics, and sample predictions. Their artifact system provides lineage tracking — the ability to trace exactly which data version produced which model version with which results.
These tools exist because ML engineers learned through painful experience what happens without them. You train a model, it performs well, you tweak something, it performs worse. Without versioning, you cannot get back to the version that worked. Without tracking what changed between versions, you cannot learn why one outperformed another. And without lineage tracking, you cannot determine whether an improvement came from better data, different hyperparameters, or random chance.
The parallel to your personal agents is direct. You have a conflict-resolution approach that works reasonably well. You adjust it after reading a book on negotiation. It works worse. Without versioning, you cannot get back to the approach that worked. Without a diff, you cannot determine which specific change caused the regression. Without a changelog, three months later you cannot even remember that the change happened.
Changelogs are the artifact that matters most
The Keep a Changelog project, created by Olivier Lacan, established a principle that the software industry spent decades learning: changelogs are for humans, not machines. A good changelog does not dump every technical detail. It organizes changes into categories that tell a reader what actually matters:
- Added — new capabilities
- Changed — modifications to existing behavior
- Deprecated — features marked for future removal
- Removed — features that no longer exist
- Fixed — corrections to broken behavior
The Conventional Commits specification took this further by standardizing commit messages so that the changelog can be partially automated. Each commit begins with a type — feat: for new features, fix: for bug fixes — and breaking changes are flagged with a BREAKING CHANGE: footer (or a ! after the type) to mark changes that alter the contract.
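A rough sketch of how those commit types can be mapped onto Keep a Changelog categories. The mapping table below is an assumption for illustration; real changelog generators apply richer rules:

```python
# Illustrative mapping from Conventional Commits types to
# Keep a Changelog categories (partial, for demonstration only).
TYPE_TO_CATEGORY = {
    "feat": "Added",
    "fix": "Fixed",
    "refactor": "Changed",
}

def categorize(commit_message: str) -> str:
    """Return the changelog category for a conventional commit message."""
    header = commit_message.split(":", 1)[0]
    # Breaking changes are flagged with a footer or a "!" after the type.
    if "BREAKING CHANGE" in commit_message or "!" in header:
        return "Changed (breaking)"
    prefix = header.split("(")[0].strip()
    return TYPE_TO_CATEGORY.get(prefix, "Changed")

print(categorize("feat: add energy-level check"))        # Added
print(categorize("fix: preload calendar data"))          # Fixed
print(categorize("feat!: restructure around outcomes"))  # Changed (breaking)
```

The point of the convention is precisely this mechanical extractability: the human-readable changelog falls out of the commit history instead of being written twice.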
For your personal agents, the changelog is arguably more important than the version number. A version number tells you that something changed. A changelog tells you what changed, why, and what to watch for. Consider:
Agent: Weekly review process

- v1.0 (2025-03): Review task list, clear inbox, plan next week. ~90 minutes.
- v1.1 (2025-06): Added: energy-level check before planning. Changed: plan next week based on energy forecast, not just task priority. Trigger: kept scheduling deep work on days when energy was consistently low.
- v2.0 (2025-10): Breaking change — restructured entirely around outcomes instead of tasks. Review now starts with "What did I actually accomplish?" rather than "What did I plan?" Added: outcome-versus-intention comparison. Removed: task-by-task review. Trigger: realized I was completing tasks without making progress on anything that mattered.
- v2.1 (2026-01): Fixed: outcome review was taking too long because I wasn't preloading data. Added: five-minute data-gathering step at the start (pull calendar, check project notes, review commitments). Trigger: three consecutive reviews where I spent 30 minutes just finding information.
Four versions. Ten months. A complete lifecycle story. Anyone reading this — including your future self — can see the trajectory, understand the reasoning behind each change, evaluate whether the changes were improvements, and roll back to any previous version if the current one fails.
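The same history can be captured as structured data, which makes rendering and comparison mechanical. The field names below (version, date, changes, trigger) are one possible layout, not a standard schema:

```python
# The Weekly Review lifecycle as structured data (illustrative layout).
weekly_review = [
    {"version": "1.0", "date": "2025-03",
     "changes": ["Initial: review task list, clear inbox, plan next week"],
     "trigger": None},
    {"version": "1.1", "date": "2025-06",
     "changes": ["Added: energy-level check before planning",
                 "Changed: plan by energy forecast, not just task priority"],
     "trigger": "kept scheduling deep work on low-energy days"},
    {"version": "2.0", "date": "2025-10",
     "changes": ["Breaking: restructured around outcomes instead of tasks",
                 "Added: outcome-versus-intention comparison",
                 "Removed: task-by-task review"],
     "trigger": "completing tasks without progress on anything that mattered"},
    {"version": "2.1", "date": "2026-01",
     "changes": ["Fixed: preload data in a five-minute gathering step"],
     "trigger": "three reviews spent 30 minutes just finding information"},
]

def render(history):
    """Render the history in Keep a Changelog style, newest first."""
    lines = []
    for entry in reversed(history):
        lines.append(f"## [{entry['version']}] - {entry['date']}")
        lines.extend(f"- {change}" for change in entry["changes"])
        if entry["trigger"]:
            lines.append(f"- Trigger: {entry['trigger']}")
    return "\n".join(lines)

print(render(weekly_review))
```

Once the history is data rather than prose, questions like "how many breaking changes in the last year?" become one-line queries.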
The callback to schema versioning — and why agents are different
In L-0305, you learned to version your schemas — the mental models and beliefs that structure your thinking. Schema versioning tracks how your understanding changes. You applied Semantic Versioning to beliefs: patch-level refinements, minor additions of nuance, major rewrites when the entire frame had to be replaced.
Agent versioning borrows the same mechanics but applies them to a different layer. Schemas are what you believe. Agents are what you do. A schema is your model of how motivation works. An agent is the specific protocol you run every Monday morning to set your priorities based on that model.
The relationship matters because schema changes often demand agent changes, but not always vice versa. When your schema about motivation undergoes a major version change — say, from "people are motivated by rewards" to "people are motivated by autonomy, mastery, and purpose" — every agent downstream that was built on the old schema needs re-evaluation. Your hiring agent, your delegation agent, your feedback agent all carry assumptions from the old schema that may now be wrong.
But an agent can also version independently of its underlying schema. Your weekly review process might go through five versions while your schema about productivity stays the same — because you are optimizing the execution, not changing the theory. Tracking both version histories lets you distinguish between these cases. When an agent version fails, the diagnostic question is: did the execution change (agent-level issue), or did the underlying model change (schema-level issue)? Without separate version histories for both, you cannot answer that question.
The rollback test
Here is the practical test that separates real version control from the illusion of it: can you roll back?
If your current conflict resolution approach is not working, can you articulate and re-execute the previous version? Not a vague sense of "I used to handle this differently." The actual previous protocol, step by step, with enough fidelity that you could run it again.
If you cannot, you have version labels but not version control. The distinction matters. Version labels are metadata. Version control is the ability to retrieve, compare, and restore previous states.
Software learned this distinction through catastrophic failures. Deployment goes wrong, the new version is broken, you need to roll back to the version that was running an hour ago. If your version control system cannot produce that previous version instantly and completely, you have a crisis. The entire discipline of continuous deployment — shipping code to production dozens of times per day — depends on the guarantee that any deployment can be reversed in minutes.
Your cognitive agents operate under the same logic, just at a slower timescale. You updated your approach to managing your team. The new version is producing worse outcomes. Can you restore the previous version cleanly enough to stop the bleeding while you diagnose the problem? Or has the old version been silently overwritten, leaving you with no fallback except trying to remember what you used to do?
The rollback test also reveals a deeper truth: some agents are not worth versioning. If an agent is trivial — your process for choosing which coffee to order — the cost of version control exceeds the benefit. The agents worth versioning are the ones where a bad version produces meaningful negative consequences and where the ability to compare versions would generate genuine learning.
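The rollback test translates directly into code: a store that archives every version and can restore any earlier one passes; a store that keeps only the latest version fails. The `AgentStore` class below is an illustrative sketch, not a real tool:

```python
class AgentStore:
    """Minimal version store that never overwrites history (sketch)."""

    def __init__(self, name: str):
        self.name = name
        self.history = []  # every version ever recorded, in order

    def update(self, version: str, protocol: str):
        """Record a new version; old versions stay archived."""
        self.history.append({"version": version, "protocol": protocol})

    def current(self):
        return self.history[-1]

    def rollback(self):
        """Restore the previous version by re-recording it as current."""
        if len(self.history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        previous = self.history[-2]
        self.history.append(dict(previous))
        return previous

conflict = AgentStore("Conflict Protocol")
conflict.update("1.0", "Pause, restate their position, then respond")
conflict.update("1.1", "Respond immediately with counterarguments")
restored = conflict.rollback()
print(restored["version"])  # 1.0 -- the version that worked is recoverable
```

Note that rollback here is itself an append, not a deletion: the failed v1.1 stays in the history as evidence, which is what lets you learn from the regression instead of merely escaping it.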
How to version your agents in practice
The protocol maps directly from software engineering to personal infrastructure:
1. Name your agents. You cannot version what you have not identified. Give each significant agent a name: "Weekly Review," "Conflict Protocol," "Hiring Evaluation," "Morning Sequence," "Decision Framework for New Commitments." The name makes it addressable.
2. Capture the current version. Write down what the agent actually does — not what you aspire for it to do, but what you actually execute. Be specific enough that someone else could run it. This is your v_current.
3. Assign a version number. If this is the first time you are documenting it, call it v1.0 — even if you know it has changed before. You are starting version control now; what came before is unversioned history.
4. When you change it, increment the version and write the diff. In a simple two-part scheme, small adjustments — tuning a parameter, adding a minor step — bump the second number (v1.0 to v1.1). Adding significant new capability without breaking the core structure is also a second-number bump (v1.1 to v1.2); jumping to v2.0 would overstate the change. Restructuring the entire agent — changing its core logic, its inputs, or its outputs — is a major version (v1.x to v2.0).
5. Record the trigger. Every version change has a cause. What happened that made you update? A failure? New information? A change in context? The trigger is the most valuable part of the changelog because it connects the version history to your actual experience.
6. Preserve old versions. Do not overwrite. Archive. The old version is not an error to be deleted — it is a waypoint that makes your trajectory visible. You may need it for rollback. You will certainly need it for learning.
7. Review periodically. Once a quarter, scan your versioned agents. Which ones have been stable? Which ones are churning? An agent that hits v7.3 within six months either lives in a domain of rapid change or signals that you are thrashing without converging. The version history tells you which.
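The seven steps above can be condensed into a minimal sketch. The `AgentRegistry` class is illustrative only; it uses the document's two-part version scheme and invented field names:

```python
class AgentRegistry:
    """Sketch of the versioning protocol for personal agents."""

    def __init__(self):
        self.agents = {}  # name -> list of version entries, oldest first

    def capture(self, name: str, protocol: str):
        """Steps 1-3: name the agent and record its current form as v1.0."""
        self.agents[name] = [{"version": "1.0", "protocol": protocol,
                              "trigger": "initial capture"}]

    def revise(self, name: str, protocol: str, trigger: str, major: bool = False):
        """Steps 4-6: increment the version, record the trigger, archive the old."""
        history = self.agents[name]
        hi, lo = (int(x) for x in history[-1]["version"].split("."))
        version = f"{hi + 1}.0" if major else f"{hi}.{lo + 1}"
        history.append({"version": version, "protocol": protocol,
                        "trigger": trigger})

    def churn(self, name: str) -> int:
        """Step 7: how many revisions has this agent accumulated?"""
        return len(self.agents[name]) - 1

registry = AgentRegistry()
registry.capture("Weekly Review", "Review tasks, clear inbox, plan week")
registry.revise("Weekly Review", "Add energy check before planning",
                trigger="deep work kept landing on low-energy days")
registry.revise("Weekly Review", "Restructure around outcomes",
                trigger="tasks done, no progress", major=True)
print(registry.agents["Weekly Review"][-1]["version"])  # 2.0
print(registry.churn("Weekly Review"))                  # 2
```

The quarterly review in step 7 then reduces to scanning `churn` across all agents: low numbers mark settled domains, high numbers mark domains still under active development, or thrashing.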
What versioning makes visible
When your agents carry explicit version histories, patterns emerge that were previously invisible:
Rate of change. Some agents stabilize quickly and hold steady for years. Others require constant iteration. The version frequency tells you which domains of your life are settled and which are still in active development.
Correlated updates. When your communication agent and your leadership agent version at the same time, that is probably a single underlying shift expressing itself in two domains. Without versioning, you would see them as unrelated changes.
Patch fatigue. When an agent reaches v1.8 — eight patches on the same core architecture — it is often a signal that patches are no longer sufficient. The accumulated workarounds indicate that the next version should be a v2.0 rewrite, or the agent should be retired entirely.
Learning rate. The quality of your diffs improves over time. Early versions of your changelog will be vague ("changed because it wasn't working"). Later versions will be precise ("changed step 3 from open-ended brainstorming to time-boxed brainstorming because sessions were running over 45 minutes without convergence, and the constraint improved output quality in 4 out of 5 subsequent uses"). The increasing precision of your changelogs is itself a version history — of your ability to observe and articulate your own cognitive infrastructure.
This version history is exactly what you will need in L-0588, where the question shifts from "how do I track changes to my agents?" to "how do I know when an agent should stop being changed and start being retired?" The version log is the primary evidence. An agent that has been patched repeatedly, that keeps failing in the same ways, that has drifted so far from its original design that the v1.0 and the current version share almost nothing — that agent is not a candidate for v_next. It is a candidate for deprecation.
Your agents will change whether you track the changes or not. The only question is whether you can see the evolution happening — or whether each update silently overwrites the last, leaving you with no history, no comparison, and no way to learn from the transition itself.
Label the versions. Write the diffs. Keep the old ones. Your future self is the operator of your cognitive infrastructure, and they deserve a changelog.