Why AI Coding Agents Fail at Long-Term Maintenance
Measuring code entropy resistance with DriftBench v2.0
I've been using AI coding agents as a primary collaborator for almost two years — not as autocomplete, but as something closer to an engineer I hand a task to and come back to.
The failure mode I kept running into wasn't the obvious one. The agents weren't bad at writing code. They were bad at not breaking the code they'd already written.
The clearest example came with Codex. Its "autonomous mode" was the pitch: hand it a refactoring task, walk away, come back to a cleaner codebase. And it delivered — the authentication module was restructured, tests passed, everything looked correct. Three days later, a settings page had silently stopped saving user preferences. The agent had modified a shared utility function mid-refactor, never ran the unrelated tests, and moved on. No crash. No error. I only found out because a user told me.
The second failure was subtler. I was building a website — clean, minimal, nothing unusual. Around the third or fourth session, something shifted. The components I'd built early on were tight and consistent. The new ones had purple gradients, drop shadows stacked on drop shadows, Inter font, three icon cards arranged in a grid. The kind of layout you'd get if you averaged every Tailwind tutorial on GitHub from 2020 to 2024. Without explicit style constraints, the agent stopped making design decisions and started regressing to its training data's statistical average.
Both failures share the same root cause: AI coding agents are optimized to complete the task in front of them. They are not optimized to preserve the system they're operating inside. And if an agent just piles on if/else statements, or quietly rewrites a shared utility, or drifts toward its training average — the codebase degrades. I call this Code Entropy. And there was no benchmark that measured it.
So I built one.
The Problem with Snapshots
Here's the thing about current coding benchmarks: they hand an agent a bug report, a codebase, and ask — can you fix it? If the tests pass, the agent gets a point.
This is the equivalent of evaluating a chef by asking them to chop a single onion, rather than asking them to run a kitchen for a week. It misses the cumulative effect of decisions. When an agent fixes a bug by adding a global variable, it passes the test today. But tomorrow, when another agent tries to add a feature, that global variable causes a cascade of failures.
We need a way to measure not just if the agent can write code, but whether the codebase gets better or worse over time. We need to measure Entropy Resistance.
DriftBench v2.0
DriftBench is a benchmark designed to evaluate AI agents on continuous software evolution. Instead of single, isolated tasks, DriftBench evaluates agents on a Task Chain: a sequence of five consecutive tasks (feature additions, bug fixes, and refactoring requests) on the exact same codebase.
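To make the task-chain idea concrete, here is a minimal sketch of how one chain might be represented. The Task structure and the task descriptions are illustrative, not DriftBench's actual format:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    kind: str                    # "feature", "bugfix", or "refactor"
    prompt: str                  # instruction handed to the agent
    new_tests: list[str] = field(default_factory=list)  # tests that must pass after this step

# A hypothetical 5-step chain for a TODO-manager project.
CHAIN = [
    Task("feature", "Add due dates to TODO items", ["test_due_dates.py"]),
    Task("bugfix", "Fix duplicate IDs when items are added in quick succession", ["test_unique_ids.py"]),
    Task("feature", "Add tags and tag-based filtering", ["test_tags.py"]),
    Task("refactor", "Replace module-level global state with a TodoStore class"),
    Task("feature", "Persist the TodoStore to disk between runs", ["test_persistence.py"]),
]
```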
v2.0 is a ground-up rewrite. It now evaluates across 4 seed projects, 3 LLM models, and 7 scoring dimensions:
- Functional Correctness (25%): Does the new code pass the new tests?
- Regression Resistance (25%): Does the new code break previously passing tests?
- Entropy Resistance (10%): How much did the Cyclomatic Complexity grow?
- Structural Erosion (10%): Did the code degrade from many small functions to one monolith?
- Architectural Consistency (10%): Does the code feel like it was written by one author?
- Refactor Awareness (10%): Did the agent proactively manage technical debt?
- Engineering Taste (10%): Are the variable names, error handling, and abstractions sensible?
The first four are measured deterministically via test suites and static analysis (Cyclomatic Complexity via radon, AST parsing for structural metrics). The last three use multi-model LLM-as-a-Judge cross-validation for reliability.
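As an illustration, the deterministic side of the scoring can be sketched in a few lines. The weights mirror the percentages above, but the function names and the exact aggregation are assumptions on my part, not the real grader:

```python
from radon.complexity import cc_visit

def total_cyclomatic_complexity(source: str) -> int:
    # Sum of cyclomatic complexity over every function and class radon finds.
    return sum(block.complexity for block in cc_visit(source))

# Dimension weights, matching the breakdown listed above.
WEIGHTS = {
    "functional_correctness": 0.25,
    "regression_resistance": 0.25,
    "entropy_resistance": 0.10,
    "structural_erosion": 0.10,
    "architectural_consistency": 0.10,
    "refactor_awareness": 0.10,
    "engineering_taste": 0.10,
}

def composite_score(dimension_scores: dict[str, float]) -> float:
    # Assumes each dimension score is already normalized to 0-100.
    return sum(WEIGHTS[d] * dimension_scores[d] for d in WEIGHTS)
```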
Key Technical Improvements in v2.0
| Feature | v1.0 | v2.0 |
|---|---|---|
| Seed Projects | 1 | 4 |
| Test Granularity | File-level | Test-case-level |
| Entropy Tracking | Head-tail only | Per-step trajectory |
| Test Isolation | Visible to agent | Hidden from sandbox |
| LLM Judge | Single model | Multi-model cross-validation |
| Models Tested | 1 | 3 (gpt-4.1-mini, gpt-4.1-nano, gemini-2.5-flash) |
| Scoring Dimensions | 6 | 7 |
The Experiment
I ran 16 experiments: 4 seed projects × 4 agents (Naive Baseline + 3 LLM models). Each seed project goes through a 5-step task chain: feature addition, bug fix, feature addition, refactoring, and evolution.

The four seed projects span a range of complexity:
- todo_api: A REST-like TODO manager with global state → class refactoring (see the sketch after this list)
- calculator: A calculator with history tracking — simple state, pure functions
- markdown_parser: A Markdown-to-HTML converter with regex conflicts
- file_manager: An in-memory file system with tree structures and path normalization
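For a sense of the starting point, here is a hypothetical flavor of what the todo_api seed looks like before the refactoring step: module-level global state that the agent is later asked to fold into a class. The names and structure are illustrative, not the actual project code:

```python
# Global state that Step 4 asks the agent to replace with a class.
TODOS: dict[int, dict] = {}
_next_id = 1

def add_todo(title: str) -> int:
    global _next_id
    todo_id = _next_id
    _next_id += 1
    TODOS[todo_id] = {"title": title, "done": False}
    return todo_id

def complete_todo(todo_id: int) -> None:
    TODOS[todo_id]["done"] = True
```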

The Refactor Trap
The most striking finding is what I call The Refactor Trap. In 3 out of 4 seed projects, every LLM agent suffered 50–100% regression rates at Step 4 — the refactoring step — even when it had passed all previous steps perfectly.

This isn't a single-task anomaly. The pattern is consistent: agents can add features and fix bugs incrementally, but when asked to restructure the code (e.g., moving from global functions to a class), they lose track of the implicit contracts between components. The refactoring itself is often correct in isolation — the new class structure works — but the agent forgets to update how existing features interact with the new architecture.
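The regression numbers above come down to a simple ratio. Here is one straightforward way to compute such a rate, which may not be the exact formula the grader uses:

```python
def regression_rate(passing_before: set[str], passing_after: set[str]) -> float:
    """Fraction of tests that passed before a step but fail after it."""
    if not passing_before:
        return 0.0
    broken = passing_before - passing_after
    return len(broken) / len(passing_before)

# Example: if 4 of 8 previously passing tests break at the refactor step,
# the regression rate for that step is 0.5.
```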
Model Comparison
Running three different models reveals that "entropy resistance" is orthogonal to raw coding ability.

| Seed Project | gpt-4.1-mini | gpt-4.1-nano | gemini-2.5-flash | Naive Baseline |
|---|---|---|---|---|
| todo_api | 62.6 | 56.4 | 65.1 | 66.2 |
| calculator | 84.7 | 81.9 | 84.6 | 53.8 |
| markdown_parser | 73.1 | 73.0 | 55.1 | 51.9 |
| file_manager | 48.8 | 50.9 | 43.9 | 50.8 |
| Average | 67.3 | 65.6 | 62.2 | 55.7 |
gemini-2.5-flash matches gpt-4.1-mini on simple tasks (calculator: 84.6 vs 84.7) but collapses on complex ones (file_manager: 43.9, with a 100% regression rate). This suggests that a model's ability to resist entropy is a separate capability from its ability to solve isolated coding problems.

Entropy Trajectory
A key improvement in v2.0 is per-step entropy tracking. Instead of only measuring complexity at the start and end of the task chain, DriftBench now records Cyclomatic Complexity and Structural Erosion after every step.

The trajectory reveals a nuance that aggregate metrics miss: refactoring can reduce complexity (the CC drops at Step 4) while simultaneously increasing structural erosion (the code becomes more monolithic). This is exactly the kind of trade-off that DriftBench is designed to surface.
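A rough sketch of what per-step recording can look like is below. The structural-erosion proxy here (average function length via the ast module) is my simplification, not the benchmark's actual metric:

```python
import ast
from radon.complexity import cc_visit

def snapshot(source: str) -> dict[str, float]:
    """Metrics recorded after each step of the task chain."""
    tree = ast.parse(source)
    funcs = [n for n in ast.walk(tree)
             if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
    avg_len = (sum(n.end_lineno - n.lineno + 1 for n in funcs) / len(funcs)
               if funcs else 0.0)
    return {
        "cyclomatic_complexity": float(sum(b.complexity for b in cc_visit(source))),
        "avg_function_length": avg_len,   # crude monolith indicator
        "function_count": float(len(funcs)),
    }

# trajectory = [snapshot(source_after_step) for source_after_step in chain_snapshots]
```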
The Naive Baseline Paradox
One of the most counterintuitive findings: on todo_api, the Naive Baseline (which simply appends code without modifying existing logic) scores higher than all three LLM agents (66.2 vs 62.6/56.4/65.1).
This happens because the Naive Baseline has a 0% regression rate — it never touches existing code, so it never breaks anything. The LLM agents, by contrast, attempt the refactoring and break 40–56% of previous tests.
This paradox is exactly why multi-dimensional scoring matters. A single metric (pass rate or regression rate alone) would give a misleading picture. The Naive Baseline "wins" on regression resistance but produces an unreadable, unmaintainable monolith. The LLM agents write better code but can't preserve backward compatibility during structural changes.
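For context, the baseline's behavior is essentially this (a simplified sketch of what it does, not its actual implementation):

```python
def naive_baseline_step(existing_source: str, new_code: str) -> str:
    # Append-only "agent": it never edits existing lines, so it can never
    # break a previously passing test, but the file grows into a monolith.
    return existing_source.rstrip("\n") + "\n\n" + new_code + "\n"
```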
Limitations and Future Work
DriftBench v2.0 is still a proof-of-concept with clear limitations:
- Scale: 4 seed projects is better than 1, but still not enough for statistical significance. The target is 20+.
- Agent realism: The current LLM agents are single-turn. Real agents (Cursor, Devin, OpenHands) use multi-turn interaction, tool use, and self-repair loops.
- LLM Judge reliability: Multi-model cross-validation helps, but formal inter-rater reliability (Cohen's Kappa) is not yet implemented.
- Sandbox isolation: The current sandbox uses tempfile, not Docker. A production benchmark needs proper containerization (see the sketch after this list).
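For reference, a minimal sketch of a tempfile-based sandbox of the kind described above; the real harness may structure this differently:

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

def run_tests_in_sandbox(project_dir: Path) -> bool:
    """Copy the project into a throwaway temp directory and run pytest there."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp) / project_dir.name
        shutil.copytree(project_dir, workdir)
        result = subprocess.run(
            ["python", "-m", "pytest", "-q"],
            cwd=workdir,
            capture_output=True,
            text=True,
        )
        return result.returncode == 0
```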
Despite these limitations, the core finding holds: coding agents are systematically worse at maintaining code than at writing it. And we need benchmarks that measure this gap.

DriftBench v2.0 is open-source. The complete framework — harness, grader, 4 seed projects, and all visualization tools — is available on GitHub.