Why AI Coding Agents Fail at Long-Term Maintenance
Measuring code entropy resistance with DriftBench v2.0
I've been using AI coding agents as a primary collaborator for almost two years — not as autocomplete, but as something closer to an engineer I hand a task to and come back to.
The failure mode I kept running into wasn't the obvious one. The agents weren't bad at writing code. They were bad at not breaking the code they'd already written.
The clearest example came with Codex. Its "autonomous mode" was the pitch: hand it a refactoring task, walk away, come back to a cleaner codebase. And it delivered — the authentication module was restructured, tests passed, everything looked correct. Three days later, a settings page had silently stopped saving user preferences. The agent had modified a shared utility function mid-refactor, never ran the unrelated tests, and moved on. No crash. No error. I only found out because a user told me.
The second failure was subtler. I was building a website — clean, minimal, nothing unusual. Around the third or fourth session, something shifted. The components I'd built early on were tight and consistent. The new ones had purple gradients, drop shadows stacked on drop shadows, Inter font, three icon cards arranged in a grid. The kind of layout you'd get if you averaged every Tailwind tutorial on GitHub from 2020 to 2024. Without explicit style constraints, the agent stopped making design decisions and started regressing to its training data's statistical average.
Both failures share the same root cause: AI coding agents are optimized to complete the task in front of them. They are not optimized to preserve the system they're operating inside. And if an agent just piles on if/else statements, or quietly rewrites a shared utility, or drifts toward its training average — the codebase degrades. I call this Code Entropy. And there was no benchmark that measured it.
So I built one.
The Problem with Snapshots
Here's the thing about current coding benchmarks: they hand an agent a bug report, a codebase, and ask — can you fix it? If the tests pass, the agent gets a point.
This is the equivalent of evaluating a chef by asking them to chop a single onion, rather than asking them to run a kitchen for a week. It misses the cumulative effect of decisions. When an agent fixes a bug by adding a global variable, it passes the test today. But tomorrow, when another agent tries to add a feature, that global variable causes a cascade of failures.
We need a way to measure not just if the agent can write code, but whether the codebase gets better or worse over time. We need to measure Entropy Resistance.
DriftBench v2.0
DriftBench is a benchmark designed to evaluate AI agents on continuous software evolution. Instead of single, isolated tasks, DriftBench evaluates agents on a Task Chain: a sequence of five consecutive tasks (feature additions, bug fixes, and refactoring requests) on the exact same codebase.
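To make the task-chain idea concrete, here is a minimal sketch of how one chain might be represented. The Task structure and the task descriptions are illustrative, not DriftBench's actual format:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    kind: str                    # "feature", "bugfix", or "refactor"
    prompt: str                  # instruction handed to the agent
    new_tests: list[str] = field(default_factory=list)  # tests that must pass after this step

# A hypothetical 5-step chain for a TODO-manager project.
CHAIN = [
    Task("feature", "Add due dates to TODO items", ["test_due_dates.py"]),
    Task("bugfix", "Fix duplicate IDs when items are added in quick succession", ["test_unique_ids.py"]),
    Task("feature", "Add tags and tag-based filtering", ["test_tags.py"]),
    Task("refactor", "Replace module-level global state with a TodoStore class"),
    Task("feature", "Persist the TodoStore to disk between runs", ["test_persistence.py"]),
]
```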
v2.0 is a ground-up rewrite. It now evaluates across 4 seed projects, 3 LLM models, and 7 scoring dimensions:
- Functional Correctness (25%): Does the new code pass the new tests?
- Regression Resistance (25%): Does the new code break previously passing tests?
- Entropy Resistance (10%): How much did the Cyclomatic Complexity grow?
- Structural Erosion (10%): Did the code degrade from many small functions to one monolith?
- Architectural Consistency (10%): Does the code feel like it was written by one author?
- Refactor Awareness (10%): Did the agent proactively manage technical debt?
- Engineering Taste (10%): Are the variable names, error handling, and abstractions sensible?
The first four are measured deterministically via test suites and static analysis (Cyclomatic Complexity via radon, AST parsing for structural metrics). The last three use multi-model LLM-as-a-Judge cross-validation for reliability.
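As an illustration, the deterministic side of the scoring can be sketched in a few lines. The weights mirror the percentages above, but the function names and the exact aggregation are assumptions on my part, not the real grader:

```python
from radon.complexity import cc_visit

def total_cyclomatic_complexity(source: str) -> int:
    # Sum of cyclomatic complexity over every function and class radon finds.
    return sum(block.complexity for block in cc_visit(source))

# Dimension weights, matching the breakdown listed above.
WEIGHTS = {
    "functional_correctness": 0.25,
    "regression_resistance": 0.25,
    "entropy_resistance": 0.10,
    "structural_erosion": 0.10,
    "architectural_consistency": 0.10,
    "refactor_awareness": 0.10,
    "engineering_taste": 0.10,
}

def composite_score(dimension_scores: dict[str, float]) -> float:
    # Assumes each dimension score is already normalized to 0-100.
    return sum(WEIGHTS[d] * dimension_scores[d] for d in WEIGHTS)
```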
Key Technical Improvements in v2.0
| Feature | v1.0 | v2.0 |
|---|---|---|
| Seed Projects | 1 | 4 |
| Test Granularity | File-level | Test-case-level |
| Entropy Tracking | Head-tail only | Per-step trajectory |
| Test Isolation | Visible to agent | Hidden from sandbox |
| LLM Judge | Single model | Multi-model cross-validation |
| Models Tested | 1 | 3 (gpt-4.1-mini, gpt-4.1-nano, gemini-2.5-flash) |
| Scoring Dimensions | 6 | 7 |
The Experiment
I ran 16 experiments: 4 seed projects × 4 agents (Naive Baseline + 3 LLM models). Each seed project goes through a 5-step task chain: feature addition, bug fix, feature addition, refactoring, and evolution.

The four seed projects span a range of complexity:
- todo_api: A REST-like TODO manager with global state → class refactoring (see the sketch after this list)
- calculator: A calculator with history tracking — simple state, pure functions
- markdown_parser: A Markdown-to-HTML converter with regex conflicts
- file_manager: An in-memory file system with tree structures and path normalization
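For a sense of the starting point, here is a hypothetical flavor of what the todo_api seed looks like before the refactoring step: module-level global state that the agent is later asked to fold into a class. The names and structure are illustrative, not the actual project code:

```python
# Global state that Step 4 asks the agent to replace with a class.
TODOS: dict[int, dict] = {}
_next_id = 1

def add_todo(title: str) -> int:
    global _next_id
    todo_id = _next_id
    _next_id += 1
    TODOS[todo_id] = {"title": title, "done": False}
    return todo_id

def complete_todo(todo_id: int) -> None:
    TODOS[todo_id]["done"] = True
```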

The Refactor Trap
The most striking finding is what I call The Refactor Trap. In 3 out of 4 seed projects, every LLM agent suffered 50–100% regression rates at Step 4 — the refactoring step — even when it had passed all previous steps perfectly.

This isn't a single-task anomaly. The pattern is consistent: agents can add features and fix bugs incrementally, but when asked to restructure the code (e.g., moving from global functions to a class), they lose track of the implicit contracts between components. The refactoring itself is often correct in isolation — the new class structure works — but the agent forgets to update how existing features interact with the new architecture.
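The regression numbers above come down to a simple ratio. Here is one straightforward way to compute such a rate, which may not be the exact formula the grader uses:

```python
def regression_rate(passing_before: set[str], passing_after: set[str]) -> float:
    """Fraction of tests that passed before a step but fail after it."""
    if not passing_before:
        return 0.0
    broken = passing_before - passing_after
    return len(broken) / len(passing_before)

# Example: if 4 of 8 previously passing tests break at the refactor step,
# the regression rate for that step is 0.5.
```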
Model Comparison
Running three different models reveals that "entropy resistance" is orthogonal to raw coding ability.

| Seed Project | gpt-4.1-mini | gpt-4.1-nano | gemini-2.5-flash | Naive Baseline |
|---|---|---|---|---|
| todo_api | 62.6 | 56.4 | 65.1 | 66.2 |
| calculator | 84.7 | 81.9 | 84.6 | 53.8 |
| markdown_parser | 73.1 | 73.0 | 55.1 | 51.9 |
| file_manager | 48.8 | 50.9 | 43.9 | 50.8 |
| Average | 67.3 | 65.6 | 62.2 | 55.7 |
gemini-2.5-flash matches gpt-4.1-mini on simple tasks (calculator: 84.6 vs 84.7) but collapses on complex ones (file_manager: 43.9, with a 100% regression rate). This suggests that a model's ability to resist entropy is a separate capability from its ability to solve isolated coding problems.

Entropy Trajectory
A key improvement in v2.0 is per-step entropy tracking. Instead of only measuring complexity at the start and end of the task chain, DriftBench now records Cyclomatic Complexity and Structural Erosion after every step.

The trajectory reveals a nuance that aggregate metrics miss: refactoring can reduce complexity (the CC drops at Step 4) while simultaneously increasing structural erosion (the code becomes more monolithic). This is exactly the kind of trade-off that DriftBench is designed to surface.
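A rough sketch of what per-step recording can look like is below. The structural-erosion proxy here (average function length via the ast module) is my simplification, not the benchmark's actual metric:

```python
import ast
from radon.complexity import cc_visit

def snapshot(source: str) -> dict[str, float]:
    """Metrics recorded after each step of the task chain."""
    tree = ast.parse(source)
    funcs = [n for n in ast.walk(tree)
             if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
    avg_len = (sum(n.end_lineno - n.lineno + 1 for n in funcs) / len(funcs)
               if funcs else 0.0)
    return {
        "cyclomatic_complexity": float(sum(b.complexity for b in cc_visit(source))),
        "avg_function_length": avg_len,   # crude monolith indicator
        "function_count": float(len(funcs)),
    }

# trajectory = [snapshot(source_after_step) for source_after_step in chain_snapshots]
```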
The Naive Baseline Paradox
One of the most counterintuitive findings: on todo_api, the Naive Baseline (which simply appends code without modifying existing logic) scores higher than all three LLM agents (66.2 vs 62.6/56.4/65.1).
This happens because the Naive Baseline has a 0% regression rate — it never touches existing code, so it never breaks anything. The LLM agents, by contrast, attempt the refactoring and break 40–56% of previous tests.
This paradox is exactly why multi-dimensional scoring matters. A single metric (pass rate or regression rate alone) would give a misleading picture. The Naive Baseline "wins" on regression resistance but produces an unreadable, unmaintainable monolith. The LLM agents write better code but can't preserve backward compatibility during structural changes.
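For context, the baseline's behavior is essentially this (a simplified sketch of what it does, not its actual implementation):

```python
def naive_baseline_step(existing_source: str, new_code: str) -> str:
    # Append-only "agent": it never edits existing lines, so it can never
    # break a previously passing test, but the file grows into a monolith.
    return existing_source.rstrip("\n") + "\n\n" + new_code + "\n"
```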
Limitations and Future Work
DriftBench v2.0 is still a proof-of-concept with clear limitations:
- Scale: 4 seed projects is better than 1, but still not enough for statistical significance. The target is 20+.
- Agent realism: The current LLM agents are single-turn. Real agents (Cursor, Devin, OpenHands) use multi-turn interaction, tool use, and self-repair loops.
- LLM Judge reliability: Multi-model cross-validation helps, but formal inter-rater reliability (Cohen's Kappa) is not yet implemented.
- Sandbox isolation: The current sandbox uses tempfile, not Docker. A production benchmark needs proper containerization (see the sketch after this list).
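For reference, a minimal sketch of a tempfile-based sandbox of the kind described above; the real harness may structure this differently:

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

def run_tests_in_sandbox(project_dir: Path) -> bool:
    """Copy the project into a throwaway temp directory and run pytest there."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp) / project_dir.name
        shutil.copytree(project_dir, workdir)
        result = subprocess.run(
            ["python", "-m", "pytest", "-q"],
            cwd=workdir,
            capture_output=True,
            text=True,
        )
        return result.returncode == 0
```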
Despite these limitations, the core finding holds: coding agents are systematically worse at maintaining code than at writing it. And we need benchmarks that measure this gap.

DriftBench v2.0 is open-source. The complete framework — harness, grader, 4 seed projects, and all visualization tools — is available on GitHub.