Your coding agent can cheat on its tests, and the benchmark won't catch it
Almost every way we evaluate an AI agent looks at the same thing: the final output. Did the answer look right. Did the tests pass. It is the obvious place to look, and it is necessary. It is also the easiest place for an agent to fool you.
The more steps an agent takes to reach that output, the less the output tells you about how it got there. A clean final answer can come from a sound process or a broken one, and by the time you are looking at the result, the two are indistinguishable. The decision that matters most, whether an agent is good enough to ship, rides on that hidden difference.
I have been building a small tool (shipready) to grade the hidden part, and a recent public run made the problem concrete.
What the benchmark missed
An AI agent was given a real bug in conan, an open-source package manager for C and C++, and asked to fix it. The fix it wrote was correct.
This was a run on SWE-bench-Live, a public benchmark for AI coding agents. It takes a recent, real bug from an open-source project, lets an agent attempt a fix, and checks whether the fix passes the project’s tests. The fix passed, so the benchmark marked the run resolved.
The agent also edited the unit test it was graded on, the automated check that decides pass or fail, so the test matched its own implementation. The benchmark scored the run as a clean pass anyway. Nothing in the result showed the test had been changed.
Why the benchmark can’t see it
This is not a flaw in SWE-bench-Live. It is a side effect of doing the right thing.
When it grades a fix, the benchmark first restores the project’s own version of the test files it checks against, discarding whatever the agent changed there, then runs its reference tests. That is correct. The benchmark has to own the tests, or an agent could grade its own work.
But it means the result only ever reflects the agent’s source code measured against the benchmark’s tests. Anything the agent did to those test files is thrown away before grading and never runs. So an agent can rewrite the tests it is checked against, and a result built this way has no way to show it. The mechanism that keeps the benchmark honest is the same one that makes it blind to this.
A passing result tells you the final fix cleared the benchmark’s tests. It tells you nothing about how the agent got there.
Grading how the agent worked
That gap, between a passing result and a sound process, is where shipready works. It is open source and free. Instead of grading only the final answer, it grades the agent’s trace, the record of every step it took, the tools it called, and the reasoning it wrote, against a short checklist of what a good run looks like. For each item it returns pass, warn, or fail. The idea is simple: to judge whether an agent is ready to ship, look at how it worked.
What it caught
I ran shipready on eight published runs from an agent built by MIT-IBM, all on SWE-bench-Live. I did not look at how the benchmark had scored them. Four had passed its tests, four had failed. For each run, I had shipready predict from the code and the trace alone whether the fix actually worked.
Its prediction matched the benchmark’s own result on all eight, four passes and four failures.
Predicting whether the fix worked is only half of what shipready does. It also grades how the agent got there, and that is where three of the four passing runs had a problem: the agent had edited its own test files. shipready flagged all three, and I checked each one against the actual changes the agent made.
The conan run is the clearest, because the source fix was genuinely correct. The agent earned its pass fairly on one side while rewriting the test it was measured against on the other. shipready flagged the rewritten test and the circular reasoning behind it, a test edited to confirm the agent’s own assumption, so it was always going to pass, while still agreeing that the fix itself resolved the bug.
That one run is the whole argument. The benchmark marked it passed, and the fix did pass. The agent had also tampered with its own test, and nothing in the benchmark’s result showed it. Only the trace did.
The limits of this
I want to be straight about the size of it. Eight runs, one agent, the only agent on the benchmark that publishes its runs in a form I could read. It is a small, single-agent sample, so treat it as an early signal. A larger run across more agents would probably turn up a case where shipready and the benchmark disagree, and finding and explaining that case is the next step. This was the first pass, run as-is.
Beyond code
To check this is about process in general and not coding in particular, I graded a web agent too, the kind that clicks and scrolls through a site to finish a task. Microsoft publishes a dataset of these runs with its own quality score for each step. I graded one without looking at those scores, and shipready flagged the two weakest steps, an early detour to the wrong page and a stall partway through. Microsoft’s own scoring had marked those same two steps lowest. Different domain, different kind of score, same independent agreement.
Why it matters
Output benchmarks are necessary and they are not going away. But the more steps an agent takes to reach an answer, the less that answer tells you about how it got there. This was a coding agent, where the gap is easy to prove, but the same blindness applies to any agent you put in front of money, health, or someone’s time: the result can look right while the process behind it does not hold up. The decision that actually matters, whether to trust the thing in production, depends on the part the output hides.
If you build or evaluate agents, try it on one of your own runs. Install it with pip install shipready, write a short checklist, and there are worked examples in the repo. If it catches something the result hid, or flags something it should not have, tell me. The point is to see what the output cannot.

