Coding Agent Evaluation Metrics for Real Repos
The short answer: coding agent evaluation metrics should start with your own repositories, not a public leaderboard. Use public benchmarks to understand the market, then decide adoption with a private scorecard that measures task resolution, maintainer acceptance, cost per accepted task, time to usable PR, intervention rate, regressions, partial progress, and process quality.
That matters because a benchmark pass is not the same as a mergeable change. By 2026, the stronger evidence points in one direction: tests are necessary, but they are not enough to decide whether a coding agent belongs in your engineering workflow.
Why Leaderboard Scores Are Not Rollout Proof
SWE-bench changed the discussion because it moved evaluation closer to real software work. Instead of asking a model to solve a small function-level exercise, it gives an agent a GitHub repository and an issue, then checks whether the submitted patch passes fail-to-pass tests while avoiding pass-to-pass regressions.
That is useful. It is also incomplete.
SWE-bench Verified narrowed the original benchmark to 500 human-filtered instances, and its official reporting includes submitted tasks, completed tasks, resolved tasks, unresolved tasks, empty patches, errors, and resolution rate. Those metrics give buyers a cleaner comparison than many older code-generation tests.
The problem is that high public scores can become noisy. OpenAI stopped recommending SWE-bench Verified for frontier launches after finding contamination risk, flawed task design, and material test or problem issues in 59.4% of 138 frequently failed audited tasks. METR later found that roughly half of the SWE-bench Verified passing PRs written by agents from mid-2024 through mid-to-late 2025 would not be merged by maintainers. Maintainer merge rates were about 24 percentage points lower than the automated SWE-bench scores.
Implication: a public resolve rate can be a market signal. It should not be your production approval gate.
What The Main Benchmarks Actually Measure
Before choosing metrics, separate the benchmark type from the adoption question. A coding agent that performs well on issue repair may still struggle with a migration, an end-to-end project build, or a review process with strict ownership rules.
| Benchmark or approach | Evaluation unit | Useful signal | Main limitation for rollout |
|---|---|---|---|
| SWE-bench and SWE-bench Verified | GitHub issue repair | Patch resolves issue tests and avoids known regressions | Public exposure, test artifacts, and mergeability gap |
| SWE-Bench Pro | Professional repo repair tasks | Contamination-resistant resolve rate across more diverse repos | Still not your repo, team, CI, or review standard |
| Terminal-Bench 2.0 | Hard command-line tasks in unique environments | Runtime, cost, token use, failures, and resolution confidence intervals | Non-determinism and public-repo contamination risk remain |
| ProjDevBench | End-to-end project construction | Architecture, functional correctness, and refinement quality | Reported overall acceptance was 27.38%, showing the gap from issue repair to full project work |
| SWE-EVO | Release-sized software evolution | Large change behavior and partial progress through Fix Rate | Still an emerging benchmark pattern |
The common lesson is not that one benchmark is right and the others are wrong. The lesson is that each benchmark answers a narrower question than an engineering leader usually has.
Your real question is different: can this agent complete representative work in your repositories at an acceptable level of cost, latency, regression risk, maintainability, and human oversight?
Coding Agent Evaluation Metrics For Your Own Repos
A practical scorecard should combine outcome metrics, review metrics, cost metrics, time metrics, intervention metrics, and process metrics. The table below is the core set.
| Metric | What it tells you | How to read it |
|---|---|---|
| task_resolve_rate | Percentage of representative repo tasks that pass required tests | Useful baseline, but never sufficient alone |
| maintainer_acceptance_rate | Percentage of agent changes humans would merge | The best bridge between tests and production trust |
| cost_per_accepted_task | Total agent, API, and CI cost divided by accepted tasks | More meaningful than raw token spend |
| time_to_usable_pr | Elapsed time until an acceptable patch or PR exists | Captures whether the workflow helps delivery speed |
| human_intervention_rate | Approvals, interrupts, takeovers, and clarification loops | Shows how autonomous the workflow really is |
| regression_rate | Pass-to-pass failures, CI failures after merge, reverts, and incidents | Measures the hidden cost of low-quality automation |
| partial_progress | Useful failed attempts, fixed tests, or reusable scaffolding | Prevents binary scoring from undervaluing useful work |
| process_quality | Plan quality, verification coverage, recovery behavior, abstention, and atomic commits | Explains whether success is repeatable or fragile |
By comparison, a simple benchmark score answers "did the tests pass?" This scorecard answers "would we trust this workflow with more of our backlog?"
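To make the difference between the two questions concrete, here is a minimal sketch of how the first few scorecard rates might be computed from per-task records. The `TaskResult` class, its field names, and the four sample records are hypothetical and only illustrate the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """One evaluated task. All fields are illustrative assumptions."""
    task_id: str
    tests_passed: bool         # required tests passed
    maintainer_accepted: bool  # a reviewer would merge the change
    caused_regression: bool    # pass-to-pass failure, revert, or incident


def rate(flags):
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


# Hypothetical results for four tasks.
results = [
    TaskResult("t1", True, True, False),
    TaskResult("t2", True, False, False),   # passes tests, rejected in review
    TaskResult("t3", True, True, True),     # accepted, later regressed
    TaskResult("t4", False, False, False),
]

task_resolve_rate = rate(r.tests_passed for r in results)
maintainer_acceptance_rate = rate(r.maintainer_accepted for r in results)
regression_rate = rate(r.caused_regression for r in results)

print(task_resolve_rate, maintainer_acceptance_rate, regression_rate)
# prints: 0.75 0.5 0.25
```

In this made-up sample, the resolve rate is 25 points higher than the acceptance rate, which is the kind of gap a test-only score would hide.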
Build The Evaluation Set From Real Engineering Work
A repo-grounded eval should be built from historical issues, bug fixes, test failures, migrations, and small features. The tasks should resemble the work you may actually delegate.
Start with a balanced sample, not the easiest tickets in the backlog. Include documentation updates, bug fixes, test additions, feature work, refactors, dependency updates, and risky areas of the codebase. The MSR 2026 task-stratified PR study found documentation tasks had 82.1% acceptance while features had 66.1% acceptance. A global acceptance score can therefore flatter an agent that receives easier work.
For each task, record the expected behavior, repo state, allowed tools, timeout, test commands, review rubric, and the human intervention policy. If the agent can ask clarifying questions, score that as part of the workflow. If it can only produce a patch from static issue text, score that separately.
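One way to keep those fields consistent across tasks is a small record type. The sketch below is an assumption about how such a record could look: the class name `EvalTask`, the `InterventionPolicy` values, the sample task, and the `missing_fields` helper are all hypothetical.

```python
from dataclasses import dataclass, fields, replace
from enum import Enum


class InterventionPolicy(Enum):
    """Hypothetical policy labels; score each mode separately."""
    STATIC_ISSUE_ONLY = "static_issue_only"
    CLARIFYING_QUESTIONS = "clarifying_questions"


@dataclass(frozen=True)
class EvalTask:
    task_id: str
    task_type: str                 # e.g. "bug fix", "docs", "migration"
    expected_behavior: str
    repo_state: str                # e.g. the commit the task starts from
    allowed_tools: tuple[str, ...]
    timeout_seconds: int
    test_commands: tuple[str, ...]
    review_rubric: str
    intervention_policy: InterventionPolicy


def missing_fields(task: EvalTask) -> list[str]:
    """Names of fields left empty or zero, in declaration order."""
    return [f.name for f in fields(task) if not getattr(task, f.name)]


# Illustrative task; every value here is made up.
task = EvalTask(
    task_id="hist-001",
    task_type="bug fix",
    expected_behavior="Date parser accepts ISO week dates",
    repo_state="commit-before-fix",
    allowed_tools=("shell", "editor"),
    timeout_seconds=1800,
    test_commands=("pytest tests/test_dates.py",),
    review_rubric="five-decision rubric",
    intervention_policy=InterventionPolicy.STATIC_ISSUE_ONLY,
)

incomplete = replace(task, test_commands=(), timeout_seconds=0)
print(missing_fields(task))        # prints: []
print(missing_fields(incomplete))  # prints: ['timeout_seconds', 'test_commands']
```

Treating every empty value as missing is a simplification; a team that allows tasks with no tools would relax that check.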
Score Review Quality Separately From Test Results
Automated tests give you a repeatable signal. Maintainer review gives you the signal that matters before merge.
Use a review rubric that forces reviewers to separate rejection reasons. A patch may pass tests but still be rejected because it is too broad, hard to maintain, inconsistent with local style, missing edge cases, unsafe under concurrency, or solving the wrong problem. Those are different failures and should be counted differently.
A useful review rubric can use five decisions:
- Merge as is: acceptable without material changes.
- Merge after small edits: conceptually correct with minor cleanup.
- Needs revision: useful direction, but not ready.
- Reject: wrong, risky, or too costly to repair.
- Should have abstained: the agent should not have attempted the task under the given context.
That last category matters. A coding agent that knows when to stop can be safer than one that always produces a confident patch.
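The five decisions can be tallied with a short script. This is a sketch under one stated assumption: both merge outcomes count toward acceptance. The enum name, the `summarize` helper, and the sample decisions are hypothetical.

```python
from collections import Counter
from enum import Enum


class ReviewDecision(Enum):
    MERGE_AS_IS = "merge as is"
    MERGE_AFTER_SMALL_EDITS = "merge after small edits"
    NEEDS_REVISION = "needs revision"
    REJECT = "reject"
    SHOULD_HAVE_ABSTAINED = "should have abstained"


# Assumption: both merge outcomes count as accepted.
ACCEPTED = {ReviewDecision.MERGE_AS_IS, ReviewDecision.MERGE_AFTER_SMALL_EDITS}


def summarize(decisions):
    """Count each decision and compute the share that was accepted."""
    counts = Counter(decisions)
    total = sum(counts.values())
    accepted = sum(counts[d] for d in ACCEPTED)
    return {
        "counts": {d.value: counts[d] for d in ReviewDecision},
        "acceptance_rate": accepted / total if total else 0.0,
    }


# Hypothetical reviewer decisions for eight agent changes.
decisions = [
    ReviewDecision.MERGE_AS_IS,
    ReviewDecision.MERGE_AFTER_SMALL_EDITS,
    ReviewDecision.NEEDS_REVISION,
    ReviewDecision.REJECT,
    ReviewDecision.SHOULD_HAVE_ABSTAINED,
    ReviewDecision.MERGE_AS_IS,
    ReviewDecision.REJECT,
    ReviewDecision.NEEDS_REVISION,
]

summary = summarize(decisions)
print(summary["acceptance_rate"])  # prints: 0.375
```

Keeping the abstention count as its own bucket, instead of folding it into rejections, makes it visible how often the agent attempted work it should have declined.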
Measure Cost Per Accepted Task, Not Cost Per Token
Token cost is only one input. The business metric is cost per accepted task.
Include model or API cost, agent runtime, CI minutes, review time, rework time, and any incident or rollback cost tied to the change. Terminal-Bench reports cost, runtime, token usage, timeout patterns, and failure patterns because these factors change the operational answer. GitTaskBench goes further by combining success rate, token cost, and developer salaries into an economic benefit estimate.
The calculation can be simple:
cost_per_accepted_task =
(agent_cost + ci_cost + review_cost + rework_cost) / accepted_tasks
If Agent A has a higher resolve rate but needs twice as much review time, it may be worse than Agent B for your team. If Agent C fails often but leaves useful tests and scaffolding, it may still be worth using for a narrow task class. The metric has to match the workflow, not the invoice alone.
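As a worked example of the formula, the sketch below compares two agents. All dollar figures and task counts are invented for illustration; returning infinity when nothing is accepted is a design assumption, not part of the formula above.

```python
import math


def cost_per_accepted_task(agent_cost, ci_cost, review_cost, rework_cost,
                           accepted_tasks):
    """Total cost divided by accepted tasks; infinite if none were accepted."""
    if accepted_tasks <= 0:
        return math.inf
    return (agent_cost + ci_cost + review_cost + rework_cost) / accepted_tasks


# Hypothetical figures: Agent A gets more tasks accepted but needs
# twice as much review time as Agent B.
agent_a = cost_per_accepted_task(40.0, 20.0, 200.0, 40.0, accepted_tasks=12)
agent_b = cost_per_accepted_task(40.0, 20.0, 100.0, 20.0, accepted_tasks=10)

print(agent_a, agent_b)  # prints: 25.0 18.0
```

With these made-up inputs, Agent A lands at 25.0 per accepted task and Agent B at 18.0, so the agent with more accepted tasks is still the more expensive one once review time is counted.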
Track Time To Usable PR And Human Intervention
GitHub Copilot cloud-agent documentation points teams toward PR lifecycle metrics such as total PRs created and merged, Copilot-created PRs merged, and median time to merge. That is the right category of measurement because it ties agent work to delivery flow.
For internal evaluation, split timing into stages:
- Time from assignment to first patch.
- Time from first patch to passing local tests.
- Time from patch to review-ready PR.
- Time from review to accepted change.
- Time from merge to any regression or revert.
Then record human involvement at each stage. Count approvals, clarification loops, interrupts, takeovers, and manual repairs. Anthropic's agent eval guidance is relevant here because it treats an agent as the model plus the harness. A high-performing model in a weak harness may create more supervision burden than a slightly weaker model in a better workflow.
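A minimal sketch of the stage timing and intervention counts for a single task follows. The stage names, the timestamps, and the choice to measure time to usable PR from assignment to review-ready are assumptions made for illustration; the post-merge stage is left out for brevity.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical stage names, in the order they are expected to occur.
STAGES = ["assigned", "first_patch", "local_tests_pass", "review_ready", "accepted"]


def stage_durations(timestamps: dict[str, datetime]) -> dict[str, timedelta]:
    """Elapsed time between consecutive stages that were both recorded."""
    durations = {}
    for start, end in zip(STAGES, STAGES[1:]):
        if start in timestamps and end in timestamps:
            durations[f"{start}->{end}"] = timestamps[end] - timestamps[start]
    return durations


# Made-up timestamps for one task.
t0 = datetime(2026, 1, 5, 9, 0)
timestamps = {
    "assigned": t0,
    "first_patch": t0 + timedelta(minutes=20),
    "local_tests_pass": t0 + timedelta(minutes=50),
    "review_ready": t0 + timedelta(minutes=65),
    "accepted": t0 + timedelta(minutes=185),
}

durations = stage_durations(timestamps)
# Assumption: "usable PR" means the review-ready stage was reached.
time_to_usable_pr = timestamps["review_ready"] - timestamps["assigned"]

# Human involvement recorded for the same task (hypothetical event labels).
intervention_counts = Counter(["approval", "clarification", "approval", "takeover"])

print(time_to_usable_pr)                # prints: 1:05:00
print(intervention_counts["approval"])  # prints: 2
```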
Use Partial Progress Without Letting It Hide Failure
Binary pass or fail scoring is clean, but it can miss value. SWE-EVO adds Fix Rate as a partial-progress metric for release-sized software evolution, where a task may involve many files and hundreds of tests. That is closer to how large engineering work behaves.
Score partial progress only when it is useful to a human maintainer. Examples include a new failing test that captures the bug, a correct diagnosis, a narrow patch that fixes part of the issue, or scaffolding that reduces future work. Do not count noisy edits, broad rewrites, or code that creates review debt.
A simple scale works:
| Score | Meaning |
|---|---|
| 0 | No useful progress |
| 1 | Useful diagnosis or test, but no safe patch |
| 2 | Partial safe patch that reduces remaining work |
| 3 | Nearly acceptable patch with limited review fixes needed |
Keep partial progress separate from acceptance. Otherwise, a team can talk itself into shipping almost-correct changes.
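One way to enforce that separation is to report acceptance and partial progress as two distinct outputs, scoring progress only for attempts that were not accepted. The names and sample attempts below are hypothetical.

```python
from collections import Counter
from enum import IntEnum


class PartialProgress(IntEnum):
    """The 0 to 3 scale from the table above."""
    NONE = 0
    DIAGNOSIS_OR_TEST = 1
    PARTIAL_SAFE_PATCH = 2
    NEARLY_ACCEPTABLE = 3


def report(attempts):
    """attempts: list of (accepted, progress) pairs.

    Progress is only read for non-accepted attempts, so a high partial
    score can never raise the acceptance rate.
    """
    accepted = sum(1 for ok, _ in attempts if ok)
    failed_scores = [progress for ok, progress in attempts if not ok]
    return {
        "acceptance_rate": accepted / len(attempts) if attempts else 0.0,
        "partial_progress_counts": dict(Counter(p.name for p in failed_scores)),
    }


# Hypothetical attempts: two accepted, three not.
attempts = [
    (True, None),
    (False, PartialProgress.NEARLY_ACCEPTABLE),
    (False, PartialProgress.NONE),
    (False, PartialProgress.DIAGNOSIS_OR_TEST),
    (True, None),
]

result = report(attempts)
print(result["acceptance_rate"])  # prints: 0.4
```

The nearly acceptable attempt shows up in the partial-progress counts but leaves the acceptance rate at 0.4.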
Stratify Results Before You Choose A Model Or Workflow
No single global number is enough. The MSR 2026 study warned that observational PR data can confound agent ability with task assignment. In plain terms: if one agent gets mostly documentation tasks and another gets feature work, the aggregate acceptance rate is not a fair comparison.
Stratify every evaluation by at least these dimensions:
- Task type: docs, tests, bug fix, feature, migration, refactor.
- Risk level: low-risk internal change, customer-facing change, security-sensitive change.
- Change size: files touched, lines changed, test count affected.
- Repo area: owned service, shared library, legacy module, build system.
- Language and framework: Python, TypeScript, Java, Go, mobile, infrastructure.
- Interaction mode: autonomous, human-approved, interactive, shadow PR only.
This makes the decision more precise. You may find that an agent is ready for test generation and documentation, useful with approval for small bug fixes, and not ready for cross-service migrations.
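A short sketch shows how a single aggregate can hide the per-stratum picture. The record layout, the `acceptance_by` helper, and the nine sample records are hypothetical.

```python
from collections import defaultdict


def acceptance_by(records, dimension):
    """Acceptance rate per value of the chosen dimension."""
    totals = defaultdict(lambda: [0, 0])  # value -> [accepted, total]
    for record in records:
        bucket = totals[record[dimension]]
        bucket[0] += int(record["accepted"])
        bucket[1] += 1
    return {key: accepted / total for key, (accepted, total) in totals.items()}


# Made-up results: five docs tasks and four feature tasks for one agent.
records = (
    [{"task_type": "docs", "risk": "low", "accepted": a}
     for a in (True, True, True, True, False)]
    + [{"task_type": "feature", "risk": "high", "accepted": a}
       for a in (True, False, True, False)]
)

overall = sum(r["accepted"] for r in records) / len(records)
by_type = acceptance_by(records, "task_type")

print(round(overall, 3))  # prints: 0.667
print(by_type)            # prints: {'docs': 0.8, 'feature': 0.5}
```

The same helper can be called with `"risk"` or any other recorded dimension, so one set of records supports every cut in the list above.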
A Practical Rollout Model
The rollout should move from comparable public evidence to private evidence, then to limited production use.
- Use public benchmarks as a first filter. SWE-Bench Pro, Terminal-Bench, ProjDevBench, and similar evals can identify credible systems and expose broad weaknesses.
- Run a private repo eval. Use historical tasks with known outcomes, representative tests, and a human review rubric.
- Require maintainer scoring. Count mergeability, rejection reasons, review time, and abstention quality.
- Run shadow PRs. Let the agent produce changes without merging them. Compare against human fixes where possible.
- Open a narrow pilot. Start with task classes where the scorecard shows clear value and low regression risk.
- Monitor after merge. Track CI failures, reverts, incidents, latency, cost per accepted task, and intervention rate over time.
The Scorecard To Use
For each agent or agent workflow, record the following:
| Category | Decision metric |
|---|---|
| Correctness | Task resolve rate, pass-to-pass regression rate, CI pass rate |
| Review | Maintainer acceptance rate, rejection reasons, review time |
| Economics | Cost per accepted task, CI cost, rework cost |
| Speed | Time to usable PR, median time to merge |
| Oversight | Human intervention rate, clarification loops, takeovers |
| Reliability | Reverts, incidents, flaky-test interactions, timeout rate |
| Process | Plan quality, verification coverage, recovery behavior, abstention |
The strongest adoption signal is not a single high score. It is a stable pattern: the agent solves the right task classes, maintainers accept the work, regressions stay low, cost is predictable, and human supervision falls instead of growing.
That is the practical standard for coding agent evaluation metrics. Public benchmarks tell you who deserves a closer look. Your own scorecard tells you who deserves repository access.