
Coding Agent Evaluation Metrics for Real Repos

The short answer: coding agent evaluation metrics should start with your own repositories, not a public leaderboard. Use public benchmarks to understand the market, then decide adoption with a private scorecard that measures task resolution, maintainer acceptance, cost per accepted task, time to usable PR, intervention rate, regressions, partial progress, and process quality.

That matters because a benchmark pass is not the same as a mergeable change. By 2026, the stronger evidence points in one direction: tests are necessary, but they are not enough to decide whether a coding agent belongs in your engineering workflow.

Why Leaderboard Scores Are Not Rollout Proof

SWE-bench changed the discussion because it moved evaluation closer to real software work. Instead of asking a model to solve a small function-level exercise, it gives an agent a GitHub repository and an issue, then checks whether the submitted patch passes fail-to-pass tests while avoiding pass-to-pass regressions.
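The pass criterion described above can be sketched in a few lines. This is an illustration of the idea, not the official SWE-bench harness; the function name and the input shape (test name mapped to pass or fail) are assumptions.

```python
def is_resolved(fail_to_pass: dict[str, bool], pass_to_pass: dict[str, bool]) -> bool:
    """Return True only if every target test now passes and nothing regressed.

    fail_to_pass: tests that failed before the patch and should pass after it.
    pass_to_pass: tests that passed before the patch and must keep passing.
    Both map a test name to its post-patch result (hypothetical input shape).
    """
    if not fail_to_pass:
        # With no target tests there is nothing to prove the issue was fixed.
        return False
    return all(fail_to_pass.values()) and all(pass_to_pass.values())


fixed_cleanly = is_resolved({"test_bug": True}, {"test_existing": True})
fixed_with_regression = is_resolved({"test_bug": True}, {"test_existing": False})
print(fixed_cleanly, fixed_with_regression)
```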

That is useful. It is also incomplete.

SWE-bench Verified narrowed the original benchmark to 500 human-filtered instances, and its official reporting includes submitted tasks, completed tasks, resolved tasks, unresolved tasks, empty patches, errors, and resolution rate. Those metrics give buyers a cleaner comparison than many older code-generation tests.

The problem is that high public scores can become noisy. OpenAI stopped recommending SWE-bench Verified for frontier launches after finding contamination risk, flawed task design, and material test or problem issues in 59.4% of 138 frequently failed audited tasks. METR later found that roughly half of the SWE-bench Verified passing PRs produced by agents from mid-2024 through mid-to-late 2025 would not be merged by maintainers. Maintainer merge rates were about 24 percentage points lower than the automated SWE-bench scores.

Implication: a public resolve rate can be a market signal. It should not be your production approval gate.

What The Main Benchmarks Actually Measure

Before choosing metrics, separate the benchmark type from the adoption question. A coding agent that performs well on issue repair may still struggle with a migration, an end-to-end project build, or a review process with strict ownership rules.

| Benchmark or approach | Evaluation unit | Useful signal | Main limitation for rollout |
| --- | --- | --- | --- |
| SWE-bench and SWE-bench Verified | GitHub issue repair | Patch resolves issue tests and avoids known regressions | Public exposure, test artifacts, and mergeability gap |
| SWE-Bench Pro | Professional repo repair tasks | Contamination-resistant resolve rate across more diverse repos | Still not your repo, team, CI, or review standard |
| Terminal-Bench 2.0 | Hard command-line tasks in unique environments | Runtime, cost, token use, failures, and resolution confidence intervals | Non-determinism and public-repo contamination risk remain |
| ProjDevBench | End-to-end project construction | Architecture, functional correctness, and refinement quality | Reported overall acceptance was 27.38%, showing the gap from issue repair to full project work |
| SWE-EVO | Release-sized software evolution | Large change behavior and partial progress through Fix Rate | Still an emerging benchmark pattern |

The common lesson is not that one benchmark is right and the others are wrong. The lesson is that each benchmark answers a narrower question than an engineering leader usually has.

Your real question is different: can this agent complete representative work in your repositories at an acceptable level of cost, latency, regression risk, maintainability, and human oversight?

Coding Agent Evaluation Metrics For Your Own Repos

A practical scorecard should combine outcome metrics, review metrics, cost metrics, time metrics, intervention metrics, and process metrics. The table below is the core set.

| Metric | What it tells you | How to read it |
| --- | --- | --- |
| task_resolve_rate | Percentage of representative repo tasks that pass required tests | Useful baseline, but never sufficient alone |
| maintainer_acceptance_rate | Percentage of agent changes humans would merge | The best bridge between tests and production trust |
| cost_per_accepted_task | Total agent, API, and CI cost divided by accepted tasks | More meaningful than raw token spend |
| time_to_usable_pr | Elapsed time until an acceptable patch or PR exists | Captures whether the workflow helps delivery speed |
| human_intervention_rate | Approvals, interrupts, takeovers, and clarification loops | Shows how autonomous the workflow really is |
| regression_rate | Pass-to-pass failures, CI failures after merge, reverts, and incidents | Measures the hidden cost of low-quality automation |
| partial_progress | Useful failed attempts, fixed tests, or reusable scaffolding | Prevents binary scoring from undervaluing useful work |
| process_quality | Plan quality, verification coverage, recovery behavior, abstention, and atomic commits | Explains whether success is repeatable or fragile |

By comparison, a simple benchmark score answers "did the tests pass?" This scorecard answers "would we trust this workflow with more of our backlog?"
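As a minimal sketch, the rate-style metrics in the scorecard could be computed from per-task records like this. The `TaskResult` fields are an assumed record shape, and treating the intervention rate as "share of tasks with at least one intervention" is one possible definition among several.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """One agent attempt on one eval task (hypothetical record shape)."""
    task_id: str
    tests_passed: bool         # required tests pass
    maintainer_accepted: bool  # a human reviewer would merge the change
    caused_regression: bool    # pass-to-pass failure, revert, or incident
    interventions: int         # approvals, interrupts, takeovers, clarifications


def rate(flags: list[bool]) -> float:
    """Share of True values; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0


def scorecard(results: list[TaskResult]) -> dict[str, float]:
    return {
        "task_resolve_rate": rate([r.tests_passed for r in results]),
        "maintainer_acceptance_rate": rate([r.maintainer_accepted for r in results]),
        "regression_rate": rate([r.caused_regression for r in results]),
        "human_intervention_rate": rate([r.interventions > 0 for r in results]),
    }


# Illustrative data only.
results = [
    TaskResult("T1", True, True, False, 0),
    TaskResult("T2", True, False, False, 2),
    TaskResult("T3", False, False, False, 1),
    TaskResult("T4", True, True, True, 0),
]
card = scorecard(results)
print(card)
```

In this made-up sample the resolve rate (0.75) sits well above the acceptance rate (0.5), which is the gap the scorecard is meant to expose.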

Build The Evaluation Set From Real Engineering Work

A repo-grounded eval should be built from historical issues, bug fixes, test failures, migrations, and small features. The tasks should resemble the work you may actually delegate.

Start with a balanced sample, not the easiest tickets in the backlog. Include documentation updates, bug fixes, test additions, feature work, refactors, dependency updates, and risky areas of the codebase. The MSR 2026 task-stratified PR study found documentation tasks had 82.1% acceptance while features had 66.1% acceptance. A global acceptance score can therefore flatter an agent that receives easier work.

For each task, record the expected behavior, repo state, allowed tools, timeout, test commands, review rubric, and the human intervention policy. If the agent can ask clarifying questions, score that as part of the workflow. If it can only produce a patch from static issue text, score that separately.
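One way to make that record concrete is a small task specification. The sketch below assumes a hypothetical `EvalTask` structure; every field name and example value is illustrative rather than taken from an existing harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalTask:
    """Everything needed to replay and score one task (hypothetical schema)."""
    task_id: str
    task_type: str                 # e.g. "bug_fix", "docs", "migration"
    repo_commit: str               # repo state the agent starts from
    expected_behavior: str
    test_commands: tuple[str, ...]
    allowed_tools: tuple[str, ...]
    timeout_seconds: int
    review_rubric: str             # name or path of the rubric reviewers use
    may_ask_questions: bool        # part of the human intervention policy

    def track(self) -> str:
        """Interactive and static-issue runs are scored in separate tracks."""
        return "interactive" if self.may_ask_questions else "static_issue"


# Illustrative values only.
task = EvalTask(
    task_id="example-001",
    task_type="bug_fix",
    repo_commit="<commit-sha>",
    expected_behavior="Retry logic stops after the configured maximum attempts.",
    test_commands=("pytest tests/test_retry.py",),
    allowed_tools=("shell", "file_edit"),
    timeout_seconds=1800,
    review_rubric="five-decision rubric",
    may_ask_questions=False,
)
print(task.track())
```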

Score Review Quality Separately From Test Results

Automated tests give you a repeatable signal. Maintainer review gives you the signal that matters before merge.

Use a review rubric that forces reviewers to separate rejection reasons. A patch may pass tests but still be rejected because it is too broad, hard to maintain, inconsistent with local style, missing edge cases, unsafe under concurrency, or solving the wrong problem. Those are different failures and should be counted differently.

A useful review rubric can use five decisions:

  1. Merge as is: acceptable without material changes.
  2. Merge after small edits: conceptually correct with minor cleanup.
  3. Needs revision: useful direction, but not ready.
  4. Reject: wrong, risky, or too costly to repair.
  5. Should have abstained: the agent should not have attempted the task under the given context.

That last category matters. A coding agent that knows when to stop can be safer than one that always produces a confident patch.
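The five decisions can be tallied with a short script. This is a sketch under two assumptions: the first two decisions count as "accepted", and each non-accepted review carries one free-text reason. Names such as `Decision` and `summarize` are hypothetical.

```python
from collections import Counter
from enum import Enum
from typing import Optional


class Decision(Enum):
    MERGE_AS_IS = 1
    MERGE_AFTER_SMALL_EDITS = 2
    NEEDS_REVISION = 3
    REJECT = 4
    SHOULD_HAVE_ABSTAINED = 5


# Assumption: small edits still count as an accepted change.
ACCEPTED = {Decision.MERGE_AS_IS, Decision.MERGE_AFTER_SMALL_EDITS}


def summarize(reviews: list[tuple[Decision, Optional[str]]]) -> dict:
    """Count decisions and keep rejection reasons as separate buckets."""
    decisions = Counter(decision for decision, _ in reviews)
    reasons = Counter(
        reason for decision, reason in reviews
        if decision not in ACCEPTED and reason
    )
    accepted = sum(decisions[d] for d in ACCEPTED)
    return {
        "acceptance_rate": accepted / len(reviews) if reviews else 0.0,
        "decisions": decisions,
        "rejection_reasons": reasons,
    }


# Illustrative data only.
reviews = [
    (Decision.MERGE_AS_IS, None),
    (Decision.MERGE_AFTER_SMALL_EDITS, None),
    (Decision.NEEDS_REVISION, "missing edge cases"),
    (Decision.REJECT, "too broad"),
    (Decision.SHOULD_HAVE_ABSTAINED, "insufficient context"),
]
summary = summarize(reviews)
print(summary["acceptance_rate"], dict(summary["rejection_reasons"]))
```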

Measure Cost Per Accepted Task, Not Cost Per Token

Token cost is only one input. The business metric is cost per accepted task.

Include model or API cost, agent runtime, CI minutes, review time, rework time, and any incident or rollback cost tied to the change. Terminal-Bench reports cost, runtime, token usage, timeout patterns, and failure patterns because these factors change the operational answer. GitTaskBench goes further by combining success rate, token cost, and developer salaries into an economic benefit estimate.

The calculation can be simple:

cost_per_accepted_task =
  (agent_cost + ci_cost + review_cost + rework_cost) / accepted_tasks

If Agent A has a higher resolve rate but needs twice as much review time, it may be worse than Agent B for your team. If Agent C fails often but leaves useful tests and scaffolding, it may still be worth using for a narrow task class. The metric has to match the workflow, not the invoice alone.
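A worked example of the formula, using made-up dollar figures, shows how review cost can reverse a ranking. The function mirrors the calculation above; the guard for zero accepted tasks is an added assumption.

```python
def cost_per_accepted_task(
    agent_cost: float,
    ci_cost: float,
    review_cost: float,
    rework_cost: float,
    accepted_tasks: int,
) -> float:
    """Total cost of the workflow divided by the number of accepted tasks."""
    if accepted_tasks == 0:
        # Assumption: money spent with nothing accepted is treated as infinite cost.
        return float("inf")
    return (agent_cost + ci_cost + review_cost + rework_cost) / accepted_tasks


# Hypothetical numbers: Agent A gets more tasks accepted but needs twice the review time.
agent_a = cost_per_accepted_task(
    agent_cost=120.0, ci_cost=40.0, review_cost=600.0, rework_cost=100.0,
    accepted_tasks=20,
)
agent_b = cost_per_accepted_task(
    agent_cost=120.0, ci_cost=40.0, review_cost=300.0, rework_cost=100.0,
    accepted_tasks=16,
)
print(agent_a, agent_b)
```

With these inputs Agent A costs 43.0 per accepted task and Agent B costs 35.0, even though Agent A has more accepted tasks.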

Track Time To Usable PR And Human Intervention

GitHub Copilot cloud-agent documentation points teams toward PR lifecycle metrics such as total PRs created and merged, Copilot-created PRs merged, and median time to merge. That is the right category of measurement because it ties agent work to delivery flow.

For internal evaluation, split timing into stages:

  1. Time from assignment to first patch.
  2. Time from first patch to passing local tests.
  3. Time from patch to review-ready PR.
  4. Time from review to accepted change.
  5. Time from merge to any regression or revert.

Then record human involvement at each stage. Count approvals, clarification loops, interrupts, takeovers, and manual repairs. Anthropic's agent eval guidance is relevant here because it treats an agent as the model plus the harness. A high-performing model in a weak harness may create more supervision burden than a slightly weaker model in a better workflow.
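The stage split can be derived from timestamps that most PR tooling already records. The sketch below covers the first four stages and assumes hypothetical event names; the fifth stage (merge to regression or revert) would come from post-merge monitoring instead.

```python
from datetime import datetime

# Event names are assumptions that mirror the first four stages listed above.
STAGES = ["assigned", "first_patch", "tests_passing", "review_ready", "accepted"]


def stage_durations(events: dict[str, datetime]) -> dict[str, float]:
    """Minutes between each pair of consecutive stages that were both recorded."""
    durations = {}
    for start, end in zip(STAGES, STAGES[1:]):
        if start in events and end in events:
            minutes = (events[end] - events[start]).total_seconds() / 60
            durations[f"{start}->{end}"] = minutes
    return durations


# Illustrative timestamps only.
events = {
    "assigned": datetime(2026, 1, 5, 9, 0),
    "first_patch": datetime(2026, 1, 5, 9, 20),
    "tests_passing": datetime(2026, 1, 5, 9, 50),
    "review_ready": datetime(2026, 1, 5, 10, 5),
    "accepted": datetime(2026, 1, 5, 13, 5),
}
durations = stage_durations(events)

# Here "usable PR" is taken to mean the review-ready point, which is an assumption.
time_to_usable_pr = (events["review_ready"] - events["assigned"]).total_seconds() / 60
print(durations, time_to_usable_pr)
```

In this sample the agent reaches a review-ready PR in 65 minutes, while the wait for human acceptance takes 180 minutes, so the bottleneck is review rather than generation.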

Use Partial Progress Without Letting It Hide Failure

Binary pass or fail scoring is clean, but it can miss value. SWE-EVO adds Fix Rate as a partial-progress metric for release-sized software evolution, where a task may involve many files and hundreds of tests. That is closer to how large engineering work behaves.

Score partial progress only when it is useful to a human maintainer. Examples include a new failing test that captures the bug, a correct diagnosis, a narrow patch that fixes part of the issue, or scaffolding that reduces future work. Do not count noisy edits, broad rewrites, or code that creates review debt.

A simple scale works:

| Score | Meaning |
| --- | --- |
| 0 | No useful progress |
| 1 | Useful diagnosis or test, but no safe patch |
| 2 | Partial safe patch that reduces remaining work |
| 3 | Nearly acceptable patch with limited review fixes needed |

Keep partial progress separate from acceptance. Otherwise, a team can talk itself into shipping almost-correct changes.
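A small sketch of that separation: acceptance is reported on its own, and the partial-progress scale is averaged only over attempts that were not accepted. The `Progress` enum and the `report` helper are hypothetical names.

```python
from enum import IntEnum
from typing import Optional


class Progress(IntEnum):
    NONE = 0
    DIAGNOSIS_OR_TEST = 1
    PARTIAL_SAFE_PATCH = 2
    NEARLY_ACCEPTABLE = 3


def report(attempts: list[tuple[bool, Optional[Progress]]]) -> dict[str, float]:
    """Each attempt is (accepted, progress); progress is None when accepted.

    Progress is averaged over non-accepted attempts only, so it can never
    raise the acceptance rate.
    """
    failed = [int(p) for accepted, p in attempts if not accepted and p is not None]
    return {
        "acceptance_rate": sum(a for a, _ in attempts) / len(attempts) if attempts else 0.0,
        "mean_progress_on_failures": sum(failed) / len(failed) if failed else 0.0,
    }


# Illustrative data only.
attempts = [
    (True, None),
    (False, Progress.DIAGNOSIS_OR_TEST),
    (False, Progress.NEARLY_ACCEPTABLE),
    (False, Progress.NONE),
]
result = report(attempts)
print(result)
```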

Stratify Results Before You Choose A Model Or Workflow

No single global number is enough. The MSR 2026 study warned that observational PR data can confound agent ability with task assignment. In plain terms: if one agent gets mostly documentation tasks and another gets feature work, the aggregate acceptance rate is not a fair comparison.

Stratify every evaluation by at least these dimensions:

  • Task type: docs, tests, bug fix, feature, migration, refactor.
  • Risk level: low-risk internal change, customer-facing change, security-sensitive change.
  • Change size: files touched, lines changed, test count affected.
  • Repo area: owned service, shared library, legacy module, build system.
  • Language and framework: Python, TypeScript, Java, Go, mobile, infrastructure.
  • Interaction mode: autonomous, human-approved, interactive, shadow PR only.

This makes the decision more precise. You may find that an agent is ready for test generation and documentation, useful with approval for small bug fixes, and not ready for cross-service migrations.
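The confounding problem is easy to reproduce with invented numbers. In the sketch below, hypothetical Agent Y beats Agent X on both docs and feature tasks, yet X has the higher aggregate acceptance rate because it was assigned mostly docs work. The record shape and helper names are assumptions.

```python
from collections import defaultdict
from typing import Callable, Hashable


def acceptance_by(records: list[dict], key: Callable[[dict], Hashable]) -> dict:
    """Acceptance rate per stratum, where `key` maps a record to its stratum."""
    totals = defaultdict(lambda: [0, 0])  # stratum -> [accepted, total]
    for record in records:
        bucket = totals[key(record)]
        bucket[0] += record["accepted"]
        bucket[1] += 1
    return {stratum: accepted / total for stratum, (accepted, total) in totals.items()}


def make(agent: str, task_type: str, accepted: int, total: int) -> list[dict]:
    """Build `total` synthetic records, the first `accepted` of them accepted."""
    return [
        {"agent": agent, "task_type": task_type, "accepted": i < accepted}
        for i in range(total)
    ]


# Invented data: X is assigned mostly docs tasks, Y mostly feature work.
records = (
    make("X", "docs", 7, 8) + make("X", "feature", 1, 2)
    + make("Y", "docs", 2, 2) + make("Y", "feature", 5, 8)
)

overall = acceptance_by(records, lambda r: r["agent"])
by_type = acceptance_by(records, lambda r: (r["agent"], r["task_type"]))
print(overall)
print(by_type)
```

The aggregate view ranks X (0.8) above Y (0.7), while the stratified view ranks Y higher in both task types (1.0 vs 0.875 on docs, 0.625 vs 0.5 on features).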

A Practical Rollout Model

The rollout should move from comparable public evidence to private evidence, then to limited production use.

  1. Use public benchmarks as a first filter. SWE-Bench Pro, Terminal-Bench, ProjDevBench, and similar evals can identify credible systems and expose broad weaknesses.
  2. Run a private repo eval. Use historical tasks with known outcomes, representative tests, and a human review rubric.
  3. Require maintainer scoring. Count mergeability, rejection reasons, review time, and abstention quality.
  4. Run shadow PRs. Let the agent produce changes without merging them. Compare against human fixes where possible.
  5. Open a narrow pilot. Start with task classes where the scorecard shows clear value and low regression risk.
  6. Monitor after merge. Track CI failures, reverts, incidents, latency, cost per accepted task, and intervention rate over time.
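Step 5 can be expressed as a simple gate over stratified scores. The thresholds below are placeholders chosen for illustration, not recommendations, and the `StratumScore` structure is hypothetical; each team would set its own limits.

```python
from dataclasses import dataclass


@dataclass
class StratumScore:
    """Private-eval results for one task class (hypothetical shape)."""
    task_type: str
    acceptance_rate: float
    regression_rate: float
    sample_size: int


# Placeholder thresholds; tune these to your own risk tolerance.
MIN_ACCEPTANCE = 0.80
MAX_REGRESSION = 0.02
MIN_SAMPLES = 30


def pilot_candidates(scores: list[StratumScore]) -> list[str]:
    """Task classes with enough evidence, high acceptance, and low regression."""
    return [
        s.task_type
        for s in scores
        if s.sample_size >= MIN_SAMPLES
        and s.acceptance_rate >= MIN_ACCEPTANCE
        and s.regression_rate <= MAX_REGRESSION
    ]


# Illustrative data only.
scores = [
    StratumScore("docs", 0.90, 0.00, 40),
    StratumScore("bug_fix", 0.85, 0.05, 50),   # regression rate too high
    StratumScore("migration", 0.95, 0.00, 5),  # too few samples to trust
]
candidates = pilot_candidates(scores)
print(candidates)
```

The sample-size check matters as much as the rates: a task class with five evaluated tasks has not earned a pilot, however good the numbers look.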

The Scorecard To Use

For each agent or agent workflow, record the following:

| Category | Decision metric |
| --- | --- |
| Correctness | Task resolve rate, pass-to-pass regression rate, CI pass rate |
| Review | Maintainer acceptance rate, rejection reasons, review time |
| Economics | Cost per accepted task, CI cost, rework cost |
| Speed | Time to usable PR, median time to merge |
| Oversight | Human intervention rate, clarification loops, takeovers |
| Reliability | Reverts, incidents, flaky-test interactions, timeout rate |
| Process | Plan quality, verification coverage, recovery behavior, abstention |

The strongest adoption signal is not a single high score. It is a stable pattern: the agent solves the right task classes, maintainers accept the work, regressions stay low, cost is predictable, and human supervision falls instead of growing.

That is the practical standard for coding agent evaluation metrics. Public benchmarks tell you who deserves a closer look. Your own scorecard tells you who deserves repository access.
