Coding Agent Evaluation Metrics for Real Repos

The short answer: coding agent evaluation metrics should start with your own repositories, not a public leaderboard. Use public benchmarks to understand the market, then decide adoption with a private scorecard that measures task resolution, maintainer acceptance, cost per accepted task, time to usable PR, intervention rate, regressions, partial progress, and process quality.

That matters because a benchmark pass is not the same as a mergeable change. As of 2026, the evidence points in one direction: tests are necessary, but they are not enough to decide whether a coding agent belongs in your engineering workflow.

Why Leaderboard Scores Are Not Rollout Proof

SWE-bench changed the discussion because it moved evaluation closer to real software work. Instead of asking a model to solve a small function-level exercise, it gives an agent a GitHub repository and an issue, then checks whether the submitted patch passes fail-to-pass tests while avoiding pass-to-pass regressions.
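The pass criterion described above can be sketched in a few lines. This is a minimal illustration, not SWE-bench's actual harness: `CaseOutcome`, `is_resolved`, and the test names are hypothetical, and treating a missing test as a failure is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseOutcome:
    # Hypothetical record: a test identifier and its status after the patch.
    name: str
    passed: bool


def is_resolved(fail_to_pass: set[str], pass_to_pass: set[str],
                outcomes: list[CaseOutcome]) -> bool:
    """True only if every fail-to-pass test now passes and no
    pass-to-pass test regressed. Tests with no outcome count as failed."""
    passed = {o.name for o in outcomes if o.passed}
    return fail_to_pass <= passed and pass_to_pass <= passed


outcomes = [
    CaseOutcome("test_bug_is_fixed", True),
    CaseOutcome("test_existing_behavior", True),
    CaseOutcome("test_other_feature", False),
]

# The issue test passes and the one tracked regression test still passes.
print(is_resolved({"test_bug_is_fixed"}, {"test_existing_behavior"}, outcomes))
# Tracking the broken feature test as pass-to-pass flips the verdict.
print(is_resolved({"test_bug_is_fixed"},
                  {"test_existing_behavior", "test_other_feature"}, outcomes))
```

Note that the verdict depends entirely on which tests are tracked, which is one reason a resolved task can still be an unmergeable change.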

That is useful. It is also incomplete.

SWE-bench Verified narrowed the original benchmark to 500 human-filtered instances, and its official reporting includes submitted tasks, completed tasks, resolved tasks, unresolved tasks, empty patches, errors, and resolution rate. Those metrics give buyers a cleaner comparison than many older code-generation tests.

The problem is that high public scores can become noisy. OpenAI stopped recommending SWE-bench Verified for frontier launches after finding contamination risk, flawed task design, and material test or problem issues in 59.4% of 138 frequently failed audited tasks. METR later found that roughly half of the PRs that passed SWE-bench Verified, produced by agents from mid-2024 through mid or late 2025, would not be merged by maintainers. Maintainer merge rates were about 24 percentage points lower than automated SWE-bench scores.

Implication: a public resolve rate can be a market signal. It should not be your production approval gate.

What The Main Benchmarks Actually Measure

Before choosing metrics, separate the benchmark type from the adoption question. A coding agent that performs well on issue repair may still struggle with a migration, an end-to-end project build, or a review process with strict ownership rules.

| Benchmark or approach | Evaluation unit | Useful signal | Main limitation for rollout |
| --- | --- | --- | --- |
| SWE-bench and SWE-bench Verified | GitHub issue repair | Patch resolves issue tests and avoids known regressions | Public exposure, test artifacts, and mergeability gap |
| SWE-Bench Pro | Professional repo repair tasks | Contamination-resistant resolve rate across more diverse repos | Still not your repo, team, CI, or review standard |
| Terminal-Bench 2.0 | Hard command-line tasks in unique environments | Runtime, cost, token use, failures, and resolution confidence intervals | Non-determinism and public-repo contamination risk remain |
| ProjDevBench | End-to-end project construction | Architecture, functional correctness, and refinement quality | Reported overall acceptance was 27.38%, showing the gap from issue repair to full project work |
| SWE-EVO | Release-sized software evolution | Large change behavior and partial progress through Fix Rate | Still an emerging benchmark pattern |

The common lesson is not that one benchmark is right and the others are wrong. The lesson is that each benchmark answers a narrower question than an engineering leader usually has.

Your real question is different: can this agent complete representative work in your repositories at an acceptable level of cost, latency, regression risk, maintainability, and human oversight?

Coding Agent Evaluation Metrics For Your Own Repos

A practical scorecard should combine outcome metrics, review metrics, cost metrics, time metrics, intervention metrics, and process metrics. The table below is the core set.

| Metric | What it tells you | How to read it |
| --- | --- | --- |
| `task_resolve_rate` | Percentage of representative repo tasks that pass required tests | Useful baseline, but never sufficient alone |
| `maintainer_acceptance_rate` | Percentage of agent changes humans would merge | The best bridge between tests and production trust |
| `cost_per_accepted_task` | Total agent, API, CI, review, and rework cost divided by accepted tasks | More meaningful than raw token spend |
| `time_to_usable_pr` | Elapsed time until an acceptable patch or PR exists | Captures whether the workflow helps delivery speed |
| `human_intervention_rate` | Approvals, interrupts, takeovers, and clarification loops | Shows how autonomous the workflow really is |
| `regression_rate` | Pass-to-pass failures, CI failures after merge, reverts, and incidents | Measures the hidden cost of low-quality automation |
| `partial_progress` | Useful failed attempts, fixed tests, or reusable scaffolding | Prevents binary scoring from undervaluing useful work |
| `process_quality` | Plan quality, verification coverage, recovery behavior, abstention, and atomic commits | Explains whether success is repeatable or fragile |

A simple benchmark score answers "did the tests pass?" This scorecard answers "would we trust this workflow with more of our backlog?"
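As a rough sketch of how the rate-style metrics could be computed from per-task records, consider the following. The `TaskResult` fields are hypothetical, and defining `human_intervention_rate` as the share of tasks that needed at least one intervention is an assumption; a team could just as reasonably track interventions per task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    # Hypothetical per-task record; field names are illustrative only.
    tests_passed: bool   # required tests pass
    accepted: bool       # a maintainer would merge the change
    interventions: int   # approvals, interrupts, takeovers, clarification loops
    regressed: bool      # pass-to-pass failure, revert, or incident


def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Aggregate per-task records into scorecard rates between 0 and 1."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    return {
        "task_resolve_rate": sum(r.tests_passed for r in results) / n,
        "maintainer_acceptance_rate": sum(r.accepted for r in results) / n,
        # Assumption: share of tasks with at least one human intervention.
        "human_intervention_rate": sum(r.interventions > 0 for r in results) / n,
        "regression_rate": sum(r.regressed for r in results) / n,
    }


# Synthetic example data, not measurements.
results = [
    TaskResult(tests_passed=True, accepted=True, interventions=0, regressed=False),
    TaskResult(tests_passed=True, accepted=False, interventions=2, regressed=False),
    TaskResult(tests_passed=True, accepted=True, interventions=1, regressed=True),
    TaskResult(tests_passed=False, accepted=False, interventions=0, regressed=False),
]
for metric, value in summarize(results).items():
    print(f"{metric}: {value:.2f}")
```

In this synthetic sample three of four tasks pass tests but only two would be merged, which is exactly the gap the scorecard is meant to expose.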

Build The Evaluation Set From Real Engineering Work

A repo-grounded eval should be built from historical issues, bug fixes, test failures, migrations, and small features. The tasks should resemble the work you may actually delegate.

Start with a balanced sample, not the easiest tickets in the backlog. Include documentation updates, bug fixes, test additions, feature work, refactors, dependency updates, and risky areas of the codebase. The MSR 2026 task-stratified PR study found documentation tasks had 82.1% acceptance while features had 66.1% acceptance. A global acceptance score can therefore flatter an agent that receives easier work.

For each task, record the expected behavior, repo state, allowed tools, timeout, test commands, review rubric, and the human intervention policy. If the agent can ask clarifying questions, score that as part of the workflow. If it can only produce a patch from static issue text, score that separately.
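One way to keep those fields consistent across tasks is a small record type. The sketch below is illustrative: `EvalTask`, its field names, the bucket labels, and the example values are all hypothetical, and the commit is left as a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalTask:
    """One evaluation task. All field names are hypothetical."""
    task_id: str
    task_type: str                 # e.g. "bug_fix", "docs", "feature"
    repo_commit: str               # repo state the agent starts from
    expected_behavior: str
    test_commands: tuple[str, ...]
    allowed_tools: tuple[str, ...]
    timeout_minutes: int
    review_rubric: str             # identifier of the rubric reviewers use
    clarification_allowed: bool    # part of the human intervention policy

    def problems(self) -> list[str]:
        """Return reasons this task is not ready to run (empty if none)."""
        found = []
        if not self.expected_behavior.strip():
            found.append("missing expected behavior")
        if not self.test_commands:
            found.append("no test commands")
        if self.timeout_minutes <= 0:
            found.append("timeout must be positive")
        return found

    @property
    def scoring_bucket(self) -> str:
        # Interactive and static-issue-text runs are scored separately.
        return "interactive" if self.clarification_allowed else "static_issue_text"


task = EvalTask(
    task_id="bugfix-001",
    task_type="bug_fix",
    repo_commit="<commit-sha>",  # placeholder, fill in the real starting commit
    expected_behavior="Parser rejects empty input with a clear error",
    test_commands=("pytest tests/test_parser.py",),
    allowed_tools=("shell", "editor"),
    timeout_minutes=30,
    review_rubric="default",
    clarification_allowed=False,
)
print(task.problems(), task.scoring_bucket)
```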

Score Review Quality Separately From Test Results

Automated tests give you a repeatable signal. Maintainer review gives you the signal that matters before merge.

Use a review rubric that forces reviewers to separate rejection reasons. A patch may pass tests but still be rejected because it is too broad, hard to maintain, inconsistent with local style, missing edge cases, unsafe under concurrency, or solving the wrong problem. Those are different failures and should be counted differently.

A useful review rubric can use five decisions:

Merge as is: acceptable without material changes.
Merge after small edits: conceptually correct with minor cleanup.
Needs revision: useful direction, but not ready.
Reject: wrong, risky, or too costly to repair.
Should have abstained: the agent should not have attempted the task given the available context.

That last category matters. A coding agent that knows when to stop can be safer than one that always produces a confident patch.
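The five decisions and the separate rejection reasons can be captured with an enum and a counter. This is a sketch under assumptions: the names are hypothetical, the reason strings are examples, and counting "merge after small edits" as accepted is a policy choice a team may not share.

```python
from collections import Counter
from enum import Enum
from typing import Optional


class Decision(Enum):
    MERGE_AS_IS = "merge as is"
    MERGE_AFTER_SMALL_EDITS = "merge after small edits"
    NEEDS_REVISION = "needs revision"
    REJECT = "reject"
    SHOULD_HAVE_ABSTAINED = "abstain should have happened"


# Assumption: both merge decisions count toward acceptance.
ACCEPTED = {Decision.MERGE_AS_IS, Decision.MERGE_AFTER_SMALL_EDITS}


def tally(reviews: list[tuple[Decision, Optional[str]]]) -> tuple[float, Counter]:
    """Return (acceptance rate, counts of reasons for non-accepted changes)."""
    if not reviews:
        raise ValueError("no reviews to tally")
    accepted = sum(decision in ACCEPTED for decision, _ in reviews)
    reasons = Counter(
        reason or "unspecified"
        for decision, reason in reviews
        if decision not in ACCEPTED
    )
    return accepted / len(reviews), reasons


# Synthetic reviews: each pairs a decision with an optional reason.
reviews = [
    (Decision.MERGE_AS_IS, None),
    (Decision.MERGE_AFTER_SMALL_EDITS, None),
    (Decision.NEEDS_REVISION, "missing edge cases"),
    (Decision.REJECT, "too broad"),
    (Decision.REJECT, "too broad"),
    (Decision.SHOULD_HAVE_ABSTAINED, "insufficient context"),
]
rate, reasons = tally(reviews)
print(f"acceptance rate: {rate:.2f}")
print(reasons.most_common())
```

Keeping the reasons as a counter rather than a single rejection number makes it visible when one failure mode, such as overly broad patches, dominates.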

Measure Cost Per Accepted Task, Not Cost Per Token

Token cost is only one input. The business metric is cost per accepted task.

Include model or API cost, agent runtime, CI minutes, review time, rework time, and any incident or rollback cost tied to the change. Terminal-Bench reports cost, runtime, token usage, timeout patterns, and failure patterns because these factors change the operational answer. GitTaskBench goes further by combining success rate, token cost, and developer salaries into an economic benefit estimate.

The calculation can be simple:

cost_per_accepted_task =
  (agent_cost + ci_cost + review_cost + rework_cost) / accepted_tasks

If Agent A has a higher resolve rate but needs twice as much review time, it may be worse than Agent B for your team. If Agent C fails often but leaves useful tests and scaffolding, it may still be worth using for a narrow task class. The metric has to match the workflow, not the invoice alone.
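A worked version of the formula makes the Agent A versus Agent B comparison concrete. The figures below are invented for illustration only, and `AgentCosts` is a hypothetical container; returning infinity when nothing was accepted is one possible convention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentCosts:
    # Hypothetical totals over one evaluation run, in a single currency.
    agent_cost: float
    ci_cost: float
    review_cost: float
    rework_cost: float
    accepted_tasks: int


def cost_per_accepted_task(c: AgentCosts) -> float:
    """(agent + CI + review + rework cost) / accepted tasks."""
    if c.accepted_tasks == 0:
        return float("inf")  # nothing accepted: no finite unit cost
    total = c.agent_cost + c.ci_cost + c.review_cost + c.rework_cost
    return total / c.accepted_tasks


# Invented numbers: Agent A gets more tasks accepted but needs
# twice the review cost of Agent B.
agent_a = AgentCosts(agent_cost=200, ci_cost=50, review_cost=600,
                     rework_cost=150, accepted_tasks=40)
agent_b = AgentCosts(agent_cost=200, ci_cost=50, review_cost=300,
                     rework_cost=100, accepted_tasks=35)

print(f"Agent A: {cost_per_accepted_task(agent_a):.2f}")  # 25.00
print(f"Agent B: {cost_per_accepted_task(agent_b):.2f}")  # 18.57
```

With these made-up inputs, Agent A's extra accepted tasks do not offset its review burden, so Agent B is cheaper per accepted task.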

Track Time To Usable PR And Human Intervention

GitHub Copilot cloud-agent documentation points teams toward PR lifecycle metrics such as total PRs created and merged, Copilot-created PRs merged, and median time to merge. That is the right category of measurement because it ties agent work to delivery flow.

For internal evaluation, split timing into stages:

Time from assignment to first patch.
Time from first patch to passing local tests.
Time from patch to review-ready PR.
Time from review to accepted change.
Time from merge to any regression or revert.

Then record human involvement at each stage. Count approvals, clarification loops, interrupts, takeovers, and manual repairs. Anthropic's agent eval guidance is relevant here because it treats an agent as the model plus the harness. A high-performing model in a weak harness may create more supervision burden than a slightly weaker model in a better workflow.
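The stage split can be computed from timestamped events. The sketch below assumes a hypothetical event log keyed by stage name and covers the stages up to acceptance; time from merge to any regression is left out because it is open-ended. The timestamps are invented.

```python
from datetime import datetime, timedelta

# Hypothetical stage names, in the order a task moves through them.
STAGES = ["assigned", "first_patch", "tests_passing", "review_ready", "accepted"]


def stage_durations(events: dict[str, datetime]) -> dict[str, timedelta]:
    """Elapsed time between consecutive stages that were both reached."""
    durations = {}
    for start, end in zip(STAGES, STAGES[1:]):
        if start in events and end in events:
            durations[f"{start}->{end}"] = events[end] - events[start]
    return durations


def time_to_usable_pr(events: dict[str, datetime]) -> timedelta:
    """Time from assignment until a review-ready PR exists."""
    return events["review_ready"] - events["assigned"]


# Invented timestamps for a single task.
events = {
    "assigned": datetime(2026, 1, 5, 9, 0),
    "first_patch": datetime(2026, 1, 5, 9, 20),
    "tests_passing": datetime(2026, 1, 5, 9, 50),
    "review_ready": datetime(2026, 1, 5, 10, 5),
    "accepted": datetime(2026, 1, 5, 14, 5),
}
for stage, elapsed in stage_durations(events).items():
    print(stage, elapsed)
print("time to usable PR:", time_to_usable_pr(events))
```

In this example the agent needs about an hour to reach a review-ready PR, while human review takes four hours, so the bottleneck is supervision rather than generation.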

Use Partial Progress Without Letting It Hide Failure

Binary pass or fail scoring is clean, but it can miss value. SWE-EVO adds Fix Rate as a partial-progress metric for release-sized software evolution, where a task may involve many files and hundreds of tests. That is closer to how large engineering work behaves.

Score partial progress only when it is useful to a human maintainer. Examples include a new failing test that captures the bug, a correct diagnosis, a narrow patch that fixes part of the issue, or scaffolding that reduces future work. Do not count noisy edits, broad rewrites, or code that creates review debt.

A simple scale works:

| Score | Meaning |
| --- | --- |
| 0 | No useful progress |
| 1 | Useful diagnosis or test, but no safe patch |
| 2 | Partial safe patch that reduces remaining work |
| 3 | Nearly acceptable patch with limited review fixes needed |

Keep partial progress separate from acceptance. Otherwise, a team can talk itself into shipping almost-correct changes.
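One way to enforce that separation is to report the two numbers side by side and score partial progress only on attempts that were not accepted. This is a sketch with hypothetical names and synthetic data, not a prescribed method.

```python
from enum import IntEnum


class Progress(IntEnum):
    NONE = 0                # no useful progress
    DIAGNOSIS_OR_TEST = 1   # useful diagnosis or test, but no safe patch
    PARTIAL_SAFE_PATCH = 2  # partial safe patch that reduces remaining work
    NEARLY_ACCEPTABLE = 3   # nearly acceptable, limited review fixes needed


def report(attempts: list[tuple[bool, Progress]]) -> dict[str, float]:
    """attempts: (accepted, progress) pairs. Progress is averaged over
    non-accepted attempts only, so it can never inflate acceptance."""
    if not attempts:
        raise ValueError("no attempts to report")
    failed = [int(progress) for accepted, progress in attempts if not accepted]
    return {
        "acceptance_rate": sum(accepted for accepted, _ in attempts) / len(attempts),
        "mean_partial_progress_on_failures": sum(failed) / len(failed) if failed else 0.0,
    }


# Synthetic attempts: one accepted change and three failures.
attempts = [
    (True, Progress.NEARLY_ACCEPTABLE),  # accepted, progress score ignored
    (False, Progress.NONE),
    (False, Progress.DIAGNOSIS_OR_TEST),
    (False, Progress.NEARLY_ACCEPTABLE),
]
print(report(attempts))
```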

Stratify Results Before You Choose A Model Or Workflow

No single global number is enough. The MSR 2026 study warned that observational PR data can confound agent ability with task assignment. In plain terms: if one agent gets mostly documentation tasks and another gets feature work, the aggregate acceptance rate is not a fair comparison.

Stratify every evaluation by at least these dimensions:

Task type: docs, tests, bug fix, feature, migration, refactor.
Risk level: low-risk internal change, customer-facing change, security-sensitive change.
Change size: files touched, lines changed, test count affected.
Repo area: owned service, shared library, legacy module, build system.
Language and framework: Python, TypeScript, Java, Go, mobile, infrastructure.
Interaction mode: autonomous, human-approved, interactive, shadow PR only.

This makes the decision more precise. You may find that an agent is ready for test generation and documentation, useful with approval for small bug fixes, and not ready for cross-service migrations.
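The confounding problem is easy to reproduce with synthetic data. In the sketch below, which uses invented records and hypothetical function names, Agent B has the higher acceptance rate within every task type yet the lower aggregate rate, simply because it was assigned more feature work.

```python
from collections import defaultdict


def acceptance_by(records, key):
    """Acceptance rate grouped by key(agent, task_type)."""
    counts = defaultdict(lambda: [0, 0])  # group -> [accepted, total]
    for agent, task_type, accepted in records:
        cell = counts[key(agent, task_type)]
        cell[0] += int(accepted)
        cell[1] += 1
    return {group: acc / total for group, (acc, total) in counts.items()}


def make(agent, task_type, accepted, rejected):
    """Build synthetic (agent, task_type, accepted) records."""
    return ([(agent, task_type, True)] * accepted
            + [(agent, task_type, False)] * rejected)


# Invented assignment: A gets mostly docs, B gets mostly features.
records = (
    make("A", "docs", 7, 1) + make("A", "feature", 1, 1)
    + make("B", "docs", 2, 0) + make("B", "feature", 5, 3)
)

overall = acceptance_by(records, key=lambda agent, task_type: agent)
stratified = acceptance_by(records, key=lambda agent, task_type: (agent, task_type))

print(overall)     # A: 0.8, B: 0.7
print(stratified)  # A docs 0.875, A feature 0.5, B docs 1.0, B feature 0.625
```

The aggregate ranks A above B, while the stratified view ranks B above A on both docs and features, so only the stratified numbers support a fair comparison.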

A Practical Rollout Model

The rollout should move from comparable public evidence to private evidence, then to limited production use.

1. Use public benchmarks as a first filter. SWE-Bench Pro, Terminal-Bench, ProjDevBench, and similar evals can identify credible systems and expose broad weaknesses.
2. Run a private repo eval. Use historical tasks with known outcomes, representative tests, and a human review rubric.
3. Require maintainer scoring. Count mergeability, rejection reasons, review time, and abstention quality.
4. Run shadow PRs. Let the agent produce changes without merging them. Compare against human fixes where possible.
5. Open a narrow pilot. Start with task classes where the scorecard shows clear value and low regression risk.
6. Monitor after merge. Track CI failures, reverts, incidents, latency, cost per accepted task, and intervention rate over time.

The Scorecard To Use

For each agent or agent workflow, record the following:

| Category | Decision metric |
| --- | --- |
| Correctness | Task resolve rate, pass-to-pass regression rate, CI pass rate |
| Review | Maintainer acceptance rate, rejection reasons, review time |
| Economics | Cost per accepted task, CI cost, rework cost |
| Speed | Time to usable PR, median time to merge |
| Oversight | Human intervention rate, clarification loops, takeovers |
| Reliability | Reverts, incidents, flaky-test interactions, timeout rate |
| Process | Plan quality, verification coverage, recovery behavior, abstention |

The strongest adoption signal is not a single high score. It is a stable pattern: the agent solves the right task classes, maintainers accept the work, regressions stay low, cost is predictable, and human supervision falls instead of growing.

That is the practical standard for coding agent evaluation metrics. Public benchmarks tell you who deserves a closer look. Your own scorecard tells you who deserves repository access.