
Agent Harness for Coding Agents: Runtime Architecture

A coding agent needs more than a model with shell access. The model proposes actions. The harness owns the operating layer that decides where those actions run, what they can touch, how long they can continue, what state survives, and how the resulting change reaches review.

That distinction matters when teams move from a local CLI agent to an operated system. In a local session, a developer can watch the terminal, approve commands, inspect diffs, and stop the run when it drifts. At team scale, agents run from issues, PR comments, schedules, CI events, or task queues. They run in parallel. They may run unattended for hours. The product risk shifts from "can the model code?" to "can the surrounding system control the work?"

An agent harness for coding agents is that surrounding system. It gives the agent a workspace, tool surface, sandbox, permission model, durable state, lifecycle controls, feedback channels, and review output. It is narrower than a generic agent framework and more operational than a prompt template.

What the harness owns beyond the model

The model-driven loop is simple. A task enters a driver. The driver calls the model. The model requests tools. The harness executes those tools in a workspace or sandbox. Results return to state. The loop repeats until completion, escalation, or failure.
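The loop described above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: `ToolCall`, `ModelTurn`, `run_loop`, and the `call_model` and `execute_tool` callables are hypothetical names, and the step budget stands in for whatever limit a real harness applies.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class ModelTurn:
    tool_calls: list = field(default_factory=list)
    done: bool = False          # model reports the task is complete
    needs_human: bool = False   # model asks for escalation


def run_loop(task: str,
             call_model: Callable[[dict], ModelTurn],
             execute_tool: Callable[[ToolCall], str],
             max_steps: int = 20):
    """Drive the model/tool loop until completion, escalation, or failure."""
    state = {"task": task, "history": []}
    for _ in range(max_steps):
        turn = call_model(state)
        if turn.done:
            return "completed", state
        if turn.needs_human:
            return "escalated", state
        for call in turn.tool_calls:
            try:
                result = execute_tool(call)  # would run inside the workspace or sandbox
            except Exception as exc:
                state["history"].append((call.name, f"error: {exc}"))
                return "failed", state
            state["history"].append((call.name, result))  # results return to state
    return "failed", state  # step budget exhausted
```

Everything interesting about a harness lives behind `execute_tool` and around `state`: where the call runs, what it may touch, and what is persisted when the loop exits.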

The useful engineering work sits around that loop. A coding-agent harness owns the parts that make the loop safe enough and concrete enough for software delivery:

  • Task intake from issues, PR comments, tickets, schedules, CI events, or a custom queue.
  • Repository checkout, dependency setup, and workspace selection.
  • Shell, filesystem, browser, package manager, git, test runner, and MCP tool execution.
  • Permission policy for reads, writes, network calls, external directories, secrets, and destructive commands.
  • Technical isolation through local OS sandboxing, containers, microVMs, VMs, cloud sandboxes, or dedicated development environments.
  • Durable state outside the context window, including plans, logs, diffs, commits, traces, checkpoints, snapshots, and review comments.
  • Lifecycle controls for retry, pause, resume, cancel, timeout, cleanup, and cost limits.
  • Review output, usually a diff, branch, draft PR, status comment, or follow-up task.

This is why a coding agent harness is not the same thing as an agent orchestration framework. Frameworks such as LangGraph, OpenAI Agents SDK, Temporal-style workflows, and similar systems provide durable execution, tool calls, state, traces, handoffs, and human-in-the-loop patterns. Those are useful primitives. They do not, by themselves, solve repo setup, branch management, CI validation, filesystem policy, secret scoping, or PR handoff.

Why local CLI agents stop being enough

Local interactive agents work because the developer is part of the runtime. The person watches the run, answers prompts, notices bad assumptions, kills risky commands, and resolves conflicts. That is a reasonable operating model for one task in one repository.

It breaks down when the team wants agents to work as background labor. The hard problems become operational:

  • Multiple agents need separate workspaces so they do not overwrite each other.
  • Long tasks need state that survives context resets and process restarts.
  • Humans need compact review packets, not hours of transcript archaeology.
  • Secrets and network access need boundaries before an agent runs package installers or arbitrary shell commands.
  • Bad runs need cancellation, rollback, and cleanup, not a half-mutated checkout on a developer laptop.

The same market pattern shows up across Codex cloud, GitHub Copilot cloud agent, Claude Code GitHub Actions, Docker Sandboxes, E2B, Daytona, Windmill, Temporal sandbox orchestration examples, and OpenAI Symphony. The common shape is a pipeline: a task queue or issue tracker feeds an isolated workspace, a model and tool loop runs inside it under permission and network policy, lifecycle handling surrounds the run, and the output is logs, diffs, PRs, and metrics.

That is the practical definition of an operated coding-agent system.

The reference architecture

A production-grade coding agent harness usually has eight layers.

1. Task intake

The task does not need to start in a bespoke agent dashboard. In practice, the queue is often an existing developer system: GitHub issues, PR comments, Linear tickets, CI events, cron schedules, or agent tabs inside GitHub.

OpenAI Symphony frames Linear as a control plane where every open task can get an agent. GitHub Copilot cloud agent runs asynchronously from GitHub workflows and produces branches, logs, and PRs. The point is not the specific product. The point is that the harness meets developers where work already lives.

2. Driver and model adapter

The driver manages the loop. It passes the task, repository context, instructions, and tool results to the model. The model adapter hides differences between Codex, Claude Code, OpenCode, Amp, Cline, Copilot, or a custom agent.

This adapter layer is still immature across the market. Transcript formats, exit codes, auth models, permission names, and resume behavior vary by vendor. A serious platform should treat the agent CLI as an integration point, not as the whole product.
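One way to treat the agent CLI as an integration point is a narrow adapter interface that normalizes status and resume behavior. The sketch below is an assumption about what such a layer could look like: `AgentAdapter`, `RunResult`, `EchoAdapter`, and `drive` are hypothetical names, and no real vendor CLI is invoked.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class RunResult:
    status: str                      # normalized: "completed", "failed", "needs_input"
    transcript: list
    resume_token: Optional[str] = None


class AgentAdapter(Protocol):
    """Hypothetical interface each vendor-specific adapter would implement."""

    def start(self, task: str, workspace: str) -> RunResult: ...
    def resume(self, resume_token: str, message: str) -> RunResult: ...


class EchoAdapter:
    """Stand-in adapter for illustration; a real one would wrap an agent CLI."""

    def start(self, task: str, workspace: str) -> RunResult:
        return RunResult("needs_input", [f"started {task} in {workspace}"], resume_token="r1")

    def resume(self, resume_token: str, message: str) -> RunResult:
        return RunResult("completed", [f"resumed {resume_token}: {message}"])


def drive(adapter: AgentAdapter, task: str, workspace: str) -> RunResult:
    """The control plane only sees normalized results, never vendor formats."""
    result = adapter.start(task, workspace)
    if result.status == "needs_input" and result.resume_token:
        result = adapter.resume(result.resume_token, "continue")
    return result
```

The design choice is that vendor-specific transcript formats, exit codes, and permission names are translated inside the adapter, so the rest of the harness does not change when the agent does.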

3. Workspace manager

Workspace design is a first-order product decision. Common options include a local working directory, git worktree, ephemeral cloud checkout, persistent sandbox volume, VM or microVM, containerized development environment, and dedicated branch or PR workspace.

The field is converging on one agent per isolated workspace. Git worktrees, containers, microVMs, and cloud sandboxes all support the same operational goal: parallel work without overlapping filesystem mutations.
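As an illustration of one agent per isolated workspace, the sketch below plans a git worktree per task. The directory layout, the `agent/` branch prefix, and the `worktree_plan` helper are assumptions made for the example; it only builds the `git worktree` commands and does not run them.

```python
import re
from pathlib import PurePosixPath


def worktree_plan(repo_root: str, task_id: str, base_ref: str = "main") -> dict:
    """Return the branch, path, and git commands for a per-task worktree."""
    # Reject task ids that could escape the worktree directory or break ref names.
    if not re.fullmatch(r"[A-Za-z0-9][A-Za-z0-9_-]*", task_id):
        raise ValueError(f"unsafe task id: {task_id!r}")
    branch = f"agent/{task_id}"
    # Hypothetical layout: worktrees live next to the main checkout.
    path = str(PurePosixPath(repo_root).parent / "agent-worktrees" / task_id)
    return {
        "branch": branch,
        "path": path,
        "create": ["git", "-C", repo_root, "worktree", "add", "-b", branch, path, base_ref],
        "remove": ["git", "-C", repo_root, "worktree", "remove", "--force", path],
    }
```

Two tasks get two directories and two branches, so their edits cannot overlap. As the sandbox comparison later in this post notes, a worktree isolates edits but not processes, network, or secrets.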

4. Tool executor

Coding agents need a wider tool surface than chat agents. They read files, edit files, run shell commands, install packages, invoke compilers, use test runners, call browsers, interact with MCP servers, inspect git state, and sometimes use cloud credentials.

The harness should execute these actions through a controlled tool layer. Direct shell access without containment is an operational shortcut, not a complete runtime strategy.

5. Sandbox and permission policy

Sandboxing and permissions are separate. A permission system decides whether the agent is allowed to attempt an action. A sandbox enforces the technical boundary if the action runs.

OpenAI Codex documentation describes sandbox modes and approval policies, with network access off by default for local runs and write access typically limited to the active workspace. Claude Code documentation explicitly separates permissions from sandboxing: permissions gate tools, files, and domains, while sandboxing enforces restrictions on Bash and child processes unless the whole process runs inside a broader boundary.

This distinction is not academic. Community issues around OpenCode asked for true filesystem sandboxing for shell and subprocesses, separate from tool permissions. The request makes sense: an agent can comply with a tool permission model while a child process still does damage unless the operating environment enforces boundaries.
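The two layers can be shown as two separate checks. In this sketch the policy table, the action kinds, and the function names are hypothetical, and `within_boundary` is only a lexical stand-in: a real sandbox enforces the boundary at the OS, container, or VM level so that child processes are covered too.

```python
import posixpath

# Hypothetical permission policy: may the agent attempt this kind of action?
POLICY = {"read": "allow", "write": "allow", "delete": "ask"}


def permission_decision(kind: str) -> str:
    return POLICY.get(kind, "deny")  # unknown action kinds are denied


def within_boundary(workspace: str, target: str) -> bool:
    """Lexical containment check; `workspace` is assumed to be an absolute path."""
    ws = posixpath.normpath(workspace)
    resolved = posixpath.normpath(posixpath.join(ws, target))
    return posixpath.commonpath([ws, resolved]) == ws


def evaluate(kind: str, target: str, workspace: str = "/work/task-1") -> str:
    decision = permission_decision(kind)
    if decision == "deny":
        return "denied_by_policy"
    if not within_boundary(workspace, target):
        return "blocked_by_boundary"  # policy said yes, the boundary still says no
    return "needs_approval" if decision == "ask" else "run"
```

With these assumptions, `evaluate("write", "../../etc/passwd")` returns `"blocked_by_boundary"` even though writes are permitted by policy, which is the gap the OpenCode requests were pointing at.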

6. Durable state

Long-running agents cannot rely on the model context window as the system of record. The harness needs external state: current task status, plan artifacts, progress logs, transcript, tool calls, diffs, commits, checkpoints, sandbox snapshots, and review comments.

Anthropic's long-running harness work identifies discrete sessions with no memory as the core problem for multi-hour or multi-day tasks. The proposed answer is structured artifacts: initializer setup, task decomposition, incremental progress files, handoff state, and disciplined continuation across context resets.
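A progress file is the simplest form of this external state. The sketch below is an assumed layout, not Anthropic's format: the `plan`, `completed`, and `notes` keys and the helper names are hypothetical. It writes the file atomically so a crash mid-write does not corrupt the record, and a fresh session can pick up the next unfinished step.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write state to a temp file, then atomically swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f, indent=2)
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> dict:
    """Load prior state, or start empty if this is the first session."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"plan": [], "completed": [], "notes": []}


def next_step(state: dict):
    """Return the first planned step that is not yet completed, or None."""
    done = set(state["completed"])
    return next((step for step in state["plan"] if step not in done), None)
```

A new session with an empty context window can call `load_checkpoint` and `next_step` and continue where the previous one stopped.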

7. Feedback and validation

A useful harness makes the repository legible to the agent. Tests, logs, traces, browser automation, screenshots, metrics, docs, and architectural constraints become feedback channels.

OpenAI's harness engineering report describes making repository knowledge, tests, docs, logs, metrics, traces, browser validation, and cleanup loops available to Codex. The lesson is direct: mature harnesses convert vague intent into verifiable software changes through mechanical feedback.
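Mechanical feedback can be as simple as running a check and handing the agent a structured result instead of raw terminal output. The sketch below is illustrative only: `run_check` and its result keys are hypothetical, and in a real harness the command would execute inside the sandbox rather than on the host.

```python
import subprocess
import sys


def run_check(cmd: list, timeout_s: int = 60, tail_chars: int = 2000) -> dict:
    """Run a validation command and return a compact, model-readable result."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return {"passed": False, "exit_code": None,
                "output_tail": f"timed out after {timeout_s}s"}
    # Keep only the tail so long logs do not flood the context window.
    output = (proc.stdout + proc.stderr)[-tail_chars:]
    return {"passed": proc.returncode == 0, "exit_code": proc.returncode,
            "output_tail": output}


if __name__ == "__main__":
    # Stand-in for a real test runner invocation.
    print(run_check([sys.executable, "-c", "print('2 passed')"]))
```

The same shape works for linters, type checkers, or a browser smoke test: a pass or fail signal plus the last part of the output.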

8. Review and handoff

The output should be reviewable. A finished run should produce a clear diff, branch, commit series, draft PR, status note, test result, and enough logs to explain important decisions. Humans should not need to reconstruct the run from raw transcript unless something failed.

GitHub Copilot cloud agent, Codex cloud, and Symphony all point toward the same review model: run the agent in the background, then return a diff or PR with logs and follow-up options.
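A review packet can be modeled as a small record that the harness fills in at the end of a run. The field names and the `summary` layout below are assumptions for illustration, not the format of any of the products named above.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewPacket:
    task_id: str
    branch: str
    files_changed: list
    tests_passed: bool
    decisions: list = field(default_factory=list)  # short notes on important choices

    def summary(self) -> str:
        """Render a compact status note suitable for a PR description or comment."""
        lines = [
            f"Task: {self.task_id}",
            f"Branch: {self.branch}",
            f"Tests: {'passed' if self.tests_passed else 'FAILED'}",
            f"Files changed ({len(self.files_changed)}):",
        ]
        lines += [f"  - {path}" for path in self.files_changed]
        if self.decisions:
            lines.append("Key decisions:")
            lines += [f"  - {note}" for note in self.decisions]
        return "\n".join(lines)
```

The reviewer reads the summary and the diff first, and opens the full transcript only when something looks wrong.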

Sandbox choices are product choices

There is no single sandbox shape that fits every team. The right option depends on risk tolerance, repo setup complexity, performance, cost, and how much autonomy the agent gets.

| Sandbox pattern | Useful when | Main trade-off |
| --- | --- | --- |
| Local OS sandbox | The agent runs near the developer checkout with limited writes and network controls. | OS enforcement details can be hard, especially across platforms. |
| Git worktree per task | The team wants cheap parallelism and clean branch separation. | It isolates edits, but not necessarily processes, network, or secrets. |
| Dev container | The repo already has containerized setup and repeatable dependencies. | Container boundaries still need policy, credentials, and cleanup. |
| MicroVM or VM | The agent can run with more autonomy and stronger host separation. | Provisioning, startup time, and cost need management. |
| Cloud sandbox | The platform needs API-managed environments, snapshots, persistence, or parallel fleets. | Network, data retention, and vendor integration become governance questions. |

Docker describes its Sandboxes as isolated microVMs, each with a separate Docker daemon, filesystem, and network. E2B positions its sandboxes as full Linux environments with terminal, filesystem, and git. Daytona documents sandboxes with a dedicated kernel, filesystem, network stack, vCPU, RAM, and disk, along with snapshots and persistent operations. Windmill combines process isolation and persistent volumes around filesystem-based agents.

These products differ, but they agree on the architectural direction: if the agent is going to run code, the runtime boundary has to be real.

Security is runtime engineering, not prompt wording

Instructions help. They are not a boundary. A security-minded harness should be evaluated like production infrastructure.

The minimum bar includes egress control, secrets handling, audit logs, policy as code, least privilege, reproducible setup, isolated MCP and hook execution, branch protections, CI gates, and cleanup of orphaned compute.

Secrets need special care. Codex cloud uses a two-phase pattern in which setup scripts can receive secrets that are no longer available once the agent phase starts. That design reflects a broader principle: the package installation phase, repository access phase, and agent execution phase do not need identical credential access.
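Phase-scoped credentials can be expressed as a small mapping from secret name to the phases allowed to see it. This is a sketch under assumptions: the phase names, the secret names, and `env_for_phase` are hypothetical and do not describe how Codex cloud implements its setup phase.

```python
# Hypothetical scoping table: which phases may see which secret.
SECRET_SCOPES = {
    "NPM_TOKEN": {"setup"},          # needed only to install private packages
    "GIT_PUSH_TOKEN": {"handoff"},   # needed only to push the branch and open the PR
}


def env_for_phase(phase: str, secrets: dict, base_env: dict) -> dict:
    """Build the environment for one phase, including only secrets scoped to it."""
    env = dict(base_env)
    for name, value in secrets.items():
        # Secrets with no declared scope are never exposed.
        if phase in SECRET_SCOPES.get(name, set()):
            env[name] = value
    return env
```

Under this table the agent phase receives neither token, so a compromised dependency or a confused tool call during the run has nothing to exfiltrate.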

MCP servers and hooks also deserve scrutiny. They extend the agent's tool surface. A harness that isolates shell commands but lets a plugin reach broad credentials has only moved the weak point.

State and lifecycle decide whether agents scale

The first agent feels like a tool. The tenth agent feels like a distributed system.

At that point, lifecycle work becomes visible. You need queue priority, concurrency limits, retry policy, idle shutdown, budget controls, sandbox snapshots, recovery from failed setup, cancellation, and cleanup. You also need ownership rules so two agents do not modify the same module without coordination.

Temporal's sandbox orchestration examples describe the repeated need for sandbox provisioning, routing execution, persisted state, failure recovery, and cleanup tied to durable workflows. That is the right lens. The agent is one worker inside a larger state machine.
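Treating a run as a state machine makes retry, pause, cancel, and cleanup explicit. The states, transitions, and retry budget below are assumptions chosen for illustration, not Temporal's model or any product's actual lifecycle.

```python
# Hypothetical lifecycle: each state maps to the states it may move to.
TRANSITIONS = {
    "queued": {"provisioning", "cancelled"},
    "provisioning": {"running", "failed", "cancelled"},
    "running": {"paused", "succeeded", "failed", "cancelled"},
    "paused": {"running", "cancelled"},
    "failed": {"queued"},      # retry path
    "succeeded": set(),
    "cancelled": set(),
}


class Run:
    def __init__(self, max_retries: int = 2):
        self.state = "queued"
        self.retries = 0
        self.max_retries = max_retries

    def move(self, new_state: str) -> None:
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        if self.state == "failed" and new_state == "queued":
            if self.retries >= self.max_retries:
                raise ValueError("retry budget exhausted")
            self.retries += 1
        self.state = new_state

    def needs_cleanup(self) -> bool:
        """Sandbox compute should be released after any stop, including before a retry."""
        return self.state in {"succeeded", "failed", "cancelled"}
```

Because every transition goes through `move`, orphaned compute and unbounded retries become illegal states rather than incidents discovered later.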

How to evaluate an agent harness for coding agents

For senior developers, platform engineers, and devtool founders, the evaluation should be concrete. Do not start with the prompt. Start with the operating contract.

  • Workspace: Is there one isolated workspace per agent task? How are branches, worktrees, containers, and persistent volumes mapped?
  • Sandbox: What boundary contains shell commands, child processes, package managers, browsers, MCP servers, and network calls?
  • Permissions: Which actions require approval? Can policy differ by repo, task type, environment, or agent?
  • Secrets: Which credentials exist during setup, agent runtime, CI, PR creation, and follow-up?
  • State: What survives a context reset, crash, retry, or handoff to another agent?
  • Validation: Can the agent run the right tests, read logs, inspect traces, and use browser feedback?
  • Review: Does the output arrive as a clear diff or PR with test status and useful run history?
  • Operations: Can you pause, resume, cancel, retry, snapshot, clean up, and cap cost?
  • Audit: Can you explain what the agent did, which tools it called, which files it touched, and which policies applied?
  • Adapter strategy: Can the harness run more than one vendor agent without rewriting the whole control plane?

The practical conclusion

The core risk in operated coding agents is not that a model writes imperfect code. Developers already know how to review imperfect code. The larger risk is that the harness lets the agent work in the wrong place, with the wrong permissions, without durable state, without clean review output, and without a way to recover from failure.

A strong agent harness for coding agents makes the model useful by owning the runtime around it. It gives each task a controlled workspace, runs tools inside real boundaries, preserves state outside the context window, turns telemetry into feedback, and hands humans a reviewable software change.

That is the difference between a local assistant and an operated coding-agent system.
