Self-hosted infrastructure for a fleet of long-lived AI agents. Each agent runs in its own sandboxed container with a persistent workspace, internet access, and a pluggable driver.

The stack deploys to a single VM on plain Docker, with your own keys and data. No vendor lock-in, no per-seat SaaS tax, and nothing leaves your stack.

Which agent engines (drivers) are supported?

One identical API across swappable brains: Vanilla, Opencode, Codex, and Claude Code. The driver is part of an agent's config and can be swapped anytime.

How do you keep my data and agents secure?

Every agent runs in a hardened container: read-only root filesystem, dropped capabilities, network egress filtering, and CPU/memory limits. LLM keys are stored server-side, encrypted, and never sent to the browser.

Can I watch what an agent is doing?

Yes. Every task streams over Server-Sent Events (assistant messages, tool calls, results, file changes, a live token meter), and reconnects resume where they left off.

Accounts are provisioned by admins; you self-host the stack.

Coding Agent Cost Monitoring: Control Spend Early

For platform leaders, the practical answer is simple: coding agent cost monitoring is an observability and policy problem first, and a billing dashboard problem second. Provider dashboards tell you who spent money and which model was used. They rarely tell you which repository, pull request, retry loop, subagent, or workflow created the bill.

If your team is rolling out Codex, Claude Code, GitHub Copilot, or several tools at once, you need a cost model that connects usage to engineering outcomes. That means tracking tokens, credits, sessions, pull requests, retries, and agent modes together.

Why coding agent cost monitoring became a platform problem

Traditional developer tooling was easy to budget: seats, licenses, renewals. Coding agents change that pattern. A short chat, a long codebase exploration, a high reasoning task, and a multi-agent background run can all sit under the same product label while consuming very different amounts of budget.

The cost unit is converging around tokens, but every provider wraps that unit differently. OpenAI exposes API usage, project budgets, and Codex plan usage limits. Anthropic exposes spend limits, usage tiers, cost reporting, and OpenTelemetry metrics for Claude Code. GitHub converts input, output, and cached tokens into AI Credits, where 1 credit equals $0.01.

That makes comparison harder than it looks. The management question is not only "How much did we spend?" It is "What useful engineering result did that spend buy?"

The billing primitives you need to normalize

Before you can manage coding agent spend, define the units your internal reports will use. Provider language differs, but most teams need a common reporting layer across these primitives:

Tokens: input, output, cached tokens, and long-context usage.
Credits: product-specific units such as GitHub AI Credits or Codex plan usage.
Dollars: provider invoices, Anthropic spend limits, and marketplace charges.
Workflow cost: GitHub Actions minutes consumed by Copilot code review on private repositories, starting June 1, 2026.
Engineering output: pull requests, commits, lines changed, accepted reviews, reverted changes, and completed tasks.

Once these units are normalized, you can stop arguing about isolated token counts and start measuring spend per outcome.
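As a minimal sketch of that normalization step, the snippet below converts mixed usage records into dollars. The 1 credit = $0.01 rate comes from the GitHub figure above; the `UsageRecord` shape and the per-million-token prices are hypothetical placeholders, not provider rates.

```python
from dataclasses import dataclass

# From the text: 1 GitHub AI Credit equals $0.01.
GITHUB_CREDIT_USD = 0.01

# Hypothetical per-million-token prices, for illustration only.
# Replace them with the rates from your own provider contracts.
EXAMPLE_PRICE_PER_MTOK = {"input": 3.00, "cached": 0.30, "output": 15.00}


@dataclass
class UsageRecord:
    """One usage row. Each row should carry tokens OR credits OR dollars,
    not several at once, to avoid double counting the same work."""
    source: str
    input_tokens: int = 0
    cached_tokens: int = 0
    output_tokens: int = 0
    credits: float = 0.0
    invoiced_usd: float = 0.0


def to_usd(record: UsageRecord) -> float:
    """Convert a usage record into a single dollar figure."""
    token_usd = (
        record.input_tokens * EXAMPLE_PRICE_PER_MTOK["input"]
        + record.cached_tokens * EXAMPLE_PRICE_PER_MTOK["cached"]
        + record.output_tokens * EXAMPLE_PRICE_PER_MTOK["output"]
    ) / 1_000_000
    credit_usd = record.credits * GITHUB_CREDIT_USD
    return round(token_usd + credit_usd + record.invoiced_usd, 4)
```

Keeping the conversion in one function means the price table can change without touching the reports built on top of it.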

What the provider dashboards can and cannot tell you

OpenAI API projects can scope usage, model access, API keys, and monthly budgets. The important detail is that project budgets are soft thresholds. They alert, but API requests continue after the budget is exceeded. The OpenAI Usage Dashboard can show organization and project data, but access depends on owner or granted dashboard permissions, and it does not consolidate usage across sub-organizations.

Codex under ChatGPT plans counts toward an agentic usage limit. Larger codebases, long-running tasks, and extended sessions consume more per message, so a plan-based rollout still needs usage governance.

Anthropic gives Claude Code several useful cost surfaces: workspace spend limits, Console reporting, usage credits, and centralized organization usage. Claude Code also exports OpenTelemetry metrics such as session count, lines changed, pull request count, commit count, estimated USD cost, and token usage. Its telemetry can attribute cost to users, teams, skills, plugins, and subagent types.

GitHub Copilot Business and Enterprise pool AI Credits at the billing-entity level. Business includes 1,900 credits per user per month, while Enterprise includes 3,900. GitHub also supports budget controls at user, organization, cost-center, and enterprise levels. User-level budgets can hard-stop a single user during shared-pool and metered phases.
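The pooled allowance is simple arithmetic, sketched here with the figures quoted above (1,900 or 3,900 credits per user per month, 1 credit = $0.01). The seat count and the heavy-user figure in the usage note are invented for illustration.

```python
# Included credits per user per month, as quoted in the text.
INCLUDED_CREDITS = {"business": 1_900, "enterprise": 3_900}


def pool_credits(plan: str, seats: int) -> int:
    """Total monthly credits pooled at the billing-entity level."""
    return INCLUDED_CREDITS[plan] * seats


def credits_to_usd(credits: int) -> float:
    """1 credit equals $0.01, so 100 credits equal one dollar."""
    return credits / 100


def share_of_pool(user_credits: int, pool: int) -> float:
    """Fraction of the shared pool consumed by a single user."""
    return user_credits / pool
```

For a hypothetical 40-seat Business account, `pool_credits("business", 40)` gives 76,000 credits, or $760 of included usage. A single user who burns 19,000 credits has consumed a quarter of that pool, which is the scenario user-level budgets are meant to stop.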

The attribution gap: repo, PR, developer, model, subagent

The main weakness in provider dashboards is attribution. Organization, project, user, and model are useful dimensions, but platform teams usually need finer answers:

Which repository created the spend?
Which pull request or issue did the agent work on?
Was the spend from chat, code review, cloud agent work, CLI work, or a background task?
Which model, reasoning level, or context size drove the cost?
How much came from retries, failed runs, recursive subagents, or auxiliary work?

Claude Code telemetry moves closer to this model because it includes query source values such as main, subagent, and auxiliary, plus agent names. GitHub cost centers help with allocation. OpenAI project structure can help separate teams or applications. Even so, repo-level and pull-request-level causality often requires joining billing exports, agent telemetry, version control metadata, CI runs, and review outcomes.
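The join described above might look like the following sketch. It assumes telemetry events have already been flattened into dictionaries with `repo`, `branch`, `query_source`, and `cost_usd` keys, and that a branch-to-PR lookup has been built from version control metadata. Those field names and the lookup are hypothetical, not a provider schema.

```python
from collections import defaultdict


def spend_by_pr(events, pr_by_branch):
    """Sum estimated cost per (repo, pull request, query source).

    events: iterable of dicts with repo, branch, query_source, cost_usd.
    pr_by_branch: dict mapping (repo, branch) to a pull request number.
    Events whose branch has no known PR are grouped as "unattributed".
    """
    totals = defaultdict(float)
    for event in events:
        pr = pr_by_branch.get((event["repo"], event["branch"]))
        key = (
            event["repo"],
            pr if pr is not None else "unattributed",
            event["query_source"],
        )
        totals[key] += event["cost_usd"]
    return dict(totals)
```

The size of the "unattributed" bucket is itself a useful signal: it shows how much spend still lacks the metadata needed for repo-level or PR-level answers.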

Runaway cost patterns to watch first

The highest-risk patterns are not mysterious. They show up when agents are allowed to keep working without clear boundaries.

Pattern	Cost signal	Control to add
Long codebase exploration	Large input tokens, long sessions, repeated file reads	Per-task token budgets and context-size policy
High reasoning on routine work	Higher per-message usage without better outcome quality	Model and reasoning defaults by task type
Subagent fan-out	Sudden quota burn across main and subagent work	Subagent count caps and subagent-level telemetry
Retry loops	Repeated similar runs with low code change value	Retry backoff, failure classification, and stop rules
Stale or inaccurate counters	Local usage display differs from logs or billing export	Billing reconciliation against provider exports

Field reports point to the same weak spots: subagent visibility, broad research prompts without stopping signals, recursive spawning, hung sessions, stale counters, and quota surprises from concurrent subagents. Treat the anecdotes as failure modes, not statistics.
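Several of the controls in the table reduce to counters with hard ceilings. The sketch below shows one way to express a per-task token budget, a subagent cap, and a retry cap. `RunGuard` and `BudgetExceeded` are hypothetical names; a real agent runner would need to call these hooks at the right points.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a run crosses one of its configured ceilings."""


@dataclass
class RunGuard:
    max_tokens: int
    max_subagents: int
    max_retries: int
    tokens_used: int = 0
    subagents_spawned: int = 0
    retries: int = 0

    def charge_tokens(self, count: int) -> None:
        """Record token usage and stop the run once the budget is exceeded."""
        self.tokens_used += count
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded(f"token budget {self.max_tokens} exceeded")

    def spawn_subagent(self) -> None:
        """Allow a new subagent only while the cap has not been reached."""
        if self.subagents_spawned >= self.max_subagents:
            raise BudgetExceeded(f"subagent cap {self.max_subagents} reached")
        self.subagents_spawned += 1

    def record_retry(self) -> None:
        """Allow a retry only while the retry cap has not been reached."""
        if self.retries >= self.max_retries:
            raise BudgetExceeded(f"retry cap {self.max_retries} reached")
        self.retries += 1
```

Raising an exception rather than logging a warning is deliberate in this sketch: a stop rule that the agent can ignore behaves like a soft budget.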

A practical guardrail architecture

A mature setup has four layers: provider controls, telemetry, gateway policy, and outcome reporting.

1. Provider controls

Use the controls each vendor gives you. Set OpenAI project budgets even though they are soft alerts. Use Anthropic spend limits and rate limits where available. Configure GitHub budgets at the user, organization, cost-center, and enterprise levels. Avoid a single shared pool with no user caps, because GitHub explicitly warns that one heavy user or automated agent session can consume a disproportionate share early in the billing cycle.

2. Telemetry

Collect agent-level events with enough detail to explain cost. At minimum, capture user, team, repository, branch, pull request, model, task type, session ID, token usage, retry count, subagent type, and outcome. For Claude Code, OpenTelemetry gives a strong starting point. For other systems, you may need wrapper scripts, gateway logs, or CI annotations.
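One way to make that minimum field list concrete is a typed event record that serializes to a JSON log line. The class below is a hypothetical schema built from the fields named above, not the format of any provider's telemetry.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class AgentRunEvent:
    user: str
    team: str
    repository: str
    branch: str
    pull_request: Optional[int]   # None when the run is not tied to a PR
    model: str
    task_type: str
    session_id: str
    input_tokens: int
    output_tokens: int
    retry_count: int
    subagent_type: Optional[str]  # None for the main agent
    outcome: str                  # e.g. "merged", "failed", "abandoned"


def to_log_line(event: AgentRunEvent) -> str:
    """Serialize an event as one JSON line with stable key order."""
    return json.dumps(asdict(event), sort_keys=True)
```

Wrapper scripts, gateway logs, or CI annotations can all emit the same record, which keeps later joins simple.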

3. Gateway policy

Gateway patterns matter because billing dashboards rarely match your internal allocation model. An LLM gateway can enforce centralized budgets, rate limits, audit logging, provider routing, and model policy. It can also attach your own metadata before traffic reaches the provider.
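A gateway check can be sketched as a function that runs before any request is forwarded. The example below enforces a team budget and a per-task model policy, then attaches internal metadata. All names here (`GatewayPolicy`, `admit`, the request keys) are hypothetical and do not correspond to a specific gateway product.

```python
from dataclasses import dataclass, field


class PolicyViolation(Exception):
    """Raised when a request must not be forwarded to the provider."""


@dataclass
class GatewayPolicy:
    team_budget_usd: dict   # team name -> monthly budget in dollars
    allowed_models: dict    # task type -> set of permitted model names
    spent_usd: dict = field(default_factory=dict)

    def admit(self, request: dict) -> dict:
        """Return a tagged copy of the request, or raise PolicyViolation."""
        team = request["team"]
        if self.spent_usd.get(team, 0.0) >= self.team_budget_usd.get(team, 0.0):
            raise PolicyViolation(f"team {team!r} has exhausted its budget")
        if request["model"] not in self.allowed_models.get(request["task_type"], set()):
            raise PolicyViolation("model not permitted for this task type")
        tagged = dict(request)
        # Internal allocation metadata that provider dashboards do not carry.
        tagged["metadata"] = {
            "team": team,
            "repository": request.get("repository"),
            "pull_request": request.get("pull_request"),
        }
        return tagged

    def record_spend(self, team: str, usd: float) -> None:
        """Add the cost of a completed request to the team's running total."""
        self.spent_usd[team] = self.spent_usd.get(team, 0.0) + usd
```

Unlike a soft provider budget, this check blocks the request itself, which is what turns a threshold into a hard stop.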

4. Outcome reporting

Token totals do not tell leadership whether the rollout is working. Join usage with engineering outcomes: accepted pull requests, merged changes, reverted changes, review comments accepted, incidents avoided, and failed tasks. This is where cost monitoring becomes management information instead of invoice explanation.

Metrics platform leaders should track

Start with a small set of metrics that can drive decisions. Too many dashboards create noise; too few hide waste.

Cost per accepted PR: agent spend divided by the number of pull requests merged or accepted.
Cost per reverted change: spend attached to work that later required rollback.
Cost per developer: user-level spend, including chat, code review, and agent sessions.
Tokens per session: useful for detecting long-context drift.
Subagent share: percentage of spend from subagents or auxiliary work.
Retry cost: spend consumed after the first failed attempt.
Review workflow cost: AI Credits plus Actions minutes where Copilot code review runs on private repositories.
Budget burn rate: usage pace compared with the monthly credit or spend pool.
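Several of these metrics are short formulas once the inputs exist. The functions below are a sketch under stated assumptions: spend is already normalized to dollars, spend by source uses the main, subagent, and auxiliary labels mentioned earlier, and burn rate is defined here as the fraction of the pool used divided by the fraction of the month elapsed.

```python
def cost_per_accepted_pr(total_usd, accepted_prs):
    """Agent spend divided by accepted PRs; None when nothing was accepted."""
    if accepted_prs == 0:
        return None
    return total_usd / accepted_prs


def subagent_share(usd_by_source):
    """Fraction of spend that did not come from the main agent."""
    total = sum(usd_by_source.values())
    if total == 0:
        return 0.0
    return (total - usd_by_source.get("main", 0.0)) / total


def retry_cost(attempt_costs):
    """Spend on every attempt after the first one."""
    return sum(attempt_costs[1:])


def burn_rate(spent, pool, day_of_month, days_in_month):
    """Above 1.0 means the pool is on pace to run out before month end."""
    return (spent / pool) / (day_of_month / days_in_month)
```

For example, spending half of the pool by day 10 of a 30-day month gives a burn rate of about 1.5, which is an early warning well before the pool is empty.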

Governance checklist for coding agent cost monitoring

Define internal cost units that normalize tokens, credits, dollars, and workflow minutes.
Set default budgets by team, user, and environment before broad rollout.
Use hard stops where providers support them, and treat soft budgets as alerts only.
Attach repository, pull request, workflow, and task metadata to every agent run.
Set model and reasoning defaults by task category.
Cap subagent count, recursion depth, and retry attempts.
Alert on abnormal session length, token spikes, and fast budget burn.
Reconcile local counters with provider billing exports.
Report cost per outcome, not only total spend.
Create an override process for expensive tasks that are justified by business value.
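The reconciliation item in the checklist can be automated with a simple comparison. This sketch assumes both the local counters and the provider billing export have been reduced to dollar totals per key (user, team, or project); the 5% tolerance is an arbitrary example value.

```python
def find_counter_drift(local_usd, provider_usd, tolerance=0.05):
    """Return keys whose local total differs from the billed total.

    local_usd, provider_usd: dicts mapping a key to a dollar total.
    tolerance: allowed relative difference, measured against the billed amount.
    A key missing from one side is treated as zero on that side.
    """
    drift = {}
    for key in sorted(set(local_usd) | set(provider_usd)):
        local = local_usd.get(key, 0.0)
        billed = provider_usd.get(key, 0.0)
        baseline = max(abs(billed), 1e-9)  # avoid dividing by zero
        if abs(local - billed) / baseline > tolerance:
            drift[key] = (local, billed)
    return drift
```

Keys that appear only in the provider export deserve particular attention, since they represent spend that local telemetry never saw.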

Buying criteria for observability and gateway tooling

If you are evaluating tooling, do not stop at "shows token usage." That is table stakes. The useful questions are more specific:

Can it attribute spend to repository, pull request, workflow, developer, model, and agent mode?
Can it separate main agent, subagent, and auxiliary work?
Can it enforce hard limits before a runaway task consumes the pool?
Can it join billing data with CI, version control, and review outcomes?
Can it normalize OpenAI usage, Anthropic spend, GitHub AI Credits, and marketplace charges?
Can finance use it for chargeback or showback without manual spreadsheet work?

The strategic trade-off is clear. If coding agents remain a small experiment, provider dashboards may be enough. Once they become part of daily engineering workflows, the dashboard has to answer operational questions, not only billing questions.

The practical standard is cost per useful engineering result. Track tokens and credits because they explain the bill. Track repositories, pull requests, retries, subagents, and accepted work because they explain whether the bill was worth paying.