Long-running background AI agents need durable workers

Long-running background AI agents should be designed as durable workers, not as always-on chat sessions. That is the architecture decision that determines almost everything else: reliability, cost, security, review, and whether a platform team can debug the system after a bad run.

The useful mental model is not "keep the model thinking until it finishes." It is "create a durable job, persist state after each meaningful step, wake workers when there is work, pause when the system needs a human, and produce a reviewable artifact at the end."

That distinction matters for platform engineers and devtool founders because the category is moving from demo chatbots into real software delivery workflows. GitHub Copilot cloud agent can work in the background in an ephemeral GitHub Actions environment, create branches and pull requests, and run tests. OpenAI Codex runs tasks independently in isolated repository-loaded environments and returns logs, citations, and changes for review. OpenAI Background mode exposes the same pattern at the API level: run work asynchronously, poll status by ID, and cancel when needed.

Those are not long HTTP requests. They are task systems.

The decision: durable worker or chat loop

You have two basic implementation paths.

| Choice | What it optimizes for | Where it breaks |
| --- | --- | --- |
| Always-on chat loop | Interactive exploration, user-attended sessions, fast iteration | Long waits, retries, lost connections, unclear side effects, poor auditability |
| Durable worker | Schedules, queues, repo events, resumable state, approval gates, review artifacts | Requires explicit state design, idempotency, cost budgets, and operational controls |

If the task can run after the user leaves, survive process restarts, wait for CI or approval, and create a branch, PR, report, or ticket update, it belongs on the durable worker side. Long-running background AI agents need durable job identity, persisted state, status, cancellation, bounded execution units, and a terminal artifact. Without those pieces, the product may look autonomous but operate like an unreliable script.

Cloudflare's long-running agent docs make the same architectural point in different terms: durable state, schedules, SQL data, and fiber checkpoints can survive hibernation or restarts, while in-memory variables, timers, open fetches, and closures do not. The agent should wake, work, and hibernate. It should not keep compute alive while waiting for a webhook, a person, or a scheduled retry.
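
The wake, work, hibernate cycle can be sketched in a few lines. This is a minimal illustration that assumes SQLite as the durable store. The table layout and the function names (`open_store`, `create_job`, `wake_and_work`) are hypothetical and are not Cloudflare's API. The point is only that each wake-up loads state from storage, performs one bounded step, and writes state back before the process is allowed to disappear.

```python
import json
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the durable store. Use a file path (or a real database) to survive restarts."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "job_id TEXT PRIMARY KEY, status TEXT NOT NULL, state TEXT NOT NULL)"
    )
    return conn


def create_job(conn: sqlite3.Connection, job_id: str, steps: list) -> None:
    """Persist a new job with its full list of pending steps."""
    state = {"pending": list(steps), "done": []}
    conn.execute(
        "INSERT INTO jobs (job_id, status, state) VALUES (?, 'queued', ?)",
        (job_id, json.dumps(state)),
    )
    conn.commit()


def wake_and_work(conn: sqlite3.Connection, job_id: str) -> str:
    """Load state, run at most one step, persist the result, return the new status."""
    row = conn.execute(
        "SELECT status, state FROM jobs WHERE job_id = ?", (job_id,)
    ).fetchone()
    if row is None:
        raise KeyError(job_id)
    status, state = row[0], json.loads(row[1])
    if status in ("succeeded", "cancelled"):
        return status  # terminal: nothing left to do
    if state["pending"]:
        step = state["pending"].pop(0)
        state["done"].append(step)  # placeholder for the real work of this step
    status = "running" if state["pending"] else "succeeded"
    conn.execute(
        "UPDATE jobs SET status = ?, state = ? WHERE job_id = ?",
        (status, json.dumps(state), job_id),
    )
    conn.commit()
    return status
```

Nothing the worker needs lives in a Python variable between calls, so a different process can pick up the same `job_id` after a restart and continue from the last persisted step.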

The practical center is repo maintenance

The credible first market is not open-ended software engineering. It is bounded repo maintenance and workflow automation.

Good background coding agents handle tasks such as nightly failing-test repair, dependency update triage, issue labeling, codebase scans, release-note drafts, security alert first-pass fixes, documentation updates, test generation, and low-to-medium-complexity pull requests. GitHub Copilot automations explicitly support recurring schedules and repo-event triggers. Their examples include nightly failing-test fixes and weekly release-note PRs.

That is the right level of ambition. These tasks have known entry points, bounded outputs, existing review surfaces, and often a natural artifact: a pull request, a report, a label change, or a comment. They are still hard, but they are easier to constrain than "improve the codebase" or "handle customer operations."

The field evidence points in the same direction. In a Hacker News discussion of GitHub Copilot Coding Agent, enthusiasm centered on asynchronous PR output, while skepticism focused on missing denominator metrics such as failed runs and human takeovers. A Reddit ExperiencedDevs thread showed the same buyer pull: developers want an agent that monitors GitHub issues, creates a branch and PR, then pings for review. The reported friction was not a lack of demand. It was setup burden, monitoring, restarting, and review effort.

The implication is direct: your first product surface should probably look less like a chatbot and more like a job dashboard tied to GitHub, CI, issue trackers, and approval flows.

Runtime architecture for long-running background AI agents

A production background agent runtime has seven core parts:

  • Trigger: schedule, queue message, webhook, issue assignment, PR comment, security alert, or explicit user delegation.

  • Durable job record: task ID, tenant, repo, branch, status, requested goal, budget, permissions, and cancellation state.

  • Queue and scheduler: concurrency limits, rate limits, priorities, dedupe keys, retry policy, and tenant fairness.

  • Sandbox: isolated execution environment with controlled filesystem writes, secrets, network access, and tool allowlists.

  • Model and tool loop: bounded steps that inspect code, call tools, edit files, run tests, and update structured state.

  • Human gate: persistent pause state for approval, rejection, edited arguments, timeout, and audit logging.

  • Review artifact: PR, diff, logs, test output, plan, trace, report, or release note draft.
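
As one possible shape for the durable job record, the sketch below models the fields listed above plus an explicit status machine. The class names, the status values, and the transition table are illustrative assumptions, not taken from any of the products mentioned in this post.

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Status(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    WAITING_FOR_HUMAN = "waiting_for_human"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELLED = "cancelled"


# Allowed transitions (an assumption for this sketch); terminal states have no exits.
TRANSITIONS = {
    Status.QUEUED: {Status.RUNNING, Status.CANCELLED},
    Status.RUNNING: {
        Status.WAITING_FOR_HUMAN,
        Status.SUCCEEDED,
        Status.FAILED,
        Status.CANCELLED,
    },
    Status.WAITING_FOR_HUMAN: {Status.RUNNING, Status.FAILED, Status.CANCELLED},
    Status.SUCCEEDED: set(),
    Status.FAILED: set(),
    Status.CANCELLED: set(),
}


@dataclass
class JobRecord:
    tenant: str
    repo: str
    branch: str
    goal: str
    max_cost_usd: float  # budget for the whole run
    permissions: frozenset = frozenset()
    job_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    status: Status = Status.QUEUED
    cancel_requested: bool = False
    artifact_url: Optional[str] = None  # PR, report, or other review artifact

    def transition(self, new_status: Status) -> None:
        """Move to a new status, rejecting transitions the table does not allow."""
        if new_status not in TRANSITIONS[self.status]:
            raise ValueError(f"{self.status.value} -> {new_status.value} is not allowed")
        self.status = new_status
```

Making the transitions explicit means a worker that wakes up with stale information cannot silently move a cancelled or finished job back to running.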

Temporal, Inngest, Cloudflare Agents and Workflows, Trigger.dev, Hatchet, and LangGraph persistence each supply different parts of that model. Temporal contributes workflow and activity separation, task queues, retries, timeouts, signals, and queries. Inngest exposes durable steps, cached step results, sleep, wait-for-event, flow control, idempotency, and observability. Trigger.dev offers scheduled and long-running background tasks with realtime status, waits, retries, and observability. Hatchet provides queues, retries, monitoring, alerting, logging, durability, and checkpointing. LangGraph contributes graph state persistence, checkpointers, interrupts, and resume semantics.

The platform choice is less important than the contract: every long-running unit needs a durable status record, every external side effect needs protection against duplication, and every human-facing result needs evidence that a reviewer can inspect.

Queue semantics decide whether costs stay bounded

A plain queue is not enough. It can ensure that a worker eventually picks up a message, but agent work has different failure modes. It is slow, bursty, expensive, and often composed of nested tool calls.

For long-running background AI agents, the queue layer needs:

  • Dedupe and idempotency keys for each run and each side-effecting step.

  • Per-step retries instead of only retrying the entire job.

  • Global retry budgets so SDK retries, workflow retries, queue retries, CI retries, and provider retries do not multiply.

  • Concurrency limits by tenant, repository, tool, and provider.

  • Rate limits and backoff with jitter for model calls, GitHub APIs, CI, and package registries.

  • Priority and fairness controls so one tenant cannot saturate the fleet.

  • Cancellation that propagates to workers, tool calls, and pending approvals.

  • Visible run state for users and operators.
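
Two items in the list above, the global retry budget and backoff with jitter, are easy to get wrong, so here is a hedged sketch of both. `RetryBudget`, `backoff_delay`, and `call_with_budget` are hypothetical names. The jitter variant shown is "full jitter" (a uniform delay between zero and the capped exponential), which is one common choice among several.

```python
import random
import time


class RetryBudget:
    """A single counter shared by every retry layer in one run."""

    def __init__(self, max_retries: int):
        self.remaining = max_retries

    def try_spend(self) -> bool:
        """Consume one retry if any are left; return False when the budget is exhausted."""
        if self.remaining <= 0:
            return False
        self.remaining -= 1
        return True


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter exponential backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_budget(fn, budget: RetryBudget, sleep=time.sleep):
    """Call fn, retrying on any exception until it succeeds or the shared budget runs out."""
    attempt = 0
    while True:
        try:
            return fn()
        except Exception:
            if not budget.try_spend():
                raise  # budget exhausted: surface the failure instead of retrying again
            sleep(backoff_delay(attempt))
            attempt += 1
```

Passing the same `RetryBudget` instance to every layer that might retry (model client, tool wrapper, workflow step) is what keeps retries additive rather than multiplicative.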

This is where many prototypes fail. A retry that is harmless for a pure function can be expensive or dangerous for an agent. Re-running a code search is fine. Re-running a package publish, issue mutation, external API call, or branch push needs an idempotency key, a preflight check, or an explicit human gate.
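
A minimal sketch of the idempotency-key idea, assuming the key is derived from the run ID, the step name, and the step arguments. `SideEffectLedger` is a hypothetical in-memory stand-in for what would be a durable table in practice.

```python
import hashlib
import json


def idempotency_key(run_id: str, step_name: str, args: dict) -> str:
    """Derive a stable key: the same run, step, and arguments always hash the same."""
    payload = json.dumps({"run": run_id, "step": step_name, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class SideEffectLedger:
    """In-memory stand-in (an assumption) for a durable table keyed by idempotency key."""

    def __init__(self):
        self._results = {}

    def run_once(self, key: str, effect):
        """Run the side effect only if this key has no recorded result yet."""
        if key in self._results:
            return self._results[key]  # replayed step: reuse the recorded result
        result = effect()
        self._results[key] = result
        return result
```

A ledger like this only covers retries that happen after the result was recorded. A crash between the effect and the write can still duplicate work, which is why the key should also be passed to the downstream API where it supports one, or paired with a preflight check such as "does this branch already exist".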

A public LangGraph Cloud issue reported long tool calls being re-dispatched from checkpoints, creating duplicate work and cost. Treat that as a design warning, not a verdict on one framework. Checkpoints help resume state. They do not automatically give you heartbeats, failure detection, side-effect safety, or cost containment.

Checkpointing is necessary, not sufficient

Checkpointing solves one part of the problem: preserving state across interruptions. LangGraph checkpointers persist thread-scoped graph state. Interrupts can pause graph execution and resume later with a command and thread ID. Hatchet durable tasks write checkpoints to an event log and replay from checkpoints. Inngest's durable steps cache completed step results and retry independently.

But checkpointing is not the same as durable execution. You still need supervision, timeout policy, retry policy, queue management, heartbeat behavior, cancellation, and side-effect rules.

Temporal's LangGraph integration is useful because it makes the boundary explicit. Workflow code must be deterministic. I/O, LLM calls, long-running work, and failure-prone operations should be modeled as Activities with retries and timeouts. That is a clean principle for agent infrastructure even if you are not using Temporal: keep orchestration deterministic where possible, push uncertain work into bounded activities, and make each activity observable.

The same rule applies to context management. Do not rely on replaying a giant conversation after a six-hour wait. Persist the plan, current branch, tool outputs, test evidence, pending decisions, and compact continuation state. The agent should be able to resume from structured state, not from an implicit chat transcript.
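
One way to picture "structured state instead of a transcript" is a small serializable record. The field names below are assumptions chosen to mirror the items in the paragraph above, and `ContinuationState` is a hypothetical class, not a LangGraph or Temporal type.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class ContinuationState:
    plan: list                      # ordered step names
    branch: str                     # working branch for this run
    completed_steps: list = field(default_factory=list)
    test_evidence: dict = field(default_factory=dict)   # label -> summarized test output
    pending_decisions: list = field(default_factory=list)
    summary: str = ""               # compact notes that replace the raw transcript

    def to_json(self) -> str:
        """Serialize to a string that can be stored alongside the job record."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "ContinuationState":
        """Rebuild the state after a wait, a restart, or a handoff to another worker."""
        return cls(**json.loads(raw))

    def next_step(self) -> Optional[str]:
        """Return the first planned step that has not been completed, if any."""
        for step in self.plan:
            if step not in self.completed_steps:
                return step
        return None
```

After a six-hour wait, the worker can rebuild its prompt from `summary`, `test_evidence`, and `next_step()` rather than replaying every earlier message.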

Human-in-the-loop is infrastructure

Human-in-the-loop agents are often described as if approval were just a button. In production, approval is a state machine.

A real approval gate includes who requested the action, what the agent wants to do, why it believes the action is needed, what arguments or diff will be applied, who approved it, whether the approver edited the arguments, how validation ran, when the approval expires, and what happens on timeout.
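
To make the state-machine framing concrete, here is a hedged sketch of an approval request that records the requester, the arguments, the decision, edits, validation, and expiry. The class, its states, and the `decide` signature are illustrative assumptions rather than any framework's API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ApprovalRequest:
    requested_by: str
    action: str
    reason: str
    arguments: dict
    expires_at: float               # timestamp after which the request is void
    state: str = "pending"          # pending -> approved | rejected | expired
    decided_by: Optional[str] = None
    arguments_edited: bool = False
    audit_log: list = field(default_factory=list)

    def decide(self, approver: str, approve: bool, now: float,
               edited_arguments: Optional[dict] = None,
               validate=lambda args: True) -> str:
        """Apply one decision; expired or already-decided requests cannot be approved."""
        if self.state != "pending":
            raise ValueError(f"request already {self.state}")
        if now >= self.expires_at:
            self.state = "expired"
            self.audit_log.append(("expired", None, now))
            return self.state
        if approve and edited_arguments is not None:
            # Edited arguments are re-validated before they replace the originals.
            if not validate(edited_arguments):
                raise ValueError("edited arguments failed validation")
            self.arguments = edited_arguments
            self.arguments_edited = True
        self.state = "approved" if approve else "rejected"
        self.decided_by = approver
        self.audit_log.append((self.state, approver, now))
        return self.state
```

The timeout path is deliberately a state of its own: an expired request is neither an approval nor a rejection, and the job record can decide separately whether expiry fails the run or re-queues the request.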

LangChain's human-in-the-loop guidance recommends persisting interrupt state, logging decisions, validating edited arguments, and setting timeouts. GitHub Copilot automations preserve review controls by requiring approval before Actions workflows run on agent-created PRs. Cloudflare's human-in-the-loop docs separate workflow approvals from MCP elicitation, which is a useful product distinction: sometimes a person approves an operation, and sometimes the agent needs missing input.

For platform builders, the product surface should make the pending state obvious. A reviewer should see the goal, plan, diff, tests, tool calls, cost so far, permission being requested, and the consequence of approving. If that data is spread across logs, chat messages, and CI output, review becomes guesswork.

Sandboxing is the baseline control

A background agent should start with less authority than a developer. It can earn temporary authority through scoped tools and approval gates.

OpenAI Codex runs tasks in isolated environments preloaded with the repository. GitHub Copilot cloud agent uses an ephemeral GitHub Actions-powered environment. OpenAI's Windows sandbox work highlights why OS-enforced constraints matter: useful coding agents need controlled filesystem writes and network limits.

For a platform team, the minimum control set is concrete:

  • Repository-scoped credentials and least-privilege GitHub permissions.

  • Tool allowlists by task type and tenant.

  • Network egress allowlists, especially for code and secret-bearing environments.

  • Secret injection only when required, with redaction in logs and traces.

  • Filesystem boundaries and cleanup policy for workspaces.

  • Max duration, max attempts, max tokens or credits, and max CI minutes.

  • Retention policy for prompts, traces, code context, logs, and generated artifacts.
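
Several of these controls reduce to a policy object that is checked before every tool call and after every step. The sketch below is one hypothetical shape for that check. The field names, tool names, and limits are assumptions for illustration, and real enforcement of network and filesystem boundaries belongs in the sandbox, not in application code alone.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunPolicy:
    allowed_tools: frozenset
    allowed_hosts: frozenset        # network egress allowlist
    max_duration_s: int
    max_attempts: int
    max_tokens: int
    max_ci_minutes: int


@dataclass
class Usage:
    duration_s: float = 0
    attempts: int = 0
    tokens: int = 0
    ci_minutes: float = 0


def check_tool_call(policy: RunPolicy, tool: str, host: Optional[str] = None) -> bool:
    """Allow a call only if the tool, and any host it contacts, is on the allowlist."""
    if tool not in policy.allowed_tools:
        return False
    if host is not None and host not in policy.allowed_hosts:
        return False
    return True


def exceeded_limits(policy: RunPolicy, usage: Usage) -> list:
    """Return the names of every limit the run has gone past, in a fixed order."""
    limits = {
        "duration_s": policy.max_duration_s,
        "attempts": policy.max_attempts,
        "tokens": policy.max_tokens,
        "ci_minutes": policy.max_ci_minutes,
    }
    return [name for name, limit in limits.items() if getattr(usage, name) > limit]
```

A non-empty result from `exceeded_limits` is a natural trigger for moving the job to a failed or waiting-for-human status instead of letting it continue to spend.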

Scheduled agents and repo-event automations add another risk: untrusted input can trigger work. GitHub's default behavior of ignoring events from users without write access is an important mitigation pattern. If a stranger can open an issue that starts a privileged agent run, your prompt-injection threat model is already active.
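
A trigger filter in that spirit might look like the following. The event dictionary is a hypothetical normalized shape, not GitHub's webhook payload, and the sketch assumes the caller has already looked up the actor's repository permission level.

```python
# Repository permission levels treated as trusted (an assumption for this sketch).
TRUSTED_PERMISSIONS = frozenset({"write", "maintain", "admin"})


def should_start_run(event: dict) -> bool:
    """Start a privileged run only for actors with write-level access or higher.

    Missing or unknown permission values are treated as untrusted.
    """
    return event.get("actor_permission") in TRUSTED_PERMISSIONS
```

Failing closed on missing data matters here: an event that arrives without a resolved permission should be dropped or routed to a human, not run with default authority.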

Observability must include model actions

Traditional job observability is necessary but incomplete. You still need job status, worker health, retries, queue depth, latency, error rates, and alerts. But agent failures are often semantic. The process may succeed while the work is wrong.

That means traces should include model inputs and outputs where policy allows, tool calls, tool arguments, command output, test results, intermediate plans, approval decisions, branch and commit IDs, cost, token usage, retries, and skipped steps. For coding agents, test output citations and PR-ready diffs are not nice-to-have details. They are the review interface.
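
As a rough illustration of what one trace entry could carry, the sketch below records structured events as JSON lines and redacts secret-bearing fields before they are written. The event fields, the `kind` values, and the list of secret keys are assumptions, not a schema from any of the tools named in this post.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class TraceEvent:
    job_id: str
    step: int
    kind: str          # e.g. "model_call", "tool_call", "test_run", "approval"
    detail: dict       # arguments, command output, commit IDs, and similar evidence
    tokens: int = 0
    cost_usd: float = 0.0
    timestamp: float = field(default_factory=time.time)


def redact(detail: dict, secret_keys=("token", "password", "api_key")) -> dict:
    """Replace the values of known secret-bearing keys before anything is stored."""
    return {k: ("[REDACTED]" if k in secret_keys else v) for k, v in detail.items()}


class TraceLog:
    """Append-only list of JSON lines; a real system would write to durable storage."""

    def __init__(self):
        self.lines = []

    def record(self, event: TraceEvent) -> None:
        safe = asdict(event)
        safe["detail"] = redact(safe["detail"])
        self.lines.append(json.dumps(safe, sort_keys=True))

    def total_cost(self) -> float:
        """Sum the recorded cost across every event in the log."""
        return sum(json.loads(line)["cost_usd"] for line in self.lines)
```

Because every entry carries the job ID and step number, an operator can filter one run and read its tool calls, test results, and approvals in order.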

OpenAI's Codex announcement emphasizes task logs and test citations. Trigger.dev, Hatchet, Inngest, Cloudflare, and Temporal all position observability as part of the durable runtime. The shared lesson is simple: if an operator cannot reconstruct why the agent touched a file, opened a PR, or asked for approval, the system is not production-ready.

Build versus buy

The build-versus-buy decision depends on where you want differentiation.

| Scenario | Likely choice | Reason |
| --- | --- | --- |
| You need async coding help inside GitHub with PR review as the artifact | Productized coding agent | GitHub Copilot cloud agent or OpenAI Codex already package sandboxing, repo context, logs, tests, and PR workflows. |
| You are building agentic workflows into your SaaS product | Durable execution platform plus narrow LLM steps | Temporal, Inngest, Trigger.dev, Hatchet, Cloudflare Workflows, or similar runtimes handle schedules, queues, retries, waits, and status. |
| You need custom graph state, interrupts, and agent memory | LangGraph plus a durable runtime | Checkpointers and interrupts help with state, but long-running and side-effecting work still needs durable execution controls. |
| You are differentiating on infrastructure itself | Build the control plane | The core product is scheduling, sandboxing, governance, observability, artifact review, and integration depth. |

Anthropic's guidance to start with simple composable patterns before increasing autonomy is a good default. The more deterministic the task, the more it should look like a workflow with narrow model calls. Use autonomy where it changes the outcome: exploring an unfamiliar code path, choosing a repair strategy, drafting a migration, summarizing test failures, or deciding which files need edits.

The less deterministic the task, the more the platform must constrain permissions, budget, and review.

When not to use background agents

Long-running background AI agents are a poor fit for vague, high-blast-radius work. They are also a poor fit when the codebase has weak tests, the action is irreversible, the security surface is broad, the approval process is unclear, or the organization cannot tolerate stored prompts, traces, and code context.

OpenAI Background mode requires stored state and is not compatible with Zero Data Retention. GitHub notes that cloud agent sessions consume GitHub Actions minutes and AI credits. Those are not footnotes. They are platform requirements: budget, retention, tenant isolation, and audit policy have to be designed before scale.

The wrong task is "go improve this service." The right task is "on this schedule, inspect these failing tests, propose the smallest fix on a branch, run this test suite, and open a PR that requires review before Actions run." The difference is not only prompt quality. It is architecture.

The operating principle

Treat the agent as a durable worker with a model inside it. Give it a task ID, a queue, a sandbox, scoped tools, persisted state, retry budgets, approval gates, logs, traces, cost caps, and an artifact that a person can review.

That design is less glamorous than an always-on autonomous loop. It is also the version that can survive CI delays, provider failures, human review, process restarts, and cost controls. For platform engineers and devtool founders, that is the decision that separates a demo from infrastructure.
