AI agent observability for safe coding agent rollouts
When you roll out coding agents, the question is not only whether the model answered well. The operational question is what the agent did to your repository, runtime, credentials, CI system, and review workflow.
That is the core of AI agent observability for coding agents. You need a run-level record that shows what the agent was asked to do, what context it saw, which model calls it made, which tools it used, which commands it ran, what files changed, what tests ran, what failed, what it retried, what it cost, and who approved the result.
General LLM tracing is useful, but it is not enough. A coding agent can mutate source code, install dependencies, call MCP servers, open pull requests, and run inside environments that may expose secrets. Observability has to cover those side effects directly.
Start with the run record
The minimum useful unit is an agent_run trace. Treat it as the root span for the whole session, not as a wrapper around a single model call.
Under that root span, capture child spans for planning, LLM calls, tool calls, file operations, shell commands, MCP or API calls, retries, evaluations, tests, security scans, pull request creation, and human review. This gives platform teams a single tree that connects intent, execution, evidence, and approval.
For coding agents, the trace tree should include separate span types for bash_command, file_read, file_write, patch_apply, git_diff, test_run, lint_run, dependency_install, network_call, secret_scan, code_scan, commit, and pull_request.
This is where coding agent observability differs from standard LLM tracing. Standard tracing can tell you what happened inside the model call. Coding agent observability tells you what happened to the repo and the delivery system because the agent was allowed to act.
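The trace tree described above can be sketched with plain data classes. This is a minimal illustration, not a tracing SDK: the `Span` class, its field names, and `start_agent_run` are hypothetical, and a real rollout would more likely map these onto OpenTelemetry or OpenInference spans.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List, Optional


@dataclass
class Span:
    """One node in the run trace; `kind` is a span type such as bash_command."""
    kind: str
    name: str
    trace_id: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    started_at: float = field(default_factory=time.time)
    ended_at: Optional[float] = None
    status: str = "unset"
    attributes: Dict[str, Any] = field(default_factory=dict)
    children: List["Span"] = field(default_factory=list)

    def child(self, kind: str, name: str, **attributes: Any) -> "Span":
        """Create a child span that shares this span's trace ID."""
        span = Span(kind=kind, name=name, trace_id=self.trace_id,
                    parent_id=self.span_id, attributes=dict(attributes))
        self.children.append(span)
        return span

    def finish(self, status: str = "ok") -> None:
        self.ended_at = time.time()
        self.status = status

    def walk(self) -> Iterator["Span"]:
        """Yield this span and every descendant, depth first."""
        yield self
        for span in self.children:
            yield from span.walk()


def start_agent_run(task: str) -> Span:
    """The root span covers the whole session, not a single model call."""
    return Span(kind="agent_run", name=task, trace_id=uuid.uuid4().hex)


run = start_agent_run("fix flaky test in payments module")
run.child("planning", "draft plan").finish()
cmd = run.child("bash_command", "pytest -q", cwd="/workspace")
cmd.finish(status="error")
run.child("file_write", "tests/test_payments.py", bytes_written=412).finish()
run.finish()

kinds = [span.kind for span in run.walk()]
```

Because every child carries the root's `trace_id`, a failed `bash_command` can be queried alongside the file writes and model calls from the same session.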
Instrument model calls, but do not stop there
For each LLM span, record the provider, model, prompt or template version, prompt fingerprints, context or document IDs, input tokens, output tokens, reasoning tokens, cached tokens, latency, status, stop reason, error, guardrail result, cost estimate, and sampling decision.
These fields line up with the direction of current tracing systems. The OpenTelemetry GenAI semantic conventions cover model, token, and content-capture attributes. OpenInference defines span kinds for LLMs, agents, chains, retrievers, rerankers, and tools. OpenAI Agents SDK tracing covers LLM generations, tool calls, handoffs, guardrails, and custom events. Langfuse and LangSmith expose trace trees with prompts, responses, token usage, costs, latency, tools, retrieval steps, metadata, and evaluation data.
The practical point is simple: use those standards where they fit, then add coding-specific evidence around them. A trace that knows the model and token count but not the resulting diff is incomplete for a coding agent.
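As a rough sketch of the LLM span fields listed above, the helpers below build a prompt fingerprint and a cost estimate. Everything here is an assumption made for illustration: the attribute names, the `example-model` name, the per-million-token prices, and the rule that cached tokens are a subset of input tokens. Real pricing and token accounting vary by provider.

```python
import hashlib

# Hypothetical prices in USD per million tokens; not taken from any provider.
PRICES = {"example-model": {"input": 3.00, "output": 15.00, "cached": 0.30}}


def prompt_fingerprint(template_id: str, template_version: str, rendered: str) -> str:
    """Identify a prompt without storing its text."""
    digest = hashlib.sha256(rendered.encode("utf-8")).hexdigest()
    return f"{template_id}@{template_version}:{digest[:16]}"


def estimate_cost(model: str, input_tokens: int, output_tokens: int,
                  cached_tokens: int = 0) -> float:
    """Assumes cached tokens are counted inside input_tokens."""
    price = PRICES[model]
    uncached = input_tokens - cached_tokens
    total = (uncached * price["input"]
             + cached_tokens * price["cached"]
             + output_tokens * price["output"])
    return total / 1_000_000


def llm_span_attributes(model: str, rendered_prompt: str, input_tokens: int,
                        output_tokens: int, latency_ms: float, stop_reason: str) -> dict:
    """Metadata-only attributes for one LLM span."""
    return {
        "model": model,
        "prompt_fingerprint": prompt_fingerprint("fix-bug", "v3", rendered_prompt),
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_ms": latency_ms,
        "stop_reason": stop_reason,
        "cost_usd": estimate_cost(model, input_tokens, output_tokens),
    }


attrs = llm_span_attributes("example-model", "Fix the failing test in utils.py",
                            input_tokens=1200, output_tokens=300,
                            latency_ms=840.0, stop_reason="end_turn")
```

The fingerprint lets two runs be compared by prompt identity while the prompt text itself stays out of the trace.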
Capture tool calls as operational events
Every tool span should carry the tool name, normalized action type, caller agent, redacted argument digest or approved arguments, start time, end time, result status, retry count, output digest, truncated output policy, side effects, approval source, and policy decision.
This matters because tool calls are where intent becomes action. A model suggestion is low-risk until it becomes a shell command, file write, dependency install, network request, or pull request.
For coding agents, classify tool actions by side effect. Reading a file, writing a file, deleting a file, changing a lockfile, editing a workflow, calling an external domain, and opening a PR should not look the same in your telemetry. They carry different review and alert requirements.
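One way to make those actions look different in telemetry is to normalize each tool call into a side-effect class before it is recorded. The sketch below is illustrative only: the action names, the `SideEffect` values, the lockfile list, and the `REVIEW_REQUIRED` set are assumptions, and the workflow check covers only the GitHub Actions path.

```python
from enum import Enum
from pathlib import PurePosixPath
from typing import Optional


class SideEffect(Enum):
    READ = "read"
    WRITE = "write"
    DELETE = "delete"
    LOCKFILE_CHANGE = "lockfile_change"
    WORKFLOW_EDIT = "workflow_edit"
    EXTERNAL_NETWORK = "external_network"
    PULL_REQUEST = "pull_request"


# Illustrative, not exhaustive.
LOCKFILES = {"package-lock.json", "yarn.lock", "poetry.lock", "Cargo.lock", "go.sum"}

# Side effects that should trigger stricter review or alerting (an assumption).
REVIEW_REQUIRED = {SideEffect.DELETE, SideEffect.LOCKFILE_CHANGE,
                   SideEffect.WORKFLOW_EDIT, SideEffect.EXTERNAL_NETWORK}


def classify(action: str, path: Optional[str] = None) -> SideEffect:
    """Map a normalized tool action to a side-effect class."""
    if action == "file_read":
        return SideEffect.READ
    if action == "file_delete":
        return SideEffect.DELETE
    if action == "network_call":
        return SideEffect.EXTERNAL_NETWORK
    if action == "pull_request":
        return SideEffect.PULL_REQUEST
    if action == "file_write":
        parts = PurePosixPath(path or "").parts
        if parts[:2] == (".github", "workflows"):
            return SideEffect.WORKFLOW_EDIT
        if parts and parts[-1] in LOCKFILES:
            return SideEffect.LOCKFILE_CHANGE
        return SideEffect.WRITE
    raise ValueError(f"unknown action: {action}")
```

Raising on unknown actions is deliberate in this sketch: an unclassified tool call is a telemetry gap, not a harmless default.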
Workspace evidence is not optional
A coding agent run should leave durable workspace evidence. Capture the repo URL, base SHA, worktree status before and after the run, touched files, patch or diff, deleted files, generated files, dependency lockfile changes, workflow changes, generated test output, and final PR URL.
Do not rely only on final commits. Agents can fail mid-task, lose context, abandon uncommitted work, or change files that never reach a pull request. The observable unit is the session, not only the final branch.
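Uncommitted work can be recorded by snapshotting the worktree before and after the run. The sketch below parses the output of `git status --porcelain` (version 1 format) into touched, deleted, and untracked paths. The function name and the output keys are hypothetical, the sample text is made up, and quoted paths (for example, names containing spaces) are not handled.

```python
from typing import Dict, List


def parse_porcelain(output: str) -> Dict[str, List[str]]:
    """Group `git status --porcelain` lines by what happened to each path."""
    evidence: Dict[str, List[str]] = {"modified": [], "deleted": [], "untracked": []}
    for line in output.splitlines():
        if len(line) < 4:
            continue
        # Two status characters, one space, then the path. Do not strip the
        # line first: a leading space is part of the status code.
        status, path = line[:2], line[3:]
        if status[0] in ("R", "C") and " -> " in path:
            path = path.split(" -> ", 1)[1]  # keep the new name of a rename or copy
        if status == "??":
            evidence["untracked"].append(path)
        elif "D" in status:
            evidence["deleted"].append(path)
        else:
            evidence["modified"].append(path)
    return evidence


# Made-up sample of what the command might print after an agent session.
SAMPLE = " M src/app.py\nD  legacy/old.py\n?? out/report.txt\nR  a.py -> b.py\n"
snapshot = parse_porcelain(SAMPLE)
```

Storing this snapshot with the base SHA (for example, from `git rev-parse HEAD`) at session start and end preserves evidence even when the agent never commits.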
Command logs need similar care. Store the command, current working directory, shell, exit code, duration, stdout and stderr truncation policy, environment allowlist, denied environment values, network destination summary, and whether the command was user-approved, policy-approved, or auto-run.
The exact command output may include secrets, proprietary code, or customer data. That does not mean you skip command observability. It means you make capture tiered, redacted, access-controlled, and explicit.
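A command record along those lines might look like the sketch below. It assumes a POSIX-like runner, runs the command without a shell, passes only allowlisted environment variables, and truncates captured output. The allowlist, the 2000-character limit, and the record field names are illustrative choices, not recommendations.

```python
import os
import subprocess
import sys
import time
from typing import List, Tuple

ENV_ALLOWLIST = ("PATH", "HOME", "LANG")  # illustrative
MAX_OUTPUT_CHARS = 2000                   # illustrative truncation policy


def truncate(text: str, limit: int = MAX_OUTPUT_CHARS) -> Tuple[str, bool]:
    """Return the possibly shortened text and whether it was cut."""
    return (text, False) if len(text) <= limit else (text[:limit], True)


def run_logged(argv: List[str], cwd: str, approval: str) -> dict:
    """Run a command and return a record suitable for a bash_command span."""
    env = {k: v for k, v in os.environ.items() if k in ENV_ALLOWLIST}
    # Record only the names of withheld variables, never their values.
    denied = sorted(k for k in os.environ if k not in ENV_ALLOWLIST)
    started = time.monotonic()
    proc = subprocess.run(argv, cwd=cwd, env=env, capture_output=True, text=True)
    stdout, out_cut = truncate(proc.stdout)
    stderr, err_cut = truncate(proc.stderr)
    return {
        "argv": argv,
        "cwd": os.path.abspath(cwd),
        "shell": None,  # executed directly, no shell involved
        "exit_code": proc.returncode,
        "duration_s": time.monotonic() - started,
        "stdout": stdout,
        "stdout_truncated": out_cut,
        "stderr": stderr,
        "stderr_truncated": err_cut,
        "env_allowlist": list(ENV_ALLOWLIST),
        "env_denied_names": denied,
        "approval": approval,  # "user-approved", "policy-approved", or "auto-run"
    }


record = run_logged([sys.executable, "-c", "print('x' * 5000)"],
                    cwd=".", approval="policy-approved")
```

In a tiered setup, the `stdout` and `stderr` fields would be replaced by digests unless full capture is explicitly enabled.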
Default to metadata, hash sensitive content
Prompt and context capture should be tiered. The default should be metadata, lengths, hashes, prompt IDs, template IDs, and context document IDs. Full prompt, response, file content, tool content, Bash output, and raw API body capture should be opt-in, time-limited, redacted, and visible to users.
Claude Code documentation is a useful example of this split: raw file contents and code snippets are not included in metrics or events by default, while prompt, response, tool detail, tool content, and raw API body capture are explicit controls. The same docs note that tool content capture can include raw Read results and Bash output, with truncation when enabled.
That design principle should apply beyond one product. Platform teams should treat full content capture as an incident or debugging mode, not as the ordinary operating mode for every developer session.
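The tiering can be sketched as a capture function that always emits length and hash, and includes content only while a time-limited, redacting policy is active. `CapturePolicy`, its fields, and the record keys are hypothetical names used for illustration.

```python
import hashlib
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class CapturePolicy:
    # Epoch seconds until which full capture is allowed; 0.0 means it is off.
    full_capture_until: float = 0.0
    # Full capture is refused unless a redaction function is supplied.
    redact: Optional[Callable[[str], str]] = None


def capture(content: str, policy: CapturePolicy, now: Optional[float] = None) -> dict:
    """Return a metadata-only record by default, full content only when allowed."""
    now = time.time() if now is None else now
    record = {
        "length": len(content),
        "sha256": hashlib.sha256(content.encode("utf-8")).hexdigest(),
        "tier": "metadata",
    }
    if policy.redact is not None and now < policy.full_capture_until:
        record["tier"] = "full"
        record["content"] = policy.redact(content)
    return record


default_record = capture("SELECT * FROM customers", CapturePolicy())
```

The hash still allows correlation (two runs saw the same prompt or file) without storing the text itself.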
There is also an adoption issue. Community discussions around agent telemetry show two competing needs: engineers want queryable traces for failed tools and orchestration problems, while developers worry about per-user monitoring. Use team, repo, and process metrics by default. Reserve named-user views for security, compliance, and incident workflows.
Meter cost by session, PR, team, and org
Cost observability should work at model-call, run, session, pull request, repo, team, and organization levels. Track input tokens, output tokens, cached tokens, reasoning tokens, tool or provider charges, retry waste, and runaway p99 sessions.
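A minimal sketch of that rollup, assuming each model call is already tagged with the IDs it belongs to. The key names (`run_id`, `pr`, `repo`, `team`, `cost_usd`) are hypothetical, and the p99 uses the nearest-rank method, which is one of several common percentile definitions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence


def rollup(calls: Iterable[dict],
           levels: Sequence[str] = ("run_id", "pr", "repo", "team")) -> Dict[str, Dict[str, float]]:
    """Sum per-call cost at every requested level."""
    totals = {level: defaultdict(float) for level in levels}
    for call in calls:
        for level in levels:
            key = call.get(level)
            if key is not None:
                totals[level][key] += call["cost_usd"]
    return {level: dict(values) for level, values in totals.items()}


def p99(values: List[float]) -> float:
    """Nearest-rank 99th percentile, computed with integer arithmetic."""
    if not values:
        raise ValueError("p99 of an empty list is undefined")
    ordered = sorted(values)
    rank = (99 * len(ordered) + 99) // 100  # equals ceil(0.99 * n)
    return ordered[rank - 1]


calls = [
    {"run_id": "r1", "pr": "pr-1", "repo": "api", "team": "payments", "cost_usd": 0.25},
    {"run_id": "r1", "pr": "pr-1", "repo": "api", "team": "payments", "cost_usd": 0.5},
    {"run_id": "r2", "pr": None, "repo": "api", "team": "payments", "cost_usd": 0.25},
]
totals = rollup(calls)
```

Runs that never produced a pull request still count toward repo and team totals, which is where abandoned sessions tend to hide.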
Token and cost dashboards are already common in LangSmith and Langfuse. For coding agents, make the unit of analysis more operational. A cheap run that opens a risky pull request is not better than an expensive run that produces a tested, reviewed, low-risk fix. Cost data needs to sit beside diff risk, test evidence, review outcome, and failure class.
Watch retry waste closely. Repeated retries without diff progress are a sign of context failure, tool mismatch, environment breakage, or a model stuck in a loop. That should alert before it turns into a budget issue.
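Retry waste can be detected by comparing a digest of the worktree diff after each attempt. The sketch below flags a session once a set number of consecutive attempts leave the diff unchanged. The function name and the default threshold of three are assumptions for illustration.

```python
from typing import Sequence


def stalled(diff_digests: Sequence[str], max_stalled_retries: int = 3) -> bool:
    """True if `max_stalled_retries` attempts in a row left the diff unchanged.

    `diff_digests` holds one digest of the worktree diff per attempt, in order.
    """
    streak = 0
    for previous, current in zip(diff_digests, diff_digests[1:]):
        streak = streak + 1 if current == previous else 0
        if streak >= max_stalled_retries:
            return True
    return False
```

For example, `stalled(["a", "a", "a", "a"])` is true because three retries in a row changed nothing, while `stalled(["a", "b", "b", "c"])` is false because the diff kept moving.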
Attach evaluations to traces
For coding agents, evaluation should not be a separate spreadsheet after the fact. Attach evals to the run trace.
Useful eval fields include task success, tests passed, diff risk, code-review findings, security scan status, style or maintainability score, tool misuse, groundedness against the issue requirements, and human override outcome.
This turns failures into regression material. If a session failed because the agent misread the issue, picked the wrong file, ignored a failing test, or edited a GitHub Actions workflow without approval, you need that failure class attached to the trace.
Use operational failure classes that engineering teams can act on: model reasoning failure, context retrieval failure, tool or API failure, shell or environment failure, dependency failure, test failure, merge conflict, policy block, permission block, security finding, cost or time budget breach, telemetry gap, user-review rejection, and interrupted run.
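The failure classes above can be encoded as an enumeration so that evals attached to traces stay queryable. This is only a sketch: the enum values, the `attach_eval` helper, and the shape of the trace dictionary are hypothetical.

```python
from enum import Enum
from typing import List, Optional


class FailureClass(Enum):
    MODEL_REASONING = "model_reasoning_failure"
    CONTEXT_RETRIEVAL = "context_retrieval_failure"
    TOOL_OR_API = "tool_or_api_failure"
    SHELL_OR_ENVIRONMENT = "shell_or_environment_failure"
    DEPENDENCY = "dependency_failure"
    TEST = "test_failure"
    MERGE_CONFLICT = "merge_conflict"
    POLICY_BLOCK = "policy_block"
    PERMISSION_BLOCK = "permission_block"
    SECURITY_FINDING = "security_finding"
    BUDGET_BREACH = "cost_or_time_budget_breach"
    TELEMETRY_GAP = "telemetry_gap"
    REVIEW_REJECTION = "user_review_rejection"
    INTERRUPTED = "interrupted_run"


def attach_eval(trace: dict, *, task_success: bool, tests_passed: bool,
                failure_class: Optional[FailureClass] = None, notes: str = "") -> dict:
    """Append an eval record to the trace it describes."""
    trace.setdefault("evals", []).append({
        "task_success": task_success,
        "tests_passed": tests_passed,
        "failure_class": failure_class.value if failure_class else None,
        "notes": notes,
    })
    return trace


def regression_candidates(traces: List[dict]) -> List[str]:
    """Trace IDs with at least one classified failure, usable as regression cases."""
    return [t["trace_id"] for t in traces
            if any(e["failure_class"] for e in t.get("evals", []))]


ok = attach_eval({"trace_id": "t1"}, task_success=True, tests_passed=True)
bad = attach_eval({"trace_id": "t2"}, task_success=False, tests_passed=False,
                  failure_class=FailureClass.CONTEXT_RETRIEVAL,
                  notes="agent edited the wrong file")
```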
Replay is forensic, not deterministic
Replay is valuable, but exact replay is rarely guaranteed. Model versions shift, provider behavior can change, external APIs move, dependency registries update, and tool outputs can be nondeterministic.
If you want useful reconstruction, store the repository snapshot or SHA, prompt and context identifiers, model and version, tool registry versions, external API response summaries, command outputs, environment image, and nondeterminism controls where available.
Set the expectation correctly: replay is forensic reconstruction plus selective rerun. It is not a promise that the agent will produce the same path and same diff every time.
Alert on risky side effects, not every noisy step
Agents make many benign calls. Alerting on every tool use will train people to ignore the system. Alert on events that change the risk profile of the run.
High-signal alert triggers include runaway spend, repeated retries without diff progress, denied tool loops, commands touching secrets or credentials, workflow or CI configuration edits, network calls to unknown domains, unexpected dependency installs, attempts to bypass policy, missing token or cost data, missing nested tool spans, and PRs opened without required evidence.
Security research from GMO Flatt and Microsoft shows why this matters. Coding agents in CI/CD can process untrusted GitHub issues or pull requests while holding access to file-read tools, workflow permissions, secrets, or external communication paths. Treat them as privileged automation exposed to adversarial text, not as ordinary developer assistants.
GitHub Copilot coding agent documentation and product posts point to one operational pattern: session logs, commits, pull requests, branch protections, and human approval before CI/CD workflows run. Agent self-review and security scans can help, but they do not replace human code ownership or branch protection.
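A few of the high-signal triggers listed earlier can be expressed as rules over a run summary. The sketch below is deliberately small: the summary keys, alert names, secret-file hints, and required-evidence set are all assumptions, and a production rule set would need per-repo policy.

```python
from typing import List

WORKFLOW_PREFIX = ".github/workflows/"
SECRET_HINTS = (".env", "credentials", "id_rsa")     # illustrative
REQUIRED_EVIDENCE = {"test_output", "security_scan"}  # illustrative


def alerts_for(run: dict) -> List[str]:
    """Return alert names for events that change the risk profile of a run."""
    alerts = []
    touched = run["touched_files"]
    if run["cost_usd"] > run["budget_usd"]:
        alerts.append("runaway_spend")
    if any(path.startswith(WORKFLOW_PREFIX) for path in touched):
        alerts.append("workflow_edit")
    if any(hint in path for path in touched for hint in SECRET_HINTS):
        alerts.append("secrets_touched")
    if set(run["network_domains"]) - set(run["allowed_domains"]):
        alerts.append("unknown_network_domain")
    if run["pr_opened"] and not REQUIRED_EVIDENCE <= set(run["evidence"]):
        alerts.append("pr_missing_evidence")
    if any(call.get("input_tokens") is None for call in run["llm_calls"]):
        alerts.append("missing_token_data")
    return alerts


quiet_run = {
    "cost_usd": 1.0, "budget_usd": 5.0, "touched_files": ["src/app.py"],
    "network_domains": ["pypi.org"], "allowed_domains": ["pypi.org"],
    "pr_opened": True, "evidence": ["test_output", "security_scan"],
    "llm_calls": [{"input_tokens": 900}],
}
risky_run = {
    "cost_usd": 12.0, "budget_usd": 5.0,
    "touched_files": [".github/workflows/ci.yml", ".env"],
    "network_domains": ["pypi.org", "unknown.example"], "allowed_domains": ["pypi.org"],
    "pr_opened": True, "evidence": ["test_output"],
    "llm_calls": [{"input_tokens": None}],
}
```

A run that only reads files and stays in budget produces no alerts, which is the point: ordinary tool use stays quiet.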
Build an audit trail reviewers can follow
The audit trail should connect the task or issue, agent session, trace ID, base commit, commits, pull request, reviewer decisions, CI runs, tool approvals, policy versions, and redaction or content-capture settings.
This is the evidence chain a reviewer needs when an agent-authored PR behaves strangely. It is also what security teams need when a workflow edit, secret scan finding, unknown network call, or policy bypass attempt appears after the fact.
GitHub Copilot docs state that session logs show reasoning, tools used, repository understanding, code changes, and validation, with commits linked back to session logs. That is the right direction: reviewers should be able to move from code diff to agent session without guessing what happened.
Open issues in the ecosystem show why you should test your telemetry itself. Public reports have described missing nested subagent or MCP tool spans in integrations, and content-capture settings that changed log emission behavior in specific Claude Code telemetry setups. Treat telemetry pipelines as production infrastructure. Add tests that prove nested spans, tool results, redaction, and cost fields arrive as expected.
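A telemetry self-check along those lines can run against exported spans in CI. The sketch below reports orphaned child spans and missing required attributes. The flat span dictionaries, the `REQUIRED` map, and the attribute names are assumptions about an export format, not any vendor's schema.

```python
from typing import Dict, List, Set, Tuple

# Required attributes per span kind (illustrative).
REQUIRED: Dict[str, Set[str]] = {
    "llm": {"input_tokens", "output_tokens", "cost_usd"},
    "tool": {"tool_name", "status"},
}


def telemetry_gaps(spans: List[dict]) -> List[Tuple[str, str]]:
    """Return (span_id, problem) pairs for orphaned spans and missing fields."""
    known_ids = {span["span_id"] for span in spans}
    gaps = []
    for span in spans:
        parent = span.get("parent_id")
        if parent is not None and parent not in known_ids:
            gaps.append((span["span_id"], "orphaned: parent span missing"))
        present = set(span.get("attributes", {}))
        for name in sorted(REQUIRED.get(span["kind"], set()) - present):
            gaps.append((span["span_id"], f"missing attribute: {name}"))
    return gaps


exported = [
    {"span_id": "s1", "parent_id": None, "kind": "agent_run", "attributes": {}},
    {"span_id": "s2", "parent_id": "s1", "kind": "llm",
     "attributes": {"input_tokens": 10, "output_tokens": 5}},
    {"span_id": "s3", "parent_id": "lost", "kind": "tool",
     "attributes": {"tool_name": "bash", "status": "ok"}},
]
gaps = telemetry_gaps(exported)
```

Failing a pipeline check when `gaps` is non-empty treats missing spans as a defect rather than as an absence of activity.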
A practical AI agent observability pattern
The strongest pattern today is OpenTelemetry or OpenInference-compatible tracing for common LLM and agent spans, plus a coding-agent schema for repository and machine side effects.
In practice, that means one trace per run, consistent span names, stable IDs for task, session, repo, PR, and commit, and redaction rules that are enforced before data leaves the runner.
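Redaction before export can be sketched as a small pattern pass over any text that is about to leave the runner. The patterns below are illustrative and far from exhaustive: they cover the common shape of AWS access key IDs, classic GitHub personal access tokens, and simple `name=value` secret assignments. A real deployment would more likely use a dedicated secret scanner.

```python
import re
from typing import Dict, Tuple

TOKEN_PATTERNS = [
    ("aws_access_key_id", re.compile(r"\bAKIA[0-9A-Z]{16}\b")),
    ("github_token", re.compile(r"\bghp_[A-Za-z0-9]{36}\b")),
]
# Keeps the variable name and separator, replaces only the value.
ASSIGNMENT = re.compile(r"\b(password|secret|token|api_key)\b(\s*[=:]\s*)\S+",
                        re.IGNORECASE)


def redact(text: str) -> Tuple[str, Dict[str, int]]:
    """Return redacted text plus a count of replacements per pattern."""
    counts: Dict[str, int] = {}
    for name, pattern in TOKEN_PATTERNS:
        text, counts[name] = pattern.subn(f"[REDACTED:{name}]", text)
    text, counts["assigned_secret"] = ASSIGNMENT.subn(r"\1\2[REDACTED]", text)
    return text, counts


clean, counts = redact("export PASSWORD=hunter2 key AKIAABCDEFGHIJKLMNOP")
```

Recording the replacement counts as span attributes shows that redaction ran, without recording what it removed.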
Keep content capture off by default. Store hashes and digests for correlation. Allow short-lived full capture for debugging and incidents. Make the destination visible when enterprise-managed telemetry sends prompts, commands, or tool results to a company endpoint.
Then build dashboards around operational questions:
Which agent runs changed code, dependencies, workflow files, or secrets-related files?
Which sessions retried without making progress?
Which pull requests lack test output, security scan status, or reviewer approval?
Which teams or repos are seeing p99 cost spikes?
Which failure classes repeat across tools, models, or environments?
Which telemetry fields are missing from production traces?
Those questions are more useful than vanity metrics. Accepted lines of code, cost per line, or individual developer AI usage can mislead managers if they are separated from diff risk, review outcome, and test evidence.
Rollout checklist for platform teams
Create an agent_run root span for every session.
Record LLM spans with model, prompt fingerprint, token counts, latency, guardrail result, error, and cost estimate.
Record tool spans with action type, result, retry count, side effects, approval source, and policy decision.
Capture repo URL, base SHA, worktree status before and after, touched files, patch, deleted files, generated files, test output, and PR URL.
Log shell commands with cwd, shell, exit code, duration, truncation policy, environment allowlist, network destination summary, and approval mode.
Keep prompt, response, tool content, Bash output, and raw API body capture opt-in with redaction, retention limits, access controls, and user notice.
Meter cost at model-call, run, session, PR, repo, team, and org levels.
Attach evals and failure classes to traces.
Alert on risky side effects: secrets, workflow edits, unknown network calls, dependency installs, policy bypass attempts, runaway spend, and missing telemetry.
Require human review for code, workflow, secret, permission, and deployment-impacting changes.
Connect task, trace, base commit, commits, PR, CI run, reviewer decision, tool approval, and policy version in one audit trail.
The operating standard
AI agent observability for coding agents should answer one hard question: can you explain, review, and govern everything the agent did before its work reaches production?
If the answer depends only on prompt logs, you are under-instrumented. The operating standard is a complete run record: model calls, tool calls, commands, diffs, tests, scans, cost, failures, approvals, and audit links. That is what lets platform teams scale coding agents without giving up control of the repo, CI system, or review process.