Token Budget
A limit on how many tokens a request, task, or agent is allowed to consume.
What Is a Token Budget
A token budget is a limit set on the number of tokens that a task, request, conversation, or agent is allowed to consume, whether as input, output, or both combined. It is used both to control the cost of running a large language model and to keep usage within a model's context window.
How It's Used
Token budgets can be applied at several levels. A single API request can specify a maximum number of output tokens the model may generate, which caps both response length and cost per call. An application can also track cumulative token usage across a session or task and stop or intervene once a set threshold is reached. In systems with multiple long-running agents, a token budget may be assigned per agent or per task so that a single runaway process, such as an agent stuck in a repetitive tool-calling loop, cannot consume a disproportionate share of resources or drive up cost.
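The cumulative tracking described above can be sketched in a few lines of Python. This is a minimal illustration, not any particular platform's implementation: `TokenBudget`, `BudgetExceeded`, and `run_agent` are hypothetical names, and the per-step token counts are assumed to come from the usage figures a provider reports after each call.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when recorded usage passes the budget's limit."""


@dataclass
class TokenBudget:
    limit: int      # maximum total tokens (input + output) allowed
    used: int = 0   # tokens recorded so far

    @property
    def remaining(self) -> int:
        # Never report a negative remainder, even after an overshoot.
        return max(self.limit - self.used, 0)

    def charge(self, input_tokens: int, output_tokens: int) -> None:
        """Record one call's usage; raise once the total passes the limit."""
        self.used += input_tokens + output_tokens
        if self.used > self.limit:
            raise BudgetExceeded(f"used {self.used} of {self.limit} tokens")


def run_agent(budget: TokenBudget, step_usages) -> int:
    """Process (input_tokens, output_tokens) pairs, one per agent step.

    Stops at the first step that pushes usage over the budget and
    returns the number of steps that stayed within it.
    """
    completed = 0
    for input_tokens, output_tokens in step_usages:
        try:
            budget.charge(input_tokens, output_tokens)
        except BudgetExceeded:
            break  # intervene: halt the loop instead of continuing to spend
        completed += 1
    return completed
```

Because usage is only known after a call returns, this sketch can overshoot the limit by up to one step. A stricter variant would estimate the next step's tokens and refuse to start it when `remaining` is too small.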
Why It Matters
Most model providers bill by token, counting both the tokens sent as input and the tokens generated as output, so token consumption maps directly to operating cost. Without an enforced budget, an unexpectedly long conversation, an agent stuck retrying a failing step, or a request that pulls in excessive retrieved content can produce a much larger bill than anticipated. A token budget also acts as a safeguard against exceeding a model's context window, since a request that would otherwise overflow the window can be caught, trimmed, or rejected before it fails at the API level.
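As a rough illustration of both points, the sketch below estimates the cost of a call and trims retrieved content so a request fits the context window. All names are hypothetical and the prices in the usage note are made-up placeholders, not any provider's rates. It also assumes the context window must hold the prompt plus the reserved output tokens, which is common but worth confirming for the model in use.

```python
def estimate_cost(input_tokens, output_tokens,
                  input_price_per_million, output_price_per_million):
    """Return the estimated cost of one call, given per-million-token prices."""
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000


def fits_context(prompt_tokens, max_output_tokens, context_window):
    """True if the prompt plus the reserved output fits in the window."""
    return prompt_tokens + max_output_tokens <= context_window


def select_chunks(chunk_sizes, base_prompt_tokens, max_output_tokens,
                  context_window):
    """Keep retrieved chunks, in order, until the window would overflow.

    chunk_sizes holds the token count of each retrieved chunk.
    Returns the indices of the chunks that fit.
    """
    available = context_window - base_prompt_tokens - max_output_tokens
    kept = []
    for index, size in enumerate(chunk_sizes):
        if size > available:
            break  # this chunk and everything after it is dropped
        kept.append(index)
        available -= size
    return kept
```

For example, with placeholder prices of 3.00 per million input tokens and 15.00 per million output tokens, `estimate_cost(2000, 500, 3.0, 15.0)` comes to 0.0135. Running `select_chunks` before sending a request lets the application trim or reject an oversized prompt itself rather than receive an error from the API.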
Setting an Effective Budget
Budgets are usually set at one or more of the following levels:
- Per-request limits: cap the output length of a single call, which keeps responses concise and predictable.
- Per-task limits: cap the total tokens across all the steps of a multi-step task, which is useful for agents that call tools repeatedly.
- Per-user or per-account limits: cap usage over a billing period to control overall spend.
Choosing a token budget generally involves a tradeoff: a tight budget reduces cost and risk but can cut off a task before it finishes, while a generous budget gives a model or agent more room to complete complex work at the cost of higher potential spend. Platforms that run many concurrent agents typically expose token budgets as a configurable setting rather than a fixed value, since appropriate limits vary by task complexity and the model in use.
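One way to make budgets configurable is to keep a default and override it per task type. The sketch below is only an illustration of that idea: the task names, the numeric limits, and the names `BudgetConfig`, `resolve_budget`, and `first_violation` are all assumptions chosen for the example, not recommended values or an existing API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BudgetConfig:
    max_output_tokens: int   # per-request cap on generated tokens
    max_task_tokens: int     # per-task cap across all steps
    max_account_tokens: int  # per-account cap for one billing period


# Illustrative numbers only; real limits depend on the task and the model.
DEFAULT_BUDGET = BudgetConfig(1_024, 50_000, 5_000_000)

TASK_OVERRIDES = {
    # Simple tasks get a tight budget, complex ones get more room.
    "classification": BudgetConfig(64, 5_000, 5_000_000),
    "code_refactor": BudgetConfig(4_096, 400_000, 5_000_000),
}


def resolve_budget(task_type: str) -> BudgetConfig:
    """Return the override for a task type, or the default if none exists."""
    return TASK_OVERRIDES.get(task_type, DEFAULT_BUDGET)


def first_violation(config: BudgetConfig, output_tokens: int,
                    task_tokens: int, account_tokens: int) -> Optional[str]:
    """Name the narrowest limit that has been exceeded, or None if all hold."""
    if output_tokens > config.max_output_tokens:
        return "per-request"
    if task_tokens > config.max_task_tokens:
        return "per-task"
    if account_tokens > config.max_account_tokens:
        return "per-account"
    return None
```

Checking the narrowest limit first means a caller learns that a single response ran long before being told the whole task or account is over budget, which tends to make the cause easier to diagnose.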