"The bill was $40k this month." That sentence has ended more agent projects than any technical problem. Token budget controls are the thing standing between your roadmap and that conversation with finance.
Why budgets, not just monitoring
Monitoring tells you what happened. Budgets prevent it. The difference is whether the model can spend $5,000 on one runaway task before any alert fires.
This article is the enforcement companion to AI agent cost optimization. Optimization makes things cheaper on average; budgets cap the worst case.
The four budget tiers
Stack them in order. Each catches what the lower tier missed.
Tier 1: per-call
Every individual model call has an explicit max_tokens and a server-side hard cap. No call can exceed it.
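HARD_CAP_PER_CALL = 8000  # server-side ceiling; no caller may exceed it

class BudgetError(Exception):
    """Raised when a request asks for more than the hard cap allows."""

def call_model(messages, *, max_tokens=4000):
    # Reject oversized requests before they reach the provider.
    if max_tokens > HARD_CAP_PER_CALL:
        raise BudgetError("max_tokens exceeds hard cap")
    # `client` and `MODEL_ID` are assumed to be configured elsewhere.
    return client.messages.create(
        model=MODEL_ID,
        messages=messages,
        max_tokens=max_tokens,
    )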
Catches: typo in max_tokens, model output running away.
Tier 2: per-task
A task (one user request, possibly involving many calls) has a token budget. The agent loop tracks consumption and aborts when exceeded.
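class BudgetExceeded(Exception):
    """Raised when a task spends more than its token limit."""

class TaskBudget:
    def __init__(self, limit):
        self.limit = limit  # maximum tokens this task may consume
        self.spent = 0      # tokens consumed so far

    def charge(self, tokens):
        # Record the usage first, then abort if the total crosses the limit.
        self.spent += tokens
        if self.spent > self.limit:
            raise BudgetExceeded(f"task limit {self.limit} exceeded")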
Wire it into every model call. Default budget by task class:
| Task class | Default budget |
|---|---|
| Quick Q&A | 5k tokens |
| Code generation | 50k tokens |
| Research / Deep agent | 200k tokens |
| Bulk batch | per-batch cap |
Catches: agent loops, retries spiralling, oversized retrieval contexts.
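As a rough illustration of wiring the task budget into an agent loop, the sketch below charges the budget after every step and aborts once the limit is crossed. The `run_task` helper, the `DEFAULT_BUDGETS` keys, and the idea that each step returns its own token count are assumptions made for the example, not part of any particular framework.

```python
class BudgetExceeded(Exception):
    """Raised when a task spends more than its token limit."""


class TaskBudget:
    def __init__(self, limit):
        self.limit = limit
        self.spent = 0

    def charge(self, tokens):
        self.spent += tokens
        if self.spent > self.limit:
            raise BudgetExceeded(f"task limit {self.limit} exceeded")


# Default limits per task class, mirroring the table above (keys are hypothetical).
DEFAULT_BUDGETS = {"quick_qa": 5_000, "codegen": 50_000, "research": 200_000}


def run_task(task_class, steps):
    """Run an agent loop, charging the budget after every model call.

    `steps` is a hypothetical iterable of callables; each one performs a
    model call and returns the tokens it used (input plus output).
    """
    budget = TaskBudget(DEFAULT_BUDGETS[task_class])
    completed = 0
    try:
        for step in steps:
            budget.charge(step())
            completed += 1
    except BudgetExceeded:
        return {"status": "aborted", "completed": completed, "spent": budget.spent}
    return {"status": "ok", "completed": completed, "spent": budget.spent}


def runaway():
    """Simulate a loop that never stops by itself: 2,000 tokens per step."""
    while True:
        yield lambda: 2_000


result = run_task("quick_qa", runaway())
print(result)
```

Because the charge happens after the call, the task can overshoot its limit by at most one call's worth of tokens, which is why the per-call cap from Tier 1 still matters.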
Tier 3: per-user / per-tenant
Aggregate across tasks. Each tenant has a daily / monthly cap.
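def check_tenant_budget(tenant_id, requested_tokens):
    # Reject the request if it would push the tenant past today's limit.
    spent_today = budget_store.get(tenant_id, today())
    if spent_today + requested_tokens > tenant_limit(tenant_id):
        raise TenantBudgetExceeded()
    # Note: get-then-incr is not atomic. Under concurrent load, prefer a
    # single atomic increment-and-check in the backing store.
    budget_store.incr(tenant_id, today(), requested_tokens)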
Catches: a single tenant's bug or abuse pattern that would otherwise consume the entire org's budget. This is the most important catch of the four.
For multi-tenant SaaS: bill the tenant for usage above the included tier, throttle below. For internal tools: allocate budget per team and let them own their consumption.
Tier 4: org-wide kill switch
A single number, monitored in real time. If the org's hourly burn rate exceeds N× normal, every agent gets a soft-stop signal. Manual override needed to resume.
This is the equivalent of a circuit breaker on a power line. Use it.
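One way to sketch the kill switch is a rolling one-hour window of token spend compared against a multiple of a known baseline. The `KillSwitch` class, its parameter names, and the default multiplier of 3 are illustrative assumptions; a real deployment would keep this state in a shared store so every agent sees the same flag.

```python
import time
from collections import deque


class KillSwitch:
    """Trips when spend inside the rolling window exceeds multiplier x baseline."""

    def __init__(self, baseline_tokens_per_hour, multiplier=3.0, window_s=3600):
        self.threshold = baseline_tokens_per_hour * multiplier
        self.window_s = window_s
        self.events = deque()  # (timestamp, tokens) pairs inside the window
        self.total = 0
        self.tripped = False

    def record(self, tokens, now=None):
        now = time.time() if now is None else now
        self.events.append((now, tokens))
        self.total += tokens
        # Evict events that have fallen out of the rolling window.
        while self.events and self.events[0][0] <= now - self.window_s:
            _, old_tokens = self.events.popleft()
            self.total -= old_tokens
        if self.total > self.threshold:
            self.tripped = True  # latches until a human resets it

    def allow(self):
        """Agents call this before each model call."""
        return not self.tripped

    def reset(self):
        """Manual override to resume traffic."""
        self.tripped = False


switch = KillSwitch(baseline_tokens_per_hour=1_000_000, multiplier=3.0)
switch.record(2_000_000, now=0)
switch.record(1_500_000, now=600)
print(switch.allow())
```

The switch latches on purpose: it stays tripped after the burn rate falls, so resuming requires the manual `reset()` call described above.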
Counting accurately
Budget control fails on bad accounting. Three details:
- Count both input and output. Input tokens are usually the bigger half but get forgotten because they are not "generated."
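- Include cache writes at their real price. On providers that bill prompt caching the way Anthropic does, the first call that writes a cached prefix costs more than normal input (1.25× for the default 5-minute cache), while later cache reads are discounted. Track both separately.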
- Tools cost too. External APIs, embedding calls, retrieval queries — convert to token-equivalent via a documented rate. Otherwise budgets miss half the spend.
Use the model provider's reported usage from the response, not estimated counts. Discrepancies between estimate and actual indicate something is broken.
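The three accounting details can be folded into one "billable tokens" figure per call. The sketch below is only an illustration: the `Usage` dataclass, the cache-read rate, and the per-tool token-equivalents in `TOOL_RATES` are assumed values that should be replaced with your provider's published pricing and your own documented rates.

```python
from dataclasses import dataclass

# Assumed rates, expressed relative to one normal input token.
CACHE_WRITE_RATE = 1.25  # cache-write premium quoted in the text
CACHE_READ_RATE = 0.1    # assumption: cache reads are heavily discounted
TOOL_RATES = {           # assumption: token-equivalents per tool invocation
    "web_search": 500,
    "embedding": 50,
}


@dataclass
class Usage:
    """Token counts as reported by the provider for a single call."""
    input_tokens: int = 0
    output_tokens: int = 0
    cache_write_tokens: int = 0
    cache_read_tokens: int = 0


def billable_tokens(usage, tool_calls=()):
    """Convert one call's usage plus its tool invocations to token-equivalents."""
    model_part = (
        usage.input_tokens
        + usage.output_tokens
        + usage.cache_write_tokens * CACHE_WRITE_RATE
        + usage.cache_read_tokens * CACHE_READ_RATE
    )
    tool_part = sum(TOOL_RATES[name] for name in tool_calls)
    return round(model_part + tool_part)


usage = Usage(input_tokens=1_200, output_tokens=300, cache_write_tokens=4_000)
total = billable_tokens(usage, tool_calls=["web_search", "embedding"])
print(total)  # 1200 + 300 + 5000 + 550 = 7050
```

Charging this figure rather than raw output tokens keeps the cache write and the two tool calls, which together make up most of the total here, from slipping past the budget.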
Pre-call admission control
For latency-sensitive systems, check the budget before the call, not after:
def admit(task_id, tenant_id, estimated_tokens):
    # Check the narrowest scope first, then the tenant-wide cap.
    if not task_budget(task_id).has_room(estimated_tokens):
        return reject("task budget would exceed")
    if not tenant_budget(tenant_id).has_room(estimated_tokens):
        return reject("tenant budget would exceed")
    return accept()
Estimation accuracy matters: too pessimistic blocks legitimate work; too optimistic misses budget breaches. Use historical task statistics, not constants.
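A simple way to derive estimates from history is to keep a rolling window of actual usage per task class and estimate at a high percentile. The `UsageEstimator` class below is a hypothetical sketch; the choice of the 90th percentile, the window size, and the fallback default are assumptions to tune against your own data.

```python
import statistics
from collections import defaultdict, deque


class UsageEstimator:
    """Estimates tokens per task class from a rolling window of actual usage."""

    def __init__(self, default=4_000, window=200, min_samples=20):
        self.default = default          # used until enough history exists
        self.min_samples = min_samples
        self.history = defaultdict(lambda: deque(maxlen=window))

    def observe(self, task_class, actual_tokens):
        """Record the provider-reported usage of a finished call."""
        self.history[task_class].append(actual_tokens)

    def estimate(self, task_class):
        samples = self.history[task_class]
        if len(samples) < self.min_samples:
            return self.default
        # The last of the nine decile cut points approximates the 90th percentile.
        return int(statistics.quantiles(samples, n=10)[-1])


estimator = UsageEstimator()
cold_start = estimator.estimate("codegen")  # falls back to the default
for tokens in range(100, 2_100, 100):       # 20 observations: 100..2000
    estimator.observe("codegen", tokens)
warm = estimator.estimate("codegen")
print(cold_start, warm)
```

A higher percentile blocks more legitimate work and a lower one lets more breaches through, so the percentile is the knob that trades one failure mode against the other.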
What to do when a budget hits
Three options, picked by task class:
| Strategy | When to use it |
|---|---|
| Hard abort | a runaway loop is suspected; batch jobs |
| Graceful degrade | a cheaper model is acceptable: switch to it and retry once |
| Ask the user | interactive sessions: "this will exceed your budget. Proceed?" |
Each requires different plumbing. Hard abort is the easiest to ship; graceful degrade is the most useful in practice.
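Graceful degrade can be sketched as a wrapper that retries exactly once on a cheaper model and lets a second breach propagate as a hard abort. The `run` callable, the model names, and the returned dictionary shape are hypothetical; the sketch also assumes the caller gives the retry its own budget.

```python
class BudgetExceeded(Exception):
    """Raised by the task runner when the task budget runs out."""


def run_with_degrade(run, primary_model, fallback_model):
    """Try the primary model; on a budget breach, retry once on the fallback.

    `run(model)` is a hypothetical callable that executes the task on the
    given model and raises BudgetExceeded if the budget is exhausted.
    """
    try:
        return {"model": primary_model, "result": run(primary_model), "degraded": False}
    except BudgetExceeded:
        pass  # fall through to the single retry below
    # A second BudgetExceeded is not caught here: it becomes a hard abort.
    return {"model": fallback_model, "result": run(fallback_model), "degraded": True}


def fake_run(model):
    """Stand-in task runner: the large model blows the budget, the small one fits."""
    if model == "large-model":
        raise BudgetExceeded("task limit exceeded")
    return f"answer from {model}"


outcome = run_with_degrade(fake_run, "large-model", "small-model")
print(outcome)
```

Returning the `degraded` flag lets the caller tell the user that the answer came from the cheaper model.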
Observability you need
Budgets without dashboards are theatre. At minimum, the dashboard should show:
- Realtime burn rate in tokens per minute, per tenant.
- Top-N tasks by token spend in the last hour.
- Budget violations count, by tier, by tenant.
- Forecast to month end based on current burn vs. month budget.
Pair with agent token usage analytics for analytical views and real-time agent monitoring for live alerting.
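The month-end forecast can start as a plain linear extrapolation of the burn so far. The function below is a minimal sketch under that assumption; it ignores weekday patterns and growth trends, and its name and return shape are illustrative.

```python
import calendar
from datetime import date


def forecast_month_end(spent_so_far, today, month_budget):
    """Project month-end token spend by extrapolating the average daily burn."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    daily_rate = spent_so_far / today.day  # average over the days elapsed so far
    projected = daily_rate * days_in_month
    return {
        "projected": round(projected),
        "over_budget": projected > month_budget,
        "days_left": days_in_month - today.day,
    }


# 12M tokens spent by June 10, against a 30M monthly budget.
forecast = forecast_month_end(12_000_000, date(2025, 6, 10), 30_000_000)
print(forecast)
```

Here the average burn is 1.2M tokens per day, so the 30-day projection is 36M tokens, which exceeds the 30M budget with 20 days still to go.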
Anti-patterns
- Soft warnings only. "We logged a warning" — the warning was ignored. Hard limits or none.
- Budgets as suggestions. If breach has no enforcement, the budget is documentation.
- Per-API-key budgets only. Misses the case where a logical tenant uses many keys.
- No exemption process. Big customers occasionally need temporary increases. Bake the process in or someone will disable budgets entirely.
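To avoid the per-API-key trap, resolve every key to its logical tenant before charging, so that spend across keys lands in one bucket. The mapping and names below are made up for illustration; in practice the lookup would come from your auth or billing system.

```python
from collections import defaultdict

# Hypothetical mapping from API keys to the logical tenant that owns them.
KEY_TO_TENANT = {
    "key-prod-1": "acme",
    "key-prod-2": "acme",
    "key-dev-9": "globex",
}

spend_by_tenant = defaultdict(int)


def charge(api_key, tokens):
    """Charge usage to the owning tenant rather than to the individual key."""
    tenant = KEY_TO_TENANT[api_key]
    spend_by_tenant[tenant] += tokens
    return tenant, spend_by_tenant[tenant]


charge("key-prod-1", 400)
tenant, total = charge("key-prod-2", 700)
print(tenant, total)
```

Both keys belong to the same tenant, so the second charge reports the combined 1,100 tokens instead of 700.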
A starter implementation in 100 lines
The smallest useful budget control is:
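class Budgets:
    def __init__(self, store):
        # `store` is any counter backend with atomic incr/decr that returns the new total.
        self.store = store

    def check_and_charge(self, scope, key, tokens, limit):
        # Reserve the tokens up front; roll back if that would breach the limit.
        spent = self.store.incr(f"{scope}:{key}", tokens, ttl=86400)
        if spent > limit:
            self.store.decr(f"{scope}:{key}", tokens)
            raise BudgetExceeded(scope, key, spent, limit)
        return spent

    def true_up(self, scope, key, diff):
        # Spend that already happened is recorded even if it crosses the limit;
        # the next check_and_charge for this key is then rejected.
        return self.store.incr(f"{scope}:{key}", diff, ttl=86400)

# In the call wrapper
budgets.check_and_charge("task", task_id, est_tokens, TASK_LIMIT)
budgets.check_and_charge("tenant", tenant_id, est_tokens, tenant_daily_limit(tenant_id))
result = call_model(...)

# True-up after actual usage. Assumes the response reports input_tokens and
# output_tokens separately, as the messages.create call used earlier does.
actual_tokens = result.usage.input_tokens + result.usage.output_tokens
diff = actual_tokens - est_tokens
if diff != 0:
    budgets.true_up("task", task_id, diff)
    budgets.true_up("tenant", tenant_id, diff)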
Backed by Redis or DynamoDB. 100 lines, ships in a day, prevents the worst surprise invoices.
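For local testing before Redis or DynamoDB is wired in, the store can be a small in-memory stand-in. The `InMemoryStore` class below is a hypothetical sketch that only mimics the `incr`/`decr` interface the `Budgets` class above expects; it is not thread-safe and does not share state across processes.

```python
import time


class BudgetExceeded(Exception):
    """Raised when a charge would push a scope past its limit."""


class InMemoryStore:
    """Hypothetical stand-in for a Redis or DynamoDB counter with a TTL."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def incr(self, key, amount, ttl=86400):
        now = time.monotonic()
        value, expires_at = self._data.get(key, (0, now + ttl))
        if expires_at <= now:  # counter expired: start a fresh window
            value, expires_at = 0, now + ttl
        value += amount
        self._data[key] = (value, expires_at)
        return value

    def decr(self, key, amount):
        return self.incr(key, -amount)


class Budgets:
    def __init__(self, store):
        self.store = store

    def check_and_charge(self, scope, key, tokens, limit):
        spent = self.store.incr(f"{scope}:{key}", tokens, ttl=86400)
        if spent > limit:
            self.store.decr(f"{scope}:{key}", tokens)
            raise BudgetExceeded(scope, key, spent, limit)
        return spent


budgets = Budgets(InMemoryStore())
first = budgets.check_and_charge("tenant", "acme", 600, limit=1_000)
try:
    budgets.check_and_charge("tenant", "acme", 600, limit=1_000)
    rejected = False
except BudgetExceeded:
    rejected = True
print(first, rejected)
```

The second charge is rejected and rolled back, so the counter stays at 600 and a smaller request can still be admitted afterwards.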