Agent cost and latency grow with the number of model calls, context size, output size, tool duration, retries, and handoffs. A small inefficiency repeated inside a loop can dominate the entire request.
Optimization starts with measurement. Break total time and cost down by step, then improve the largest contributors while protecting task quality and safety.
Budgets are product behavior. The system should know what to do when it reaches a token, time, tool-call, or monetary limit.
An agent run is a budgeted execution graph. Every model call, token, retrieval, tool, retry, and approval consumes time or money and should earn its place.
Define maximum model calls, tokens, tool calls, retries, wall-clock time, and estimated spend. Use different budgets for interactive, background, and high-value workflows.
A budget should end in a useful state: partial results, a concise explanation, a clarification request, or human escalation.
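One way to make "useful state" concrete is to map each stop reason to an explicit outcome. The sketch below is only an illustration: `RunResult`, `finish_on_budget`, and the policy it encodes are hypothetical names and choices, not part of any framework.

```python
from dataclasses import dataclass, field


@dataclass
class RunResult:
    status: str    # "partial", "needs_clarification", or "escalated"
    message: str   # concise explanation shown to the user
    partial_results: list[str] = field(default_factory=list)


def finish_on_budget(reason: str, findings: list[str]) -> RunResult:
    """Turn a budget stop into something the user can act on (assumed policy)."""
    if findings:
        # Some work finished: return it with a short explanation.
        return RunResult(
            "partial", f"Stopped early ({reason}); showing what was found.", findings
        )
    if reason == "timeout":
        # Nothing found in time: ask the user to narrow the request.
        return RunResult(
            "needs_clarification", "This is taking too long. Can you narrow the request?"
        )
    # Nothing found and no cheap way forward: hand off to a person.
    return RunResult("escalated", f"Stopped ({reason}); routed to human review.")
```

The exact mapping is a product decision; the point is that every stop reason has a defined, user-visible outcome.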
Use deterministic code for validation and calculation, search indexes for retrieval, small models for routine classification, and stronger models only where difficult reasoning improves outcomes.
Model routing needs evaluations and fallback behavior. Cheap but inaccurate routing can create more retries and cost than it saves.
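A minimal routing sketch with a single fallback might look like the following. The two model functions are stand-ins that return a made-up confidence score, and the `0.7` threshold is an assumption that would need to come from evaluations.

```python
def small_model(task: str) -> tuple[str, float]:
    # Stand-in for a cheap model: confident only on short, routine inputs.
    return ("routine", 0.9) if len(task) < 40 else ("unsure", 0.4)


def strong_model(task: str) -> tuple[str, float]:
    # Stand-in for a more capable, more expensive model.
    return ("careful answer", 0.95)


def route(task: str, high_risk: bool, min_confidence: float = 0.7) -> tuple[str, str]:
    """Return (answer, tier used). Escalate on risk or low confidence."""
    if high_risk:
        answer, _ = strong_model(task)
        return answer, "strong"
    answer, confidence = small_model(task)
    if confidence >= min_confidence:
        return answer, "small"
    # Fallback: escalate once instead of retrying the cheap model in a loop.
    answer, _ = strong_model(task)
    return answer, "strong"
```

Logging which tier answered each task makes it possible to evaluate whether the router is actually saving money.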
Summarize old state, retrieve fewer higher-quality passages, remove duplicate instructions, and avoid resending large tool outputs. Cache stable retrieval, embeddings, deterministic tool results, and approved summaries where privacy permits.
Cache keys must include relevant user, tenant, permissions, version, and freshness information to prevent stale or cross-user results.
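As a rough illustration, a cache key can be derived by hashing every field that affects what the caller is allowed to see. The function name, field names, and the idea of passing a precomputed `freshness_bucket` (for example, the current time divided by a TTL) are assumptions for this sketch.

```python
import hashlib
import json


def cache_key(
    tenant_id: str,
    user_id: str,
    permissions: list[str],
    tool: str,
    args: dict,
    version: str,
    freshness_bucket: int,
) -> str:
    """Build a key that changes whenever identity, access, version, or freshness changes."""
    payload = {
        "tenant": tenant_id,
        "user": user_id,
        "permissions": sorted(permissions),  # order must not affect the key
        "tool": tool,
        "args": args,
        "version": version,
        "fresh": freshness_bucket,
    }
    # Canonical JSON so equal inputs always hash to the same key.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Results that are identical for every user in a tenant could drop the user field, but only when the permission check confirms that is safe.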
Independent read-only tools can run in parallel, reducing wall-clock time. Do not parallelize actions with ordering dependencies or conflicting writes.
Set per-tool timeouts and cancel work that is no longer needed after another branch produces sufficient evidence.
Stream useful progress events such as searching, validating, waiting for approval, and summarizing. Do not expose private reasoning or fill the interface with noisy internal events.
For long background work, return a job ID, allow cancellation, and notify the user when the result is ready.
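A small in-process sketch of the job pattern is shown below. It assumes a single event loop and an in-memory registry; `submit`, `cancel`, and the `notify` callback are hypothetical names, and a production system would more likely use a durable queue.

```python
import asyncio
import uuid
from typing import Callable

jobs: dict[str, asyncio.Task] = {}  # in-memory registry; a real service would persist this


async def long_task(steps: int) -> str:
    for _ in range(steps):
        await asyncio.sleep(0.01)  # stand-in for slow work
    return "report ready"


def submit(steps: int, notify: Callable[[str, str], None]) -> str:
    """Start work and return a job ID immediately. Call from inside a running loop."""
    job_id = uuid.uuid4().hex
    task = asyncio.create_task(long_task(steps))

    def on_done(finished: asyncio.Task) -> None:
        # This sketch assumes long_task does not raise.
        notify(job_id, "cancelled" if finished.cancelled() else finished.result())

    task.add_done_callback(on_done)
    jobs[job_id] = task
    return job_id


def cancel(job_id: str) -> bool:
    task = jobs.get(job_id)
    return task.cancel() if task is not None else False


async def demo() -> dict[str, str]:
    outcomes: dict[str, str] = {}
    short_id = submit(2, outcomes.__setitem__)
    long_id = submit(1000, outcomes.__setitem__)
    cancel(long_id)  # the user changed their mind
    await asyncio.gather(jobs[short_id], jobs[long_id], return_exceptions=True)
    await asyncio.sleep(0)  # let any pending done-callbacks run
    return {"short": outcomes[short_id], "long": outcomes[long_id]}
```

Calling `asyncio.run(demo())` runs one job to completion and cancels the other, with both outcomes delivered through the same notification path.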
Agent cost is not only the price of one model call. A run may include planning calls, tool calls, retrieval, summarization, verification, retries, embeddings, long context, and human review time. Latency is similarly cumulative. A fast model call can still create a slow product if the agent loops, retrieves too much, or waits on multiple remote systems sequentially.
Start every production design with explicit budgets: maximum model calls, maximum tool calls, maximum tokens, maximum wall-clock time, maximum retry count, and maximum cost per successful task. Then decide what happens when a budget is reached. The agent should return a useful partial result, ask for permission to continue, or escalate rather than spinning silently.
Measure cost per successful outcome, not cost per request. If a cheap model causes more retries, bad tool calls, and reviewer corrections, it may be more expensive than a stronger model for that step. Well-designed systems route by task difficulty, risk, and required reliability rather than always choosing the cheapest model.
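The arithmetic can be sketched as follows, under the simplifying assumption that attempts are independent and repeated until one succeeds. All prices and success rates here are invented for illustration.

```python
def cost_per_success(
    cost_per_attempt: float,
    success_rate: float,
    review_cost_per_failure: float = 0.0,
) -> float:
    """Expected spend per successful task, counting failed attempts and cleanup."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    attempts_per_success = 1 / success_rate
    failures_per_success = attempts_per_success - 1
    return (
        cost_per_attempt * attempts_per_success
        + review_cost_per_failure * failures_per_success
    )


# Illustrative numbers only: a cheap model that fails half the time
# versus a pricier model that succeeds 80% of the time.
cheap = cost_per_success(0.01, 0.5, review_cost_per_failure=0.10)
strong = cost_per_success(0.04, 0.8, review_cost_per_failure=0.10)
```

With these made-up inputs the cheap model comes to roughly 0.12 per success and the stronger model to roughly 0.075, so the model with the higher per-call price is cheaper per outcome.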
The fastest agent is often the one that avoids unnecessary agent behavior. Use deterministic routing when rules are stable. Run independent reads in parallel. Retrieve targeted snippets instead of large documents. Stream progress when work is long, but do not use streaming as a substitute for good architecture.
For user experience, show meaningful milestones rather than raw implementation detail. "Searching policy", "Checking order status", and "Preparing draft" are useful. "Calling model 3" is not. If a workflow may exceed a few seconds, design cancellation and resume behavior from the beginning.
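One simple way to keep implementation detail out of the interface is an allow-list that maps internal events to milestones. The event names below are hypothetical; only the labels come from the examples above.

```python
# Internal event name -> label safe to show the user (event names are assumptions).
MILESTONES = {
    "retrieval.started": "Searching policy",
    "tool.order_lookup.started": "Checking order status",
    "draft.started": "Preparing draft",
    "approval.requested": "Waiting for approval",
}


def user_progress(internal_events: list[str]) -> list[str]:
    """Keep only mapped events and collapse consecutive repeats."""
    shown: list[str] = []
    for event in internal_events:
        label = MILESTONES.get(event)  # unmapped events such as "model.call.3" stay internal
        if label and (not shown or shown[-1] != label):
            shown.append(label)
    return shown
```

Because the mapping is an allow-list, a new internal event is hidden by default until someone decides it deserves a label.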
Cost and latency should be reviewed per successful task, not per model call. A cheap model that causes extra retries, wrong tool calls, or heavy human editing may cost more than a stronger model used selectively. Likewise, a fast first response is not useful if the workflow later stalls on tools or approval.
Break the run into phases: context assembly, model decision, tool execution, verification, approval, and final response. Measure each phase separately. This reveals whether the bottleneck is model latency, retrieval, external APIs, queue delay, or loop behavior.
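Per-phase measurement can be as small as a context manager that accumulates elapsed time by phase name. This is a sketch with `sleep` standing in for real work; the `phase` helper and the `phase_seconds` store are hypothetical names.

```python
from collections import defaultdict
from contextlib import contextmanager
from time import perf_counter, sleep

phase_seconds: dict[str, float] = defaultdict(float)


@contextmanager
def phase(name: str):
    """Add the elapsed wall-clock time of the block to the named phase."""
    start = perf_counter()
    try:
        yield
    finally:
        phase_seconds[name] += perf_counter() - start


with phase("context_assembly"):
    sleep(0.005)  # stand-in for building the prompt
with phase("tool_execution"):
    sleep(0.05)   # stand-in for a slow external API

slowest = max(phase_seconds, key=phase_seconds.get)
```

In a real system the same totals would usually be attached to the run trace rather than kept in a module-level dictionary.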
Optimization should preserve safety. Do not remove verification, citations, or approval just to save seconds. Instead, route easier steps to smaller models, cache stable context, run independent reads in parallel, summarize long history, and stop loops earlier when progress is not improving.
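Stopping a loop when progress stalls needs some per-step progress signal. The sketch below assumes a hypothetical numeric score per step (for example, from a verifier) and uses arbitrary `patience` and `min_gain` defaults.

```python
def should_stop(scores: list[float], patience: int = 2, min_gain: float = 0.01) -> bool:
    """Stop when the best of the last `patience` scores beats the earlier best by < min_gain."""
    if len(scores) <= patience:
        return False  # not enough history to judge
    best_before = max(scores[:-patience])
    recent_best = max(scores[-patience:])
    return recent_best - best_before < min_gain
```

A check like this belongs alongside hard step and cost limits, not in place of them, because a noisy score can hide a stalled loop.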
Set explicit product budgets. For example: maximum cost per resolved ticket, maximum p95 time to draft, maximum steps before escalation, and maximum retries per tool. Budgets make tradeoffs visible to product and engineering teams.
Create a cost and latency budget for one workflow. Estimate the number of model calls, retrieval calls, tool calls, verification steps, and approval waits. Then compare the estimate to real traces after running test cases.
Use the difference to improve the architecture. If retrieval dominates latency, add filters or caching. If model calls dominate cost, route simpler steps to smaller models. If approvals dominate time, improve review payload clarity rather than removing safety.
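Comparing an estimate against traces can start as a small report of which counts ran over plan. The numbers, the key names, and the 25% tolerance in this sketch are invented for illustration.

```python
# Planned calls per run (illustrative estimate).
estimate = {"model_calls": 3, "retrieval_calls": 2, "tool_calls": 2}

# Counts observed in two test runs (illustrative traces).
traces = [
    {"model_calls": 5, "retrieval_calls": 2, "tool_calls": 2},
    {"model_calls": 7, "retrieval_calls": 2, "tool_calls": 1},
]


def overruns(estimate: dict, traces: list[dict], tolerance: float = 1.25) -> dict:
    """Return the mean observed value for each key that exceeds plan by the tolerance."""
    report = {}
    for key, planned in estimate.items():
        observed = sum(trace[key] for trace in traces) / len(traces)
        if observed > planned * tolerance:
            report[key] = observed
    return report


report = overruns(estimate, traces)
```

Here only `model_calls` is flagged, which points toward loop behavior or routing rather than retrieval or tools.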
Track budget exhaustion as its own outcome so teams know whether the system is stopping safely but too often for the product to feel useful.
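A counter keyed by outcome is enough to start tracking this. In the sketch below, the `budget:` prefix is an assumed naming convention and the recorded outcomes are sample data.

```python
from collections import Counter

outcomes: Counter[str] = Counter()


def record(outcome: str) -> None:
    outcomes[outcome] += 1


def exhaustion_rate(counts: Counter[str]) -> float:
    """Share of runs that ended because a budget was reached."""
    total = sum(counts.values())
    exhausted = sum(n for name, n in counts.items() if name.startswith("budget:"))
    return exhausted / total if total else 0.0


# Sample data: eight runs, two of which hit a budget.
for outcome in [
    "success", "success", "budget:step_limit", "failed",
    "budget:timeout", "success", "success", "success",
]:
    record(outcome)
```

Keeping the stop reason in the label shows which limit is hit most often, which is usually the first one to revisit.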
For expert-level work, keep this page connected to an actual run trace. Concepts become much easier to understand when learners can see the input, state, model decision, tool behavior, safety check, and final outcome side by side.
The runtime checks several limits before starting another step.
from dataclasses import dataclass
from time import monotonic


@dataclass
class Budget:
    max_steps: int = 6
    max_tool_calls: int = 4
    max_cost_usd: float = 0.25
    timeout_seconds: float = 15.0


def budget_exhausted(state: dict, budget: Budget) -> str | None:
    if state["steps"] >= budget.max_steps:
        return "step_limit"
    if state["tool_calls"] >= budget.max_tool_calls:
        return "tool_limit"
    if state["cost_usd"] >= budget.max_cost_usd:
        return "cost_limit"
    if monotonic() - state["started_at"] >= budget.timeout_seconds:
        return "timeout"
    return None


state = {
    "steps": 3,
    "tool_calls": 2,
    "cost_usd": 0.08,
    "started_at": monotonic(),
}

print(budget_exhausted(state, Budget()))
Independent tools run concurrently under a single shared timeout.
import asyncio


async def get_order(order_id: str) -> dict:
    await asyncio.sleep(0.2)
    return {"order_id": order_id, "status": "shipped"}


async def get_policy(name: str) -> dict:
    await asyncio.sleep(0.2)
    return {"name": name, "window_days": 14}


async def gather_context():
    order, policy = await asyncio.wait_for(
        asyncio.gather(
            get_order("ORD-17"),
            get_policy("refund"),
        ),
        timeout=1.0,
    )
    return {"order": order, "policy": policy}


print(asyncio.run(gather_context()))
The largest contributor is often repeated model calls with large contexts, but the answer depends on tool infrastructure and workload. Measure per step.
Use the least expensive model that meets the quality target. Difficult or high-risk steps may justify a stronger model.
Streaming improves perceived responsiveness but does not necessarily reduce total completion time.