AI Agent Architecture: Models, Tools, Planner, Runtime and Observability

Production Rule: Never let the model be the permission system. The model can request an action; deterministic code should decide whether the action is allowed.

Core Components

A production AI agent is not a single prompt. It is an application architecture. The model may decide or reason, but the surrounding software controls what the model sees, which tools it can use, how results are checked, and when the task must stop.

Think of a restaurant kitchen. The chef is important, but the kitchen also needs ingredients, stations, order tickets, quality checks, safety rules, and timing. In an agent system, the model is like the chef. The runtime, tool registry, memory, guardrails, and observability are the kitchen.

The best agent architecture separates reasoning from execution. The model proposes an action. The runtime validates it. The tool performs it. The resulting observation is written back to state. The next decision is made with the updated context.
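The propose, validate, execute, observe cycle can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: `fake_model`, `TOOLS`, and `run_agent` are hypothetical names, and the model is replaced by a deterministic stub so the example runs without any provider.

```python
# Hypothetical tool registry: name -> callable. Real tools would call external systems.
TOOLS = {
    "lookup_customer": lambda args: {"customer_id": args["customer_id"], "status": "active"},
}

def fake_model(state: dict) -> dict:
    """Stand-in for an LLM call: proposes one tool call, then a final answer."""
    if not state["observations"]:
        return {"type": "tool_call", "name": "lookup_customer",
                "arguments": {"customer_id": "C-1042"}}
    status = state["observations"][-1]["status"]
    return {"type": "final", "answer": f"Customer is {status}."}

def run_agent(task: str, max_steps: int = 5) -> str:
    state = {"task": task, "observations": []}
    for _ in range(max_steps):
        proposal = fake_model(state)                    # 1. the model proposes
        if proposal["type"] == "final":
            return proposal["answer"]
        if proposal["name"] not in TOOLS:               # 2. the runtime validates
            return "Stopped: model requested an unknown tool."
        observation = TOOLS[proposal["name"]](proposal["arguments"])  # 3. the tool performs
        state["observations"].append(observation)       # 4. the observation updates state
    return "Stopped: step limit reached."               # safe fallback when the loop runs out

print(run_agent("Is customer C-1042 active?"))  # prints: Customer is active.
```

The model never touches `TOOLS` directly. It only returns a proposal, and the loop decides what happens next.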

The model reads the current task and context. The instruction layer tells the model its role, boundaries, and response format. The tool registry describes callable capabilities. The state store remembers progress. The executor performs approved actions. Observability records what happened.

Each component should have an owner. If something goes wrong, engineers need to know whether the issue came from a prompt, model choice, bad tool schema, missing context, unsafe permission, network failure, or weak evaluation.

Model: reasons over the task and available information.
Instructions: define role, style, constraints, and success criteria.
Tool registry: exposes safe, typed actions.
State: stores task progress and observations.
Executor: validates and runs tool calls.
Evaluator: checks quality, policy, correctness, and completion.
Telemetry: logs traces, latency, cost, tool calls, errors, and user outcomes.

Text Diagram: Production Agent Request

This request path is common in enterprise systems where an agent helps users but does not get unlimited access to company systems.

User -> API Gateway -> Auth Context
Auth Context -> Agent Runtime -> Prompt Builder
Prompt Builder -> LLM -> Proposed Tool Call
Proposed Tool Call -> Policy Engine -> Tool Executor
Tool Executor -> Logs + Observation -> Agent State
Agent State -> LLM -> Final Answer or Next Action
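
One way to read this path in code: the auth context is built by trusted infrastructure and travels alongside the model's proposal, and the policy engine consults the auth context rather than anything the model wrote. The sketch below is an assumption-laden illustration. `AuthContext`, `TOOL_SCOPES`, `policy_allows`, and `handle_proposed_call` are hypothetical names, and the scope strings are invented for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AuthContext:
    """Built by the gateway from the authenticated session, never by the model."""
    user_id: str
    tenant: str
    scopes: frozenset

# Hypothetical mapping from tool name to the scope it requires.
TOOL_SCOPES = {"lookup_order": "orders:read", "refund_order": "orders:write"}

def policy_allows(auth: AuthContext, tool_name: str) -> bool:
    required_scope = TOOL_SCOPES.get(tool_name)
    return required_scope is not None and required_scope in auth.scopes

def handle_proposed_call(auth: AuthContext, proposed: dict) -> dict:
    if not policy_allows(auth, proposed["name"]):
        return {"error": "denied", "tool": proposed["name"]}
    # The tenant is taken from the auth context and overrides any model-supplied value.
    arguments = {**proposed["arguments"], "tenant": auth.tenant}
    return {"executed": proposed["name"], "arguments": arguments}

auth = AuthContext(user_id="u-1", tenant="acme", scopes=frozenset({"orders:read"}))
print(handle_proposed_call(auth, {"name": "lookup_order", "arguments": {"order_id": "ORD-17"}}))
print(handle_proposed_call(auth, {"name": "refund_order", "arguments": {"order_id": "ORD-17"}}))
```

The second call is denied because the user holds only the read scope, regardless of what the model asked for.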

Planner, Router and Worker Patterns

Some agents use one model call at a time. Larger systems may split responsibility. A planner breaks the goal into steps. A router chooses the right specialist. A worker performs a narrow task. A reviewer checks the answer before the user sees it.

These patterns are useful when tasks are complex, but every extra model call adds cost and latency. Do not add multiple agents just because the diagram looks impressive. Add them when separate responsibilities improve reliability.

Planner-worker: one component plans, another executes.
Router-specialist: one component sends the task to the right expert prompt or service.
Reviewer: a second pass checks facts, policy, style, or schema.
Human gate: sensitive actions pause for approval.
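
The router-specialist and human-gate patterns can be combined in a small sketch. Everything here is hypothetical: keyword matching stands in for a model-based router, the two workers are stubs, and the `SENSITIVE_KEYWORDS` list is an invented example of what a team might consider high risk.

```python
def billing_worker(task: str) -> str:
    return f"[billing] handled: {task}"

def shipping_worker(task: str) -> str:
    return f"[shipping] handled: {task}"

# Each specialist owns one narrow responsibility.
SPECIALISTS = {"billing": billing_worker, "shipping": shipping_worker}
SENSITIVE_KEYWORDS = ("refund", "delete")

def route(task: str) -> str:
    """Keyword routing as a stand-in for a model-based router."""
    lowered = task.lower()
    if "invoice" in lowered or "refund" in lowered:
        return "billing"
    return "shipping"

def handle(task: str, approved: bool = False) -> str:
    # Human gate: sensitive tasks pause until someone approves them.
    if any(word in task.lower() for word in SENSITIVE_KEYWORDS) and not approved:
        return "paused: waiting for human approval"
    return SPECIALISTS[route(task)](task)

print(handle("Where is my package?"))
print(handle("Refund invoice 12"))
print(handle("Refund invoice 12", approved=True))
```

Note that the approval check is plain code that runs before any specialist is invoked, so a routing mistake cannot bypass it.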

Architecture Decisions That Matter

The most important architecture decisions are not glamorous. You must decide how many steps are allowed, which tools are exposed, what data enters the prompt, how secrets are protected, how errors are retried, and how users can understand what the agent did.

A reliable agent is often boring inside. It has small tools, strict schemas, narrow permissions, good logs, and clear fallback behavior.

Prefer small tool surfaces over broad admin-like tools.
Store trace IDs so support engineers can debug user reports.
Use idempotency keys for actions that might be retried.
Separate read tools from write tools.
Keep high-risk write actions behind approvals.

The Production Agent Stack

A production agent stack has layers. The interface captures user intent. The orchestrator manages the run. The context layer retrieves approved state, memory, and knowledge. The model layer performs language judgment. The tool layer executes bounded capabilities. The policy layer validates permissions and approvals. The observability layer records what happened.

These layers should be explicit even if the first implementation is small. When everything lives inside one prompt and one function, it becomes difficult to test, secure, debug, or improve. Clear boundaries let you swap a model, change a tool, tighten policy, or update retrieval without rewriting the whole system.

The most important boundary is between proposal and execution. The model may propose a tool call, plan, draft, or answer. Trusted code must validate the proposal against schema, policy, budget, and user context. This is what makes an agent an application rather than a model improvising with credentials.

Architecture should also include stop conditions. Agents need maximum steps, maximum cost, maximum tool calls, timeout limits, repeated-action detection, and safe fallback responses. Without stop rules, a clever loop can become an expensive failure.

Make orchestration, context, tools, policy, and observability separate concerns.
Keep model judgment inside bounded decisions.
Use trusted code for validation, permissions, budgets, and side effects.
Design stop conditions before adding more tools.
Prefer one reliable agent loop before adding multiple agents.
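
Stop conditions are easiest to reason about when they live in one small object that the loop consults after every step. The sketch below is one possible shape, not a standard API: `RunBudget`, its default limits, and the `action_signature` string format are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RunBudget:
    max_steps: int = 8
    max_cost_usd: float = 0.50
    max_repeats: int = 2            # the same action may run at most this many times
    steps: int = 0
    cost_usd: float = 0.0
    seen_actions: dict = field(default_factory=dict)

    def record(self, action_signature: str, cost_usd: float) -> None:
        """Call once per executed step with a stable description of the action."""
        self.steps += 1
        self.cost_usd += cost_usd
        self.seen_actions[action_signature] = self.seen_actions.get(action_signature, 0) + 1

    def stop_reason(self) -> Optional[str]:
        """Return why the run must stop, or None if it may continue."""
        if self.steps >= self.max_steps:
            return "max_steps"
        if self.cost_usd >= self.max_cost_usd:
            return "max_cost"
        if any(count > self.max_repeats for count in self.seen_actions.values()):
            return "repeated_action"
        return None

budget = RunBudget()
for _ in range(3):
    budget.record("search:q=agents", cost_usd=0.01)
print(budget.stop_reason())  # prints: repeated_action
```

A timeout limit would follow the same pattern, with a start timestamp compared against a deadline inside `stop_reason`.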

Architecture Review in the Right Order

Review an agent architecture from outside to inside. First define the user and workflow outcome. Next list external systems and side effects. Then design tool contracts and permissions. After that, decide what context the model needs. Only then choose prompts, planning style, model, and framework.

This order prevents model-first design. If you begin by asking "which model should we use," you will miss the harder questions: what action is allowed, what evidence is required, who approves, what happens on failure, and how success will be measured.

For every architecture, run a tabletop exercise. Walk through a happy path, a no-evidence path, a tool timeout, an injection attempt, a denied permission, a user cancellation, and a model mistake. If the system has no answer for one of these, the architecture is incomplete.

Finally, map each failure to a trace signal. Production debugging depends on knowing which layer failed: input understanding, retrieval, planning, validation, authorization, tool execution, approval, or final formatting.

Start with workflow value, not model capability.
Design tools and permissions before prompts.
Test happy paths and failure paths during architecture review.
Name the owner of every state field and external action.
Ensure every important decision appears in traces.
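
A tabletop exercise can be kept as data so that gaps are visible in review. The sketch below is illustrative only: the `Layer` names mirror the layers listed above, but the scenario-to-layer assignments and the response descriptions are invented examples, and `uncovered` is a hypothetical helper. A scenario mapped to `None` is one the design has not answered yet.

```python
from enum import Enum

class Layer(Enum):
    INPUT = "input_understanding"
    RETRIEVAL = "retrieval"
    PLANNING = "planning"
    VALIDATION = "validation"
    AUTHORIZATION = "authorization"
    TOOL_EXECUTION = "tool_execution"
    APPROVAL = "approval"
    FORMATTING = "final_formatting"

# Scenario -> (layer whose trace signal reveals it, planned response).
# These assignments are assumptions for illustration, not a prescribed mapping.
TABLETOP = {
    "happy_path": (Layer.FORMATTING, "return the answer with evidence references"),
    "no_evidence": (Layer.RETRIEVAL, "say nothing was found instead of guessing"),
    "tool_timeout": (Layer.TOOL_EXECUTION, "retry once, then return a fallback"),
    "injection_attempt": (Layer.VALIDATION, "reject the proposed call and log it"),
    "denied_permission": (Layer.AUTHORIZATION, "explain the denial to the user"),
    "user_cancellation": None,   # deliberately unanswered to show the gap
    "model_mistake": (Layer.PLANNING, "reviewer pass rejects the draft"),
}

def uncovered(scenarios: dict) -> list:
    """Return the scenarios the architecture has no answer for."""
    return sorted(name for name, plan in scenarios.items() if plan is None)

print(uncovered(TABLETOP))  # prints: ['user_cancellation']
```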

Architecture Walkthrough Exercise

Take one real user request and trace it through the architecture. Write down the input, trusted context, retrieved context, model decision, validated action, tool result, state update, approval decision, and final response. If any step is invisible, add instrumentation or simplify the design.

Then repeat the walkthrough for a failure case: missing evidence, denied permission, invalid tool arguments, or a timeout. The architecture should show how the system recovers or stops safely. An agent design is incomplete if it only describes the happy path.

This walkthrough is useful for code review because it moves discussion away from vague agent behavior and toward concrete responsibilities. Each layer either owns a decision or it does not.

Walk through success and failure paths.
Assign every decision to model, runtime, policy, tool, or user.
Make hidden context assembly visible.
Use the walkthrough to identify missing traces and tests.
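
The walkthrough can be partly automated by comparing the steps a trace actually recorded against the steps the design expects. This is a rough sketch under stated assumptions: the event format (a dictionary with a `step` field) and the `invisible_steps` helper are hypothetical, and the step names simply restate the list from the exercise.

```python
# The steps every request should leave evidence of, in walkthrough order.
WALKTHROUGH_STEPS = (
    "input", "trusted_context", "retrieved_context", "model_decision",
    "validated_action", "tool_result", "state_update",
    "approval_decision", "final_response",
)

def invisible_steps(trace_events: list) -> list:
    """Return expected steps that never appear in the trace, in walkthrough order."""
    recorded = {event["step"] for event in trace_events}
    return [step for step in WALKTHROUGH_STEPS if step not in recorded]

# A hypothetical trace from an under-instrumented agent.
events = [{"step": "input"}, {"step": "model_decision"}, {"step": "final_response"}]
print(invisible_steps(events))
```

Each name in the result is a candidate for new instrumentation or for removal from the design.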

Control and Data Planes

Separate the control plane from the data plane. The control plane selects models and tools, enforces budgets, evaluates guardrails, records state, schedules retries, and decides whether a human must intervene. The data plane carries user input, retrieved evidence, tool arguments, tool results, and model output. Mixing them lets untrusted content influence authority decisions.

The runtime should express each transition as typed state: current goal, evidence references, pending action, approval status, usage, retry count, and stop reason. Model prose can propose a transition, but application code validates it before mutating durable state or calling a tool.

Keep providers behind adapters and domain actions behind narrow interfaces. A model change should not rewrite permission logic, and a tool SDK change should not alter business invariants. Trace identifiers connect model calls, retrieval, tools, approvals, and backend commits into one explainable run.

Keep policy and authority outside model-visible content.
Represent workflow transitions as validated typed state.
Isolate provider APIs from domain actions.
Correlate every external effect with one run and action ID.
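
Typed state with validated transitions might look like the following sketch. The field names follow the list above, while `RunState`, `WRITE_ACTIONS`, `propose_action`, and `can_execute` are hypothetical names chosen for illustration. The state is immutable, so every transition produces a new value that trusted code has checked.

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class RunState:
    goal: str
    evidence_refs: tuple = ()
    pending_action: Optional[str] = None
    approval_status: str = "not_required"   # not_required | pending | approved | denied
    retry_count: int = 0
    stop_reason: Optional[str] = None

# Hypothetical set of actions with side effects that need human approval.
WRITE_ACTIONS = {"refund_order"}

def propose_action(state: RunState, action: str) -> RunState:
    """Validate a model-proposed transition and return the new state."""
    if state.stop_reason is not None:
        raise ValueError("run has already stopped")
    if state.pending_action is not None:
        raise ValueError("another action is still pending")
    status = "pending" if action in WRITE_ACTIONS else "not_required"
    return replace(state, pending_action=action, approval_status=status)

def can_execute(state: RunState) -> bool:
    # Control-plane decision: depends only on typed state, never on model prose.
    return (state.pending_action is not None
            and state.approval_status in ("not_required", "approved"))

state = propose_action(RunState(goal="resolve ticket 88"), "refund_order")
print(state.approval_status, can_execute(state))       # prints: pending False
approved = replace(state, approval_status="approved")  # set by the approval flow
print(can_execute(approved))                           # prints: True
```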

Agent Runtime Examples

Tool Call Validation Shape

This example shows the architecture idea: the model may request a tool call, but code validates the name and arguments before execution.


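# Allowlist of tools the model may request, with each tool's required argument names.
ALLOWED_TOOLS = {"lookup_customer": {"required": {"customer_id"}}}

def validate_tool_call(call: dict) -> None:
    """Raise ValueError unless the call names an allowed tool with all required arguments."""
    tool_name = call.get("name")
    arguments = call.get("arguments", {})

    if tool_name not in ALLOWED_TOOLS:
        raise ValueError(f"Tool is not allowed: {tool_name}")

    # Model output is untrusted: reject arguments that are not a key-value object.
    if not isinstance(arguments, dict):
        raise ValueError("Arguments must be an object")

    required = ALLOWED_TOOLS[tool_name]["required"]
    missing = required - set(arguments)
    if missing:
        raise ValueError(f"Missing required arguments: {sorted(missing)}")

def execute_tool(call: dict) -> dict:
    # Validate first; nothing below runs if validation raises.
    validate_tool_call(call)
    # Stubbed result standing in for a real customer lookup.
    return {"customer_id": call["arguments"]["customer_id"], "status": "active"}

# Simulated model output requesting a tool call.
model_request = {"name": "lookup_customer", "arguments": {"customer_id": "C-1042"}}
print(execute_tool(model_request))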

Validation happens before execution.
The allowlist prevents the model from inventing powerful tool names.
Production systems should also validate types, authorization, rate limits, and audit metadata.
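
Type validation, the first item production systems usually add, can be sketched as follows. This is a hand-rolled illustration with hypothetical names (`TOOL_SCHEMAS`, `REQUIRED`, `validate_arguments`) and an invented `include_orders` parameter. A real system would more likely use a schema library, which this sketch does not attempt to reproduce.

```python
# Hypothetical schemas: tool name -> {argument name: expected Python type}.
TOOL_SCHEMAS = {"lookup_customer": {"customer_id": str, "include_orders": bool}}
REQUIRED = {"lookup_customer": {"customer_id"}}

def validate_arguments(tool_name: str, arguments: object) -> list:
    """Return a list of problems; an empty list means the call is well formed."""
    schema = TOOL_SCHEMAS.get(tool_name)
    if schema is None:
        return [f"unknown tool: {tool_name}"]
    if not isinstance(arguments, dict):
        return ["arguments must be an object"]

    errors = []
    for name in sorted(REQUIRED[tool_name] - set(arguments)):
        errors.append(f"missing: {name}")
    for name, value in arguments.items():
        if name not in schema:
            errors.append(f"unexpected: {name}")      # reject arguments the tool never declared
        elif not isinstance(value, schema[name]):
            errors.append(f"wrong type for {name}: expected {schema[name].__name__}")
    return errors

print(validate_arguments("lookup_customer", {"customer_id": "C-1042"}))  # prints: []
print(validate_arguments("lookup_customer", {"customer_id": 1042, "admin": True}))
```

Returning every problem at once, rather than raising on the first, gives the runtime a complete message to log or to feed back to the model for a corrected proposal.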

Tool Policy Gate with an Audit Trace

The runtime validates a proposed tool call, records the decision, and only then executes the tool. This is the separation between reasoning and execution described in the architecture.


from dataclasses import dataclass, asdict

@dataclass
class ToolCall:
    name: str
    arguments: dict

ALLOWED_TOOLS = {"lookup_order"}
trace = []  # In-memory audit log; a real system would write to durable storage.

def authorize(call: ToolCall, user_tenant: str) -> bool:
    # user_tenant comes from the authenticated session, not from the model.
    allowed = (
        call.name in ALLOWED_TOOLS
        and call.arguments.get("tenant") == user_tenant
    )
    # Record the decision whether or not the call is allowed.
    trace.append({"event": "authorization", "call": asdict(call), "allowed": allowed})
    return allowed

def execute(call: ToolCall) -> dict:
    # Stubbed result standing in for a real order lookup.
    return {"order_id": call.arguments["order_id"], "status": "shipped"}

# Simulated model proposal; execution happens only if authorization passes.
call = ToolCall("lookup_order", {"order_id": "ORD-17", "tenant": "acme"})
result = execute(call) if authorize(call, "acme") else {"error": "denied"}

print(result)
print(trace)

Tool availability and tenant access are checked outside the model.
The trace records both allowed and denied decisions for review.
Change the tenant argument to verify the denied path.

Before you move on