The model is one component of an agent, not the agent itself. It interprets the current task, available context, tool descriptions, and prior observations to propose the next useful action.
Instructions define the operating contract. They explain the agent role, success criteria, boundaries, tool-use rules, response format, and situations that require clarification or human escalation.
Reliable systems select models by task difficulty and risk, keep durable rules separate from user input, and validate structured outputs before downstream code trusts them.
The model supplies capability; the instruction and context layers turn that capability into a specific product behavior. Model quality cannot rescue unclear policy, missing context, or unsafe permissions.
Model selection is an engineering tradeoff among reasoning quality, instruction following, tool-call accuracy, context size, latency, availability, and price. A support classifier and a repository-wide coding agent do not need the same model.
Use evaluation data from your own tasks. Build a representative set of easy, ambiguous, adversarial, and failure cases, then compare candidate models on completion quality, invalid tool calls, latency, and cost.
Durable application rules belong in the system or developer instruction layer. The user message supplies the goal and relevant preferences, but it must not be allowed to rewrite security policy or grant itself tools.
A useful instruction set answers six questions: who is the agent, what outcome should it produce, what context may it trust, which tools may it request, when must it stop, and what output shape must it return?
Context engineering is the work of assembling the smallest useful set of instructions, state, retrieved evidence, tool descriptions, and conversation history for the next decision. More context is not automatically better.
Long, duplicated, or stale context can bury important rules and increase cost. Summarize old interaction history, retrieve only relevant knowledge, and clearly mark untrusted content such as webpages and uploaded documents.
Free-form prose is appropriate for a final explanation, but downstream automation needs a typed contract. Define fields, allowed values, required properties, and validation behavior before connecting model output to business logic.
Validation failure should become a normal workflow branch. The runtime can retry once with a concise correction, use a fallback parser, request clarification, or escalate instead of silently guessing.
Instructions are production code. Give them versions, review changes, and connect each release to an evaluation result. A small wording change can alter tool choice, refusal behavior, or output formatting.
Test instructions against normal tasks, underspecified requests, conflicting user instructions, prompt-injection attempts, missing evidence, tool failures, and requests outside the agent role.
Model selection should come after task design. First define the job, success criteria, allowed tools, context sources, output format, and risk level. Then choose the smallest reliable model for each step. A classification step, a retrieval query rewrite, and a legal-risk explanation may need different model routes because they have different cost, latency, and reliability requirements.
Instructions should describe role, goal, boundaries, tool-use rules, evidence rules, and stop behavior. Avoid writing instructions as motivational slogans. A good instruction tells the model what to do when evidence is missing, when a tool fails, when the user requests something outside policy, and when it should ask for clarification.
Structured outputs deserve special attention. If downstream code depends on fields, make the schema explicit and validate it outside the model. The model may generate JSON, but trusted code decides whether that JSON is complete, authorized, and safe to execute. This is especially important when structured output becomes tool arguments or workflow state.
Treat prompts and instructions as versioned product assets. Store the instruction version with traces and evaluation results. When quality changes, the team should know whether the cause was model choice, instruction wording, retrieval changes, tool behavior, or policy.
Create a model-routing table for one agent. For each step, list the task, required reasoning level, risk, expected output shape, latency budget, and fallback model. This makes model choice concrete instead of emotional.
Then write an instruction contract for the highest-risk step. Include what the model should do, what it must not do, how it should handle missing evidence, and when it should ask for clarification or escalate. Test the instruction with normal, ambiguous, and adversarial inputs.
Compare outputs across at least two model routes so the team can see whether a stronger model improves correctness, safety, latency, or reviewer effort enough to justify the cost.
For expert-level work, keep this page connected to an actual run trace. Concepts become much easier to understand when learners can see the input, state, model decision, tool behavior, safety check, and final outcome side by side.
This example validates a model-like decision before the runtime acts on it.
from dataclasses import dataclass
from typing import Literal
AllowedAction = Literal["search_policy", "ask_user", "finish"]
@dataclass
class AgentDecision:
action: AllowedAction
reason_summary: str
query: str | None = None
def validate_decision(raw: dict) -> AgentDecision:
allowed = {"search_policy", "ask_user", "finish"}
action = raw.get("action")
if action not in allowed:
raise ValueError(f"Unsupported action: {action}")
if action == "search_policy" and not raw.get("query"):
raise ValueError("search_policy requires a query")
return AgentDecision(
action=action,
reason_summary=str(raw.get("reason_summary", ""))[:200],
query=raw.get("query"),
)
decision = validate_decision({
"action": "search_policy",
"reason_summary": "The refund rule is not in the current context.",
"query": "refund eligibility window",
})
print(decision)
A simple router can keep routine work fast while reserving a stronger model for harder tasks.
def choose_model(task: dict) -> str:
if task["risk"] == "high":
return "strong-reasoning-model"
if task["type"] in {"classify", "extract", "format"}:
return "fast-small-model"
if task["context_tokens"] > 30_000:
return "long-context-model"
return "balanced-model"
task = {
"type": "classify",
"risk": "low",
"context_tokens": 1200,
}
print(choose_model(task))
Usually no. Ask for concise plans, decisions, evidence, and action summaries that can be inspected without requiring private hidden reasoning.
Long enough to define the contract, but no longer. Remove duplicated prose and move enforceable rules into code.
Only when your application explicitly supports that behavior. User content should never override security, permissions, or organizational policy.
Explore 500+ free tutorials across 20+ languages and frameworks.