Agent behavior spans multiple model calls, tool executions, state transitions, retries, handoffs, and approvals. Traditional request logs are not enough to explain failures across that chain.
A trace connects all work performed for one goal. Spans represent individual model calls, tools, retrieval steps, policy decisions, and human reviews. Metrics summarize patterns across many traces.
Good observability supports debugging, evaluation, security, cost control, and product improvement while minimizing sensitive data collection.
A final answer is the last frame of a long process. Observability records the process well enough that an engineer can explain what happened without rerunning the task or reconstructing it from memory.
Assign a stable run ID to the user goal and a span ID to each operation. Record parent-child relationships so a slow or failed final answer can be traced back to the responsible model or tool step.
Include workflow version, model version, prompt version, tool version, tenant, environment, and final status as structured metadata.
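The run ID, span ID, and parent-child links described above can be sketched as a small record type. This is a minimal illustration, not a standard schema: `SpanRecord`, `path_to_root`, and every metadata value below are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Optional
from uuid import uuid4


@dataclass
class SpanRecord:
    """One operation in a run. Field names are illustrative assumptions."""
    run_id: str
    name: str
    status: str = "ok"
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: str(uuid4()))
    metadata: dict = field(default_factory=dict)


def path_to_root(span_id: str, spans: dict) -> list:
    """Follow parent links from a span up to the root; return names root-first."""
    names = []
    current = spans.get(span_id)
    while current is not None:
        names.append(current.name)
        current = spans.get(current.parent_id)  # the root has parent_id None
    return list(reversed(names))


run_id = str(uuid4())  # one stable ID for the whole user goal
root = SpanRecord(run_id, "answer_refund_question",
                  metadata={"workflow_version": "v3", "environment": "staging"})
model = SpanRecord(run_id, "model_call", parent_id=root.span_id,
                   metadata={"model_version": "example-model-1", "prompt_version": "p7"})
tool = SpanRecord(run_id, "search_policy", status="error", parent_id=model.span_id,
                  metadata={"tool_version": "2.1"})
spans = {s.span_id: s for s in (root, model, tool)}

# Trace the failed tool step back to the goal that triggered it.
print(path_to_root(tool.span_id, spans))
```

Keeping the parent link on each record, rather than nesting spans inside each other, lets spans be emitted independently and joined later by ID.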
Log the action selected, sanitized arguments, tool outcome, policy result, retry reason, and compact state delta. Avoid relying only on raw prompt dumps.
Concise decision summaries are usually more useful and safer than storing private hidden reasoning. Store enough evidence to reproduce the control flow.
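One way to record a decision without dumping raw prompts is a structured event that carries the action, its outcome, and only the part of the state that changed. The sketch below assumes arguments have already been sanitized; `state_delta`, `decision_event`, and the field names are hypothetical.

```python
import json


def state_delta(before: dict, after: dict) -> dict:
    """Return only the keys whose values differ between two state snapshots.

    A key that is missing is treated the same as a key set to None.
    """
    delta = {}
    for key in before.keys() | after.keys():
        if before.get(key) != after.get(key):
            delta[key] = {"before": before.get(key), "after": after.get(key)}
    return delta


def decision_event(action, arguments, outcome, policy_result,
                   before, after, retry_reason=None) -> dict:
    """Build a compact, structured record of one agent decision."""
    return {
        "action": action,
        "arguments": arguments,          # assumed to be sanitized by the caller
        "tool_outcome": outcome,
        "policy_result": policy_result,
        "retry_reason": retry_reason,
        "state_delta": state_delta(before, after),
    }


before = {"ticket_status": "open", "attempts": 1}
after = {"ticket_status": "escalated", "attempts": 1, "queue": "billing"}

event = decision_event(
    action="escalate_ticket",
    arguments={"queue": "billing"},
    outcome="ok",
    policy_result="allowed",
    before=before,
    after=after,
    retry_reason="timeout",
)
print(json.dumps(event, sort_keys=True))
```

The unchanged `attempts` value never reaches the log, which keeps events small and makes the actual control-flow change easy to spot.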
Operational metrics include latency, token usage, tool errors, retries, timeouts, and cost. Product metrics include task completion, user correction, escalation, abandonment, and business outcome.
A fast, cheap agent that gives wrong answers is not healthy, and a highly accurate agent that takes five minutes may still fail the user experience.
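To show how operational and product metrics can sit side by side, here is a small sketch that summarizes a handful of runs. The run records and their field names (`latency_ms`, `cost_usd`, `success`) are invented for illustration, and the percentile uses the simple nearest-rank method.

```python
import math


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("p95 needs at least one value")
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def summarize(runs):
    """Combine a product metric (success) with operational ones (latency, cost)."""
    successes = sum(1 for run in runs if run["success"])
    total_cost = sum(run["cost_usd"] for run in runs)
    return {
        "task_success_rate": successes / len(runs),
        "p95_latency_ms": p95([run["latency_ms"] for run in runs]),
        # Undefined when nothing succeeded, so report None instead of dividing by zero.
        "cost_per_successful_task": total_cost / successes if successes else None,
    }


# Made-up sample data for the example.
runs = [
    {"latency_ms": 900, "cost_usd": 0.02, "success": True},
    {"latency_ms": 1200, "cost_usd": 0.03, "success": True},
    {"latency_ms": 1500, "cost_usd": 0.02, "success": False},
    {"latency_ms": 40000, "cost_usd": 0.10, "success": False},
    {"latency_ms": 1100, "cost_usd": 0.03, "success": True},
]

summary = summarize(runs)
print(summary)
```

Dividing total cost by successful tasks, rather than by all tasks, makes failed runs show up as a higher price for each useful result.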
Prompts and tool results may contain personal data, credentials, proprietary documents, or regulated information. Define a logging policy before enabling detailed traces.
Redact secrets, hash or tokenize identifiers where possible, restrict trace access, set retention periods, and let tenants opt out when required.
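Redaction and identifier tokenization might look like the following sketch. The key, the regular expression, and the helper names are assumptions for the example; a real deployment would load the key from a secret manager and tune the pattern to the secrets it actually handles.

```python
import hashlib
import hmac
import re

# Assumption: in practice this key comes from a managed secret store.
TOKEN_KEY = b"replace-with-a-managed-secret"

# Illustrative pattern: matches "api_key=...", "token: ...", "password=...".
SECRET_PATTERN = re.compile(r"(?i)\b(api[_-]?key|token|password)\s*[=:]\s*\S+")


def tokenize_identifier(value: str) -> str:
    """Replace an identifier with a keyed hash that is stable but not reversible."""
    digest = hmac.new(TOKEN_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "id_" + digest[:16]


def redact_text(text: str) -> str:
    """Mask the value of anything that looks like a credential assignment."""
    return SECRET_PATTERN.sub(lambda match: match.group(1) + "=[REDACTED]", text)


user_token = tokenize_identifier("private@example.com")
message = redact_text("calling billing with api_key=abc123 for user")

print(user_token)
print(message)
```

A keyed hash gives the same token for the same identifier, so traces for one user can still be grouped without storing the identifier itself.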
Compare successful and failed traces for the same task type. Differences in retrieved context, prompt version, model version, tool arguments, or retry behavior often reveal the fault quickly.
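A comparison like this can be partly automated by diffing the structured metadata of two traces. The sketch below assumes each trace has been flattened to a dictionary; the field list and sample values are hypothetical.

```python
# Fields worth comparing first; the list is an assumption for this example.
COMPARED_FIELDS = ("prompt_version", "model_version", "retrieved_doc_ids",
                   "tool_arguments", "retries")


def diff_traces(succeeded: dict, failed: dict, fields=COMPARED_FIELDS) -> dict:
    """Return only the fields whose values differ between the two traces."""
    return {
        name: {"succeeded": succeeded.get(name), "failed": failed.get(name)}
        for name in fields
        if succeeded.get(name) != failed.get(name)
    }


good = {"prompt_version": "p7", "model_version": "m1",
        "retrieved_doc_ids": ["refund-2024"], "retries": 0}
bad = {"prompt_version": "p8", "model_version": "m1",
       "retrieved_doc_ids": ["refund-2019"], "retries": 2}

differences = diff_traces(good, bad)
for name, values in differences.items():
    print(name, values)
```

Here the model version drops out of the diff because it is identical in both runs, leaving the prompt version, retrieved documents, and retry count as the candidates to investigate.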
Create trace-to-evaluation workflows so production failures can become regression tests after privacy review.
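As a rough sketch of such a workflow, a reviewed production failure can be converted into a small test case and replayed against a candidate tool-selection function. The trace fields (`privacy_reviewed`, `sanitized_request`, `expected_tool`, `selected_tool`) and both helper functions are hypothetical.

```python
def to_regression_case(trace: dict) -> dict:
    """Turn a failed production trace into an evaluation case.

    Refuses traces that have not passed privacy review.
    """
    if not trace.get("privacy_reviewed"):
        raise ValueError("trace has not passed privacy review")
    return {
        "case_id": "regression-" + trace["run_id"],
        "input": trace["sanitized_request"],
        "expected_tool": trace["expected_tool"],   # filled in by a reviewer
        "observed_tool": trace["selected_tool"],   # what the agent did in production
    }


def evaluate(case: dict, choose_tool) -> bool:
    """Return True if the candidate tool-selection function now picks the expected tool."""
    return choose_tool(case["input"]) == case["expected_tool"]


failed_trace = {
    "run_id": "run-123",
    "privacy_reviewed": True,
    "sanitized_request": "customer asks for a refund status update",
    "selected_tool": "send_draft",
    "expected_tool": "lookup_refund",
}

case = to_regression_case(failed_trace)

# Stand-ins for the old and the fixed selection logic.
old_behavior = lambda request: "send_draft"
new_behavior = lambda request: "lookup_refund" if "refund" in request else "send_draft"

print(evaluate(case, old_behavior), evaluate(case, new_behavior))
```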
Agent observability should capture decisions and evidence without requiring hidden chain-of-thought. A useful trace records the user request, selected context, model inputs where appropriate, structured outputs, tool calls, tool results, guardrail decisions, approvals, retries, budgets, and final status. This is enough to debug behavior while respecting privacy and model boundaries.
The goal of tracing is accountability. If an agent sends a draft to the wrong queue, you need to see whether retrieval returned the wrong policy, the model selected the wrong tool, the tool schema allowed an ambiguous field, or the approval UI hid an important detail. Without spans around each phase, every incident becomes guesswork.
Design trace fields before production. Retrofitting observability after a failure is painful because the evidence you needed was never recorded. At minimum, include run ID, user or tenant ID, agent version, instruction version, model, tool name, argument classification, status, latency, token usage, approval decision, and error class.
Traditional service metrics tell you whether the system is up. Agent metrics tell you whether it is useful and safe. Track task success, unsafe action attempts, tool-choice errors, no-evidence answers, approval rejection rate, repeated loops, budget exhaustion, citation correctness, and reviewer edit distance.
Separate quality monitoring from infrastructure monitoring. A low error rate can hide bad answers. A high approval rejection rate may mean the model is proposing risky actions, the UI is unclear, or the policy is too strict. Good dashboards help product, engineering, and security teams see the same reality.
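Two of the quality signals named above, repeated loops and approval rejection rate, can be computed from data a trace already holds. This sketch assumes tool calls are recorded as `(tool, arguments)` tuples and approvals as simple strings; both representations are made up for the example.

```python
def longest_repeat(calls) -> int:
    """Length of the longest run of consecutive identical tool calls."""
    longest = current = 0
    previous = None
    for call in calls:
        current = current + 1 if call == previous else 1
        longest = max(longest, current)
        previous = call
    return longest


def rejection_rate(decisions) -> float:
    """Share of approval requests that reviewers rejected."""
    if not decisions:
        return 0.0
    return sum(1 for decision in decisions if decision == "rejected") / len(decisions)


# Hypothetical trace data: the agent searched three times before acting.
calls = [
    ("search_policy", "refund policy"),
    ("search_policy", "refund policy"),
    ("search_policy", "refund policy"),
    ("send_draft", "queue=billing"),
]
decisions = ["approved", "rejected", "approved", "rejected"]

loop_length = longest_repeat(calls)
rate = rejection_rate(decisions)
print(loop_length, rate)
```

Neither number would appear on an infrastructure dashboard: every call above could return successfully while the agent still loops or proposes actions that reviewers keep rejecting.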
Observability becomes powerful when the team reviews traces regularly, not only during incidents. Pick successful runs, failed runs, and near misses. For each one, inspect context selection, model decisions, tool calls, guardrails, approvals, costs, latency, and final output.
The key question is: could a new engineer explain why the agent behaved this way? If not, the trace is missing important evidence or the system is too implicit. A good trace should reveal the first wrong decision, not merely the final wrong answer.
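Finding the first wrong decision can start with a simple query over the spans of one run. The sketch assumes each span carries a `started_at` timestamp and a `status` that an automated check or a reviewer may set to something other than `"ok"`; these field names and values are illustrative.

```python
def first_failure(spans):
    """Return the earliest span whose status is not "ok", or None if all passed."""
    failed = [span for span in spans if span["status"] != "ok"]
    return min(failed, key=lambda span: span["started_at"], default=None)


# Spans may arrive out of order, so sort by start time rather than list position.
trace_spans = [
    {"name": "send_draft", "started_at": 4.0, "status": "error"},
    {"name": "retrieve_policy", "started_at": 1.0, "status": "flagged"},
    {"name": "model_call", "started_at": 2.5, "status": "ok"},
    {"name": "select_tool", "started_at": 3.0, "status": "error"},
]

culprit = first_failure(trace_spans)
print(culprit["name"] if culprit else "no failed span")
```

In this made-up run the final `send_draft` error is only a symptom; the earliest flagged step is retrieval, which is where a review would begin.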
Trace review also improves product quality. Repeated user edits may reveal weak instructions. Frequent tool denials may reveal confusing tool descriptions. Long latency spans may reveal retrieval or backend bottlenecks. Observability connects engineering behavior to user experience.
Treat trace privacy carefully. Redact secrets, access tokens, private documents, and unnecessary personal data. The trace should be useful enough to debug but controlled enough to share safely with the right team.
Open three traces and annotate them manually. Mark the goal, selected context, model decision, tool call, guardrail result, approval decision, cost, latency, and stop reason. This exercise quickly shows whether traces contain the evidence needed for debugging.
If a trace cannot explain why an action happened, add instrumentation. If it exposes too much sensitive data, add redaction. The best traces are useful, safe, and readable by engineers who did not write the original code.
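The annotation checklist can also be turned into an automated completeness check. In this sketch the evidence keys mirror the items listed in the exercise; the key names and the sample trace are assumptions.

```python
# One key per item in the manual annotation checklist (names are illustrative).
REQUIRED_EVIDENCE = (
    "goal", "selected_context", "model_decision", "tool_call",
    "guardrail_result", "approval_decision", "cost", "latency_ms", "stop_reason",
)


def missing_evidence(trace: dict) -> list:
    """List checklist items that are absent or empty in a trace.

    Numeric zero counts as present, so a free or instant step is not flagged.
    """
    return [
        key for key in REQUIRED_EVIDENCE
        if trace.get(key) in (None, "", [], {})
    ]


sample_trace = {
    "goal": "draft a reply about refund status",
    "selected_context": ["refund-2024"],
    "model_decision": "call lookup_refund",
    "tool_call": {"tool": "lookup_refund", "status": "ok"},
    "guardrail_result": "",
    "cost": 0,
    "latency_ms": 840,
}

gaps = missing_evidence(sample_trace)
print(gaps)
```

Each missing key points to a place where instrumentation should be added before the next incident.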
Keep a small library of annotated traces for onboarding, release review, and incident drills so debugging knowledge does not stay trapped with one engineer.
For expert-level work, keep this page connected to an actual run trace. Concepts become much easier to understand when learners can see the input, state, model decision, tool behavior, safety check, and final outcome side by side.
A small context manager captures duration and status for every operation.
```python
from contextlib import contextmanager
from time import perf_counter
from uuid import uuid4

# One run ID ties together every span emitted for this user goal.
run_id = str(uuid4())


@contextmanager
def span(name: str, **metadata):
    """Emit a start and finish event around one operation."""
    span_id = str(uuid4())
    started = perf_counter()
    print({"event": "span_started", "run_id": run_id, "span_id": span_id,
           "name": name, "metadata": metadata})
    # Assume failure until the body completes, so the finish event always has
    # a status, even for interrupts that are not Exception subclasses.
    status = "error"
    try:
        yield span_id
        status = "ok"
    finally:
        print({"event": "span_finished", "run_id": run_id, "span_id": span_id,
               "name": name, "status": status,
               "latency_ms": round((perf_counter() - started) * 1000, 2)})


with span("search_policy", tool_version="2.1"):
    result = {"matches": 3}
```
A small allowlist prevents accidental storage of raw secrets.
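```python
# Only these keys may be written to a trace; everything else is dropped.
ALLOWED_FIELDS = {"tool", "status", "latency_ms", "result_count"}


def sanitize_trace(details: dict) -> dict:
    """Keep allowlisted fields and discard the rest, including unknown keys."""
    return {
        key: value
        for key, value in details.items()
        if key in ALLOWED_FIELDS
    }


raw = {
    "tool": "lookup_customer",
    "status": "ok",
    "latency_ms": 84,
    "email": "private@example.com",
    "api_key": "secret",
    "result_count": 1,
}

# "email" and "api_key" are not on the allowlist, so they never reach the log.
print(sanitize_trace(raw))
# {'tool': 'lookup_customer', 'status': 'ok', 'latency_ms': 84, 'result_count': 1}
```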
Store raw prompts and tool outputs only when justified and protected. Prefer structured metadata, approved samples, redaction, and limited retention.
Logs are individual events. Traces connect related events into the end-to-end execution path for one run.
For a first set of metrics, start with task success rate, failure and escalation rates, p95 latency, and cost per successful task.