AI Agent Observability and Tracing: Runs, Spans, Metrics, and Debugging

Agent behavior spans multiple model calls, tool executions, state transitions, retries, handoffs, and approvals. Traditional request logs are not enough to explain failures across that chain.

A trace connects all work performed for one goal. Spans represent individual model calls, tools, retrieval steps, policy decisions, and human reviews. Metrics summarize patterns across many traces.

Good observability supports debugging, evaluation, security, cost control, and product improvement while minimizing sensitive data collection.

Mental Model

A final answer is the last frame of a long process. Observability records that process well enough that an engineer can explain what happened without rerunning the task or reconstructing it from memory.

Model the Trace Hierarchy

Assign a stable run ID to the user goal and a span ID to each operation. Record parent-child relationships so a slow or failed final answer can be traced back to the responsible model or tool step.

Include workflow version, model version, prompt version, tool version, tenant, environment, and final status as structured metadata.
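
The hierarchy described above can be sketched with a small data structure. This is a minimal illustration, not a tracing library: the `Span` class, its field names, the `path_to_root` helper, and all version strings are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional
from uuid import uuid4


@dataclass
class Span:
    """One operation inside a run (hypothetical schema, not a library type)."""
    run_id: str
    name: str
    parent_id: Optional[str] = None  # None marks the root span of the run
    status: str = "ok"
    span_id: str = field(default_factory=lambda: str(uuid4()))
    metadata: dict = field(default_factory=dict)


def path_to_root(span: Span, by_id: dict) -> list:
    """Return span names from the given span up to the root of the run."""
    names = []
    current = span
    while current is not None:
        names.append(current.name)
        current = by_id.get(current.parent_id)  # root has no parent, so the loop ends
    return names


run_id = str(uuid4())
root = Span(run_id, "answer_refund_question",
            metadata={"workflow_version": "v3", "environment": "staging"})
model_call = Span(run_id, "plan_next_action", parent_id=root.span_id,
                  metadata={"model_version": "example-model-1", "prompt_version": "p12"})
tool_call = Span(run_id, "search_policy", parent_id=model_call.span_id,
                 status="error", metadata={"tool_version": "2.1"})

by_id = {s.span_id: s for s in (root, model_call, tool_call)}
print(path_to_root(tool_call, by_id))
```

Walking parent links from the failed tool span back to the root shows which model step requested the failing call, which is the question a parent-child record is meant to answer.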

Capture Decisions and State Changes

Log the action selected, sanitized arguments, tool outcome, policy result, retry reason, and compact state delta. Avoid relying only on raw prompt dumps.

Concise decision summaries are usually more useful and safer than storing private hidden reasoning. Store enough evidence to reproduce the control flow.
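
One possible shape for such a record is sketched below. The `DecisionRecord` fields and the `state_delta` helper are assumptions chosen to mirror the list above. They are not a standard schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

_MISSING = object()  # sentinel so a new key set to None still counts as a change


@dataclass
class DecisionRecord:
    """Compact summary of one agent step (hypothetical field names)."""
    run_id: str
    step: int
    action: str
    sanitized_args: dict
    tool_outcome: str
    policy_result: str
    retry_reason: Optional[str]
    state_delta: dict


def state_delta(before: dict, after: dict) -> dict:
    """Keep only keys that were added or changed; removed keys are not tracked here."""
    return {k: v for k, v in after.items() if before.get(k, _MISSING) != v}


before = {"ticket_status": "open", "attempts": 1}
after = {"ticket_status": "open", "attempts": 2, "policy_doc": "refunds-v4"}

record = DecisionRecord(
    run_id="run-123",
    step=2,
    action="search_policy",
    sanitized_args={"query_length": 42},  # a derived value, not the raw query text
    tool_outcome="ok",
    policy_result="allowed",
    retry_reason="timeout on first attempt",
    state_delta=state_delta(before, after),
)
print(json.dumps(asdict(record), sort_keys=True))
```

Storing the delta rather than the full state keeps each record small while still showing what the step changed.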

Measure Operational and Product Outcomes

Operational metrics include latency, token usage, tool errors, retries, timeouts, and cost. Product metrics include task completion, user correction, escalation, abandonment, and business outcome.

A fast, cheap agent that gives wrong answers is not healthy, and a highly accurate agent that takes five minutes may still fail the user.

  • Track p50, p95, and p99 latency by step type.
  • Measure cost and model calls per successful task.
  • Count tool failures, repeated actions, and budget exhaustion.
  • Connect traces to evaluation and user-feedback outcomes.
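
As a rough sketch of the first two bullets, the standard library is enough to compute tail latency by step type and cost per successful task. The sample data and the field names `cost_usd` and `succeeded` are invented for illustration.

```python
from collections import defaultdict
from statistics import quantiles


def latency_percentiles(samples_ms):
    """Return p50, p95, and p99 for a list with at least two samples."""
    cuts = quantiles(samples_ms, n=100, method="inclusive")  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def cost_per_successful_task(runs):
    """Total spend divided by successful runs; None when nothing succeeded."""
    total_cost = sum(r["cost_usd"] for r in runs)
    successes = sum(1 for r in runs if r["succeeded"])
    return total_cost / successes if successes else None


# Hypothetical span samples: (step_type, latency_ms).
spans = [
    ("model_call", 820.0), ("model_call", 910.0), ("model_call", 4300.0),
    ("tool_call", 84.0), ("tool_call", 120.0),
]
by_type = defaultdict(list)
for step_type, latency_ms in spans:
    by_type[step_type].append(latency_ms)
latency_by_step = {t: latency_percentiles(v) for t, v in by_type.items()}

# Hypothetical run summaries.
runs = [
    {"cost_usd": 0.02, "succeeded": True},
    {"cost_usd": 0.05, "succeeded": False},
    {"cost_usd": 0.03, "succeeded": True},
]

print(latency_by_step["model_call"])
print(cost_per_successful_task(runs))
```

The cost figure deliberately includes spend on failed runs, so wasted work raises the cost of each success instead of disappearing from the metric.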

Protect Sensitive Data

Prompts and tool results may contain personal data, credentials, proprietary documents, or regulated information. Define a logging policy before enabling detailed traces.

Redact secrets, hash or tokenize identifiers where possible, restrict trace access, set retention periods, and let tenants opt out when required.
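
Hashing identifiers can be sketched with a keyed hash from the standard library. The function names, the field names `user_id` and `tenant_id`, and the demo key are assumptions. A real deployment would load the key from a secret store and decide its own truncation length.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, key: bytes) -> str:
    """Replace an identifier with a keyed hash so traces can still be joined."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated for readability; keep more characters if collisions matter


def protect_event(event: dict, key: bytes) -> dict:
    """Return a copy of the event with identifier fields pseudonymized."""
    safe = dict(event)
    for field_name in ("user_id", "tenant_id"):  # hypothetical identifier fields
        if field_name in safe:
            safe[field_name] = pseudonymize(str(safe[field_name]), key)
    return safe


DEMO_KEY = b"demo-only-key"  # assumption: the real key comes from a secret manager

event = {"tool": "lookup_customer", "status": "ok",
         "user_id": "private@example.com", "tenant_id": "acme"}
safe_event = protect_event(event, DEMO_KEY)
print(safe_event)
```

A keyed hash is used instead of a plain hash so that someone without the key cannot confirm a guessed email address by hashing it themselves. The same input always maps to the same token, which keeps per-user debugging possible.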

Debug with Trace Comparisons

Compare successful and failed traces for the same task type. Differences in retrieved context, prompt version, model version, tool arguments, or retry behavior often reveal the fault quickly.

Create trace-to-evaluation workflows so production failures can become regression tests after privacy review.
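
A comparison like this can start as a plain dictionary diff over trace summaries. The summary fields below are hypothetical, and the sketch assumes each trace has already been reduced to a flat dictionary.

```python
def diff_traces(good: dict, bad: dict) -> dict:
    """Map each differing field to a (successful_value, failed_value) pair."""
    keys = sorted(set(good) | set(bad))
    return {k: (good.get(k), bad.get(k)) for k in keys if good.get(k) != bad.get(k)}


# Hypothetical summaries of two runs of the same task type.
good_trace = {
    "prompt_version": "p12",
    "model_version": "example-model-1",
    "retrieved_doc_ids": ["refunds-v4"],
    "retries": 0,
}
bad_trace = {
    "prompt_version": "p12",
    "model_version": "example-model-1",
    "retrieved_doc_ids": ["refunds-v2"],
    "retries": 2,
}

for field_name, (ok_value, failed_value) in diff_traces(good_trace, bad_trace).items():
    print(f"{field_name}: ok={ok_value!r} failed={failed_value!r}")
```

In this invented example the prompt and model versions match, so the diff points attention at retrieval and retry behavior.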

Trace the Reasoning Path, Not Private Thoughts

Agent observability should capture decisions and evidence without requiring hidden chain-of-thought. A useful trace records the user request, selected context, model inputs where appropriate, structured outputs, tool calls, tool results, guardrail decisions, approvals, retries, budgets, and final status. This is enough to debug behavior while respecting privacy and model boundaries.

The goal of tracing is accountability. If an agent sends a draft to the wrong queue, you need to see whether retrieval returned the wrong policy, the model selected the wrong tool, the tool schema allowed an ambiguous field, or the approval UI hid an important detail. Without spans around each phase, every incident becomes guesswork.

Design trace fields before production. Retrofitting observability after a failure is painful because the missing evidence is gone. At minimum, include run ID, user or tenant ID, agent version, instruction version, model, tool name, argument classification, status, latency, token usage, approval decision, and error class.

  • Use spans for model calls, tool calls, retrieval, guardrails, and handoffs.
  • Redact sensitive input and output fields before export.
  • Connect trace IDs to user support tickets and audit logs.
  • Track model, prompt, tool, and policy versions together.
  • Review failed and near-miss traces as training data for tests.

Operational Signals for Agents

Traditional service metrics tell you whether the system is up. Agent metrics tell you whether it is useful and safe. Track task success, unsafe action attempts, tool-choice errors, no-evidence answers, approval rejection rate, repeated loops, budget exhaustion, citation correctness, and reviewer edit distance.

Separate quality monitoring from infrastructure monitoring. A low error rate can hide bad answers. A high approval rejection rate may mean the model is proposing risky actions, the UI is unclear, or the policy is too strict. Good dashboards help product, engineering, and security teams see the same reality.

  • Measure outcomes by workflow, not only by endpoint.
  • Sample successful traces as well as failed traces.
  • Watch for drift after model, prompt, retrieval, or policy changes.
  • Turn incidents and reviewer corrections into regression tests.
  • Define alert thresholds for safety metrics, not only uptime.
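
The last bullet can be sketched as a threshold check over safety rates. The metric names, counts, and limits below are illustrative assumptions, not recommended values.

```python
# Illustrative limits only; real thresholds depend on the workflow and its risk.
THRESHOLDS = {
    "unsafe_action_attempt_rate": 0.01,
    "approval_rejection_rate": 0.20,
    "no_evidence_answer_rate": 0.05,
}


def rate(count: int, total: int) -> float:
    """Fraction of events, or 0.0 when there is nothing to divide by."""
    return count / total if total else 0.0


def breached(metrics: dict, thresholds: dict) -> list:
    """Return the names of metrics that exceed their limit, sorted by name."""
    return sorted(name for name, limit in thresholds.items()
                  if metrics.get(name, 0.0) > limit)


# Hypothetical counts for one workflow over one day.
total_runs = 200
metrics = {
    "unsafe_action_attempt_rate": rate(1, total_runs),
    "approval_rejection_rate": rate(15, 50),  # 15 rejections out of 50 approval requests
    "no_evidence_answer_rate": rate(4, total_runs),
}

print(breached(metrics, THRESHOLDS))
```

Note that the approval rejection rate uses approval requests as its denominator, not total runs. Choosing the denominator per metric is part of defining the alert.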

Trace Review as a Team Habit

Observability becomes powerful when the team reviews traces regularly, not only during incidents. Pick successful runs, failed runs, and near misses. For each one, inspect context selection, model decisions, tool calls, guardrails, approvals, costs, latency, and final output.

The key question is: could a new engineer explain why the agent behaved this way? If not, the trace is missing important evidence or the system is too implicit. A good trace should reveal the first wrong decision, not merely the final wrong answer.

Trace review also improves product quality. Repeated user edits may reveal weak instructions. Frequent tool denials may reveal confusing tool descriptions. Long latency spans may reveal retrieval or backend bottlenecks. Observability connects engineering behavior to user experience.

Treat trace privacy carefully. Redact secrets, access tokens, private documents, and unnecessary personal data. The trace should be useful enough to debug but controlled enough to share safely with the right team.

  • Review a sample of successful and failed traces.
  • Look for the first wrong decision in the run.
  • Connect trace patterns to product improvements.
  • Redact sensitive data before export or sharing.
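
Finding the first wrong decision can be partly mechanized once reviewers label spans. The sketch below assumes each span carries a hypothetical `started_at` ordering value and a reviewer-assigned `review` label. Neither field comes from a specific tracing product.

```python
def first_wrong_decision(spans: list):
    """Return the earliest span a reviewer did not mark as ok, or None."""
    for span in sorted(spans, key=lambda s: s["started_at"]):
        if span["review"] != "ok":
            return span
    return None


# Hypothetical reviewed spans, deliberately listed out of order.
reviewed = [
    {"name": "send_draft", "started_at": 4, "review": "wrong_queue"},
    {"name": "retrieve_policy", "started_at": 1, "review": "ok"},
    {"name": "select_tool", "started_at": 3, "review": "ok"},
    {"name": "rank_context", "started_at": 2, "review": "stale_document"},
]

culprit = first_wrong_decision(reviewed)
print(culprit["name"], culprit["review"])
```

Here the final step delivered the draft to the wrong queue, but the earliest flagged span is the context-ranking step, which is where the investigation should start.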

Expert Practice Lab

Open three traces and annotate them manually. Mark the goal, selected context, model decision, tool call, guardrail result, approval decision, cost, latency, and stop reason. This exercise quickly shows whether traces contain the evidence needed for debugging.

If a trace cannot explain why an action happened, add instrumentation. If it exposes too much sensitive data, add redaction. The best traces are useful, safe, and readable by engineers who did not write the original code.

  • Annotate traces for decisions and evidence.
  • Add missing spans around important steps.
  • Redact sensitive context before sharing.
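
Before annotating by hand, a quick check can flag which pieces of evidence a trace lacks. The field names below are an assumed mapping of the annotation list in this lab onto dictionary keys.

```python
# Assumed key names for the evidence listed in the lab exercise.
REQUIRED_EVIDENCE = (
    "goal", "selected_context", "model_decision", "tool_call",
    "guardrail_result", "approval_decision", "cost_usd", "latency_ms",
    "stop_reason",
)


def missing_evidence(trace: dict) -> list:
    """List required fields that are absent or empty; zero values count as present."""
    return [name for name in REQUIRED_EVIDENCE
            if trace.get(name) in (None, "", [], {})]


# Hypothetical trace summary with two gaps.
trace = {
    "goal": "answer refund question",
    "selected_context": ["refunds-v4"],
    "model_decision": "call search_policy",
    "tool_call": {"name": "search_policy", "status": "ok"},
    "guardrail_result": "allowed",
    "approval_decision": "",
    "cost_usd": 0.0,
    "latency_ms": 1320,
}

print(missing_evidence(trace))
```

Each missing field is a candidate for a new span or attribute. In this invented trace the approval decision and stop reason were never recorded.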

Final Expert Note

Keep a small library of annotated traces for onboarding, release review, and incident drills so debugging knowledge does not stay trapped with one engineer.

Review Margin

For expert-level work, keep this page connected to an actual run trace. Concepts become much easier to understand when learners can see the input, state, model decision, tool behavior, safety check, and final outcome side by side.

Nested Run and Span Tracing

A small context manager captures duration and status for every operation.

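from contextlib import contextmanager
from time import perf_counter
from uuid import uuid4

# One run ID per user goal; every span emitted below carries it.
run_id = str(uuid4())

@contextmanager
def span(name: str, **metadata):
    """Emit start and finish events for one operation, with status and latency."""
    span_id = str(uuid4())
    started = perf_counter()
    print({"event": "span_started", "run_id": run_id, "span_id": span_id,
           "name": name, "metadata": metadata})
    # Assume failure until the body completes, so status is always defined
    # in the finally block, even for KeyboardInterrupt or other BaseException.
    status = "error"
    try:
        yield span_id
        status = "ok"
    finally:
        # Runs on success and on failure; any exception propagates afterwards.
        print({"event": "span_finished", "run_id": run_id, "span_id": span_id,
               "name": name, "status": status,
               "latency_ms": round((perf_counter() - started) * 1000, 2)})

with span("search_policy", tool_version="2.1"):
    result = {"matches": 3}
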
  • All spans share one run ID.
  • Status and duration are emitted even when an exception occurs.
  • Production tracing should send structured events to a protected backend.

Redact Sensitive Trace Fields

A small allowlist prevents accidental storage of raw secrets.

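# Only these keys may be stored; anything else is dropped by default.
ALLOWED_FIELDS = {"tool", "status", "latency_ms", "result_count"}

def sanitize_trace(details: dict) -> dict:
    """Return a new dict containing only the allowlisted keys."""
    return {
        key: value
        for key, value in details.items()
        if key in ALLOWED_FIELDS
    }

raw = {
    "tool": "lookup_customer",
    "status": "ok",
    "latency_ms": 84,
    "email": "private@example.com",  # personal data: not allowlisted
    "api_key": "secret",             # credential: not allowlisted
    "result_count": 1,
}

# The email and api_key entries are absent from the printed result.
print(sanitize_trace(raw))
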
  • Allowlisting is safer than trying to recognize every secret format.
  • Sensitive values never enter the trace object.
  • Different tools may need separate approved metadata schemas.

Key Takeaways

  • Assign run and span identifiers to every agent operation.
  • Record versions, decisions, tool outcomes, and compact state changes.
  • Measure quality, latency, cost, retries, and user outcomes together.
  • Redact sensitive data before it enters the tracing system.
  • Turn reviewed production failures into regression evaluations.

Common Mistakes to Avoid

  • Logging only the final answer and losing the action history.
  • Collecting full prompts and tool payloads without a privacy policy.
  • Tracking average latency while ignoring slow-tail behavior.
  • Building dashboards without connecting metrics to task success.

Practice Tasks

  • Add run IDs and nested spans to a three-step agent.
  • Create an allowlist-based trace sanitizer.
  • Define five operational and five product metrics.
  • Compare one successful and one failed trace and write the likely root cause.

Frequently Asked Questions

Should traces store full prompts and tool payloads?

Only when justified and protected. Prefer structured metadata, approved samples, redaction, and limited retention.

What is the difference between logs and traces?

Logs are individual events. Traces connect related events into the end-to-end execution path for one run.

Which metrics should I track first?

Start with task success, failure and escalation rates, p95 latency, and cost per successful task.
