LangGraph Errors and Retries: Failure Taxonomy, RetryPolicy, and Recovery Design

Not All Failures Are the Same

Production graph design is not about pretending failures will not happen. It is about deciding which failures deserve automatic retry, which deserve alternate routes, and which should stop the workflow immediately.

LangGraph makes those choices visible because failures can be handled at the node, route, and persistence layers instead of disappearing inside a generic agent loop.

This page focuses on the recovery mindset: classify failures first, then choose the smallest safe retry or fallback behavior that fits each class.

A timeout from a remote API is different from invalid state, which is different from a rejected human approval, which is different from a tool permission violation. Retrying all of them the same way wastes money and can make incidents worse.

Good recovery begins with a failure taxonomy: transient, validation, business-rule, external-system, and unknown.

Transient: network jitter, short provider outages
Validation: malformed inputs or missing fields
Business-rule: action blocked by policy
External-system: downstream service hard failure
Unknown: anything that fits none of the above; treat it as non-retryable until it has been triaged
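
The taxonomy above can be made concrete with a small classifier. This is a minimal sketch, not a LangGraph API: `FailureClass`, `PolicyViolation`, `DownstreamFailure`, and `classify` are hypothetical names, and the mapping from Python's built-in exceptions to classes is an assumption that a real system would replace with its own client and domain errors.

```python
from enum import Enum


class FailureClass(Enum):
    TRANSIENT = "transient"
    VALIDATION = "validation"
    BUSINESS_RULE = "business_rule"
    EXTERNAL_SYSTEM = "external_system"
    UNKNOWN = "unknown"


class PolicyViolation(Exception):
    """Hypothetical error raised when policy blocks an action."""


class DownstreamFailure(Exception):
    """Hypothetical error for a hard failure in a downstream service."""


def classify(exc: BaseException) -> FailureClass:
    """Map an exception to a failure class (illustrative mapping only)."""
    if isinstance(exc, PolicyViolation):
        return FailureClass.BUSINESS_RULE
    if isinstance(exc, DownstreamFailure):
        return FailureClass.EXTERNAL_SYSTEM
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return FailureClass.TRANSIENT
    if isinstance(exc, (ValueError, KeyError)):
        return FailureClass.VALIDATION
    # Anything unrecognized stays visible as UNKNOWN instead of being guessed.
    return FailureClass.UNKNOWN
```

Keeping an explicit `UNKNOWN` class means unclassified errors show up in triage rather than silently inheriting the retry behavior of some other class.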

Retry Only Where Another Attempt Can Realistically Help

A retry is justified when the next attempt has a meaningful chance of success and the side effects are safe. It is not justified when the error comes from invalid user data, a permanent permission issue, or a deterministic policy rejection.

This sounds obvious, but many agent systems quietly retry everything because it is easier to wire one generic loop than to think about failure semantics.

Retry transient network and rate-limit failures.
Do not retry deterministic validation failures without changing input.
Escalate or fail fast when business rules explicitly forbid the action.
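
The three rules above can be written as a single predicate. The sketch below is plain Python and assumes two hypothetical exception types, `RateLimitError` and `PermissionDenied`, in place of whatever your API client raises. LangGraph's documentation describes a `retry_on` option on `RetryPolicy`; a predicate of this shape may be usable there, but that wiring is an assumption to verify against your installed version.

```python
class RateLimitError(Exception):
    """Hypothetical error a client raises when the provider rate-limits a call."""


class PermissionDenied(Exception):
    """Hypothetical error for a permanent permission problem."""


def should_retry(exc: Exception) -> bool:
    """Return True only for failures where another attempt can plausibly help."""
    # Network trouble and rate limits may clear on their own.
    # Validation, permission, and policy errors will not, so they are excluded.
    return isinstance(exc, (TimeoutError, ConnectionError, RateLimitError))
```

The default is deliberately "do not retry": a new exception type has to be added to the retryable tuple on purpose.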

RetryPolicy and Graph-Level Recovery

LangGraph's documentation covers timeout and error-handling primitives such as `RetryPolicy`. The architectural point is not that a retry helper exists, but that retries have to fit together with state, observability, and eventual fallback routing.

A node-level retry can smooth over flaky infrastructure, but the graph should still record what happened so operators can distinguish a clean run from a self-healing one.

Count retries in state or logs.
Record the final error when retries exhaust.
Route exhausted failures to summary, review, or manual ops.
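
One way to let operators tell a clean run from a self-healing one is to derive a label from the recorded attempt data. This is a minimal sketch; `RunRecord` and `run_outcome` are hypothetical names, and it assumes the node (or a wrapper around it) records `attempts` and `final_error` somewhere readable, such as state or structured logs.

```python
from typing import Optional, TypedDict


class RunRecord(TypedDict):
    attempts: int               # total attempts the node made
    final_error: Optional[str]  # None when the last attempt succeeded


def run_outcome(record: RunRecord) -> str:
    """Label a node run so dashboards can separate clean runs from recovered ones."""
    if record["final_error"] is not None:
        return "exhausted"      # candidates for summary, review, or manual ops
    if record["attempts"] > 1:
        return "self_healed"    # succeeded, but only after retrying
    return "clean"
```

A rising share of `self_healed` runs is an early warning that a dependency is degrading, even while every run still succeeds.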

Idempotency Is the Hidden Requirement

Retries are safe only when repeating the action is safe. Read-only lookups are usually fine. Charge creation, email sending, or refund issuance can duplicate side effects if you retry blindly.

That is why risky write actions need idempotency keys, confirmation checks, or human approval paths before automated retry enters the picture.

Distinguish read retries from write retries.
Use idempotency controls for external mutations.
Log the action identity so duplicate effects are detectable.
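
The idea behind an idempotency key can be sketched without any external service. In the example below, `idempotency_key` and `InMemoryLedger` are hypothetical: the ledger stands in for a payment or email provider that honors idempotency keys, and real providers define their own key format and retention rules.

```python
import hashlib
import json


def idempotency_key(action: str, payload: dict) -> str:
    """Derive a stable key from the action identity, independent of key order."""
    canonical = json.dumps({"action": action, "payload": payload}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class InMemoryLedger:
    """Stand-in for an external system that deduplicates by idempotency key."""

    def __init__(self) -> None:
        self._results = {}     # key -> result of the first execution
        self.side_effects = 0  # how many times the action really ran

    def execute(self, key: str, action: str) -> str:
        if key in self._results:
            # A retry with the same key returns the first result; nothing runs again.
            return self._results[key]
        self.side_effects += 1
        result = f"{action}-done-{self.side_effects}"
        self._results[key] = result
        return result
```

Because the key is derived from the action and its payload, a retried refund for the same order maps to the same key, while a refund for a different order does not.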

Execution Analysis: Fail, Retry, Recover, or Escalate

A healthy recovery flow looks like this: a node fails, the error is classified, a limited retry occurs if appropriate, state records the attempt count, and the graph either succeeds, routes to fallback, or escalates. Nothing is hidden.

This structure gives you post-incident answers. You know not only that the run recovered, but how much effort recovery cost and which boundary first failed.

Start state includes `retry_count` or failure metadata.
Node fails with a typed or inspectable error.
Recovery logic decides between retry and an alternate path.
Exhausted failure reaches a clear terminal or review path.
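
The flow above can be traced in plain Python before it is expressed as graph nodes and edges. This sketch makes several assumptions: `FlowState` and `run_with_recovery` are hypothetical names, `TimeoutError` stands in for "transient", and every other exception is sent to a fallback without retrying. In a real graph the loop would be a conditional edge rather than a `while` statement.

```python
from typing import Callable, List, TypedDict


class FlowState(TypedDict):
    retry_count: int
    last_error: str
    events: List[str]


def run_with_recovery(call: Callable[[], object], max_retries: int = 2) -> FlowState:
    """Run `call`, recording every recovery decision in the returned state."""
    state: FlowState = {"retry_count": 0, "last_error": "", "events": []}
    while True:
        try:
            call()
            state["events"].append("succeeded")
            return state
        except TimeoutError as exc:  # treated as transient in this sketch
            state["last_error"] = repr(exc)
            if state["retry_count"] >= max_retries:
                state["events"].append("escalated")  # retries exhausted
                return state
            state["retry_count"] += 1
            state["events"].append("retried")
        except Exception as exc:  # anything else: no automatic retry
            state["last_error"] = repr(exc)
            state["events"].append("fallback")
            return state
```

The `events` list is the post-incident record: it shows how many attempts recovery cost and whether the run ended in success, fallback, or escalation.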

Classify Failures Before Retrying

Retries are useful only for failures that may succeed later. A network timeout, rate limit, or temporary backend outage may be retryable. A validation error, permission denial, missing required field, or unsafe action should not be retried automatically. Retrying the wrong failure wastes cost and can create repeated side effects.

Use different error classes for different behavior. Validation errors should route to clarification or developer fixes. Authorization errors should route to denial or permission request. Tool timeouts may route to retry, fallback, or deferred work. Model formatting errors may route to structured-output repair with a strict limit.

Retries need budgets. Configure maximum attempts, backoff, timeout, and stop reason. If a node keeps failing, the graph should preserve the error context and move to a safe state rather than hiding the failure inside an infinite loop.

For write actions, retry only with idempotency. If the graph cannot prove that retrying is safe, ask for human review or reconcile the external system first.

Retry transient failures, not policy failures.
Use separate error classes for validation, auth, timeout, and backend errors.
Set maximum attempts and backoff.
Preserve error context in state.
Use idempotency keys for retried write operations.
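
A retry budget is easier to review when the wait schedule is computed explicitly. The helper below is a sketch of capped exponential backoff; `backoff_delays` and its parameter names and defaults are chosen for illustration, and it omits jitter so the output is deterministic. Production schedules usually add random jitter so that many clients do not retry in lockstep.

```python
from typing import List


def backoff_delays(
    max_attempts: int,
    initial_interval: float = 0.5,
    backoff_factor: float = 2.0,
    max_interval: float = 8.0,
) -> List[float]:
    """Seconds to wait before each retry; the first attempt has no delay."""
    delays = []
    for retry_index in range(max_attempts - 1):
        delay = initial_interval * (backoff_factor ** retry_index)
        delays.append(min(delay, max_interval))  # cap the wait
    return delays
```

Summing the returned list gives the worst-case time a node spends waiting, which is the number to compare against any overall timeout for the run.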

Recovery Routes and User Experience

A graph should not simply crash when a node fails. It should route to recovery when possible: ask the user for missing input, call an alternate tool, use cached evidence, pause for review, or return a partial result with a clear limitation. Recovery behavior should be explicit in the graph design.

User-facing errors should be truthful and useful. "The workflow could not verify the invoice because the purchase-order system timed out" is better than "Something went wrong." The message should say what completed, what failed, and what the user can do next.

Use checkpoints to make failures resume-safe. If a process crashes after a node fails but before the handler finishes, the graph should be able to reconstruct the failure context and continue the recovery path. This is where durable execution becomes more than a convenience.

Evaluate failure handling with tests. Simulate tool timeouts, malformed model output, denied authorization, missing approval, and interrupted workers. A graph that only works when every dependency behaves is not production-ready.

Model recovery routes as first-class graph paths.
Return partial results when they are useful and safe.
Make failure messages specific without leaking sensitive internals.
Use checkpoints for resume-safe recovery.
Test every important failure path deliberately.
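
A user-facing failure message with the three parts described above (what completed, what failed, what to do next) can be assembled by a small formatter. This is a sketch only; `failure_message` and its parameters are hypothetical, and the caller is responsible for passing a `reason` that is safe to show to users.

```python
from typing import List


def failure_message(
    completed: List[str],
    failed_step: str,
    reason: str,
    next_action: str,
) -> str:
    """Build a specific message: what completed, what failed, what to do next."""
    done = ", ".join(completed) if completed else "nothing yet"
    return (
        f"Completed: {done}. "
        f"Could not {failed_step} because {reason}. "
        f"Next step: {next_action}."
    )
```

Requiring all three arguments makes it hard to fall back to a bare "Something went wrong."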

Failure Policy Exercise

Create a table of graph failures and decide the policy for each one: retry, clarify, deny, fallback, pause, or fail. Include rate limits, network timeouts, validation errors, authorization denials, malformed model output, missing user input, and external write uncertainty.

Then write tests for the policies. A retry policy is not real until a test proves it stops after the right number of attempts. A recovery route is not real until a test proves state is preserved and the user receives a useful result.

Finally, check cost and safety. Automatic retries can multiply model calls and tool calls quickly. Retrying a write operation without idempotency can create duplicate external effects.

Map each failure class to a recovery policy.
Test retry ceilings and fallback routes.
Preserve failure context in state.
Require idempotency for retried writes.
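
The exercise can start as a lookup table plus one test that pins the retry ceiling. The assignments below are one possible answer, not a recommendation for every system, and `FAILURE_POLICY`, `call_with_ceiling`, and `test_retry_stops_at_ceiling` are hypothetical names. Only `TimeoutError` is treated as retryable in this sketch.

```python
from typing import Callable, Dict, Tuple

# One possible policy assignment; adjust to your own risk tolerance.
FAILURE_POLICY: Dict[str, str] = {
    "rate_limit": "retry",
    "network_timeout": "retry",
    "validation_error": "clarify",
    "authorization_denied": "deny",
    "malformed_model_output": "retry",    # bounded repair attempts only
    "missing_user_input": "pause",
    "external_write_uncertain": "pause",  # reconcile or review before retrying
}


def call_with_ceiling(call: Callable[[], object], max_attempts: int) -> Tuple[int, bool]:
    """Try `call` up to max_attempts times; return (attempts made, succeeded)."""
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        try:
            call()
            return attempts, True
        except TimeoutError:
            continue  # retryable in this sketch; other errors propagate
    return attempts, False


def test_retry_stops_at_ceiling() -> None:
    calls = []

    def always_times_out() -> None:
        calls.append(1)
        raise TimeoutError("simulated timeout")

    attempts, succeeded = call_with_ceiling(always_times_out, max_attempts=3)
    assert (attempts, succeeded) == (3, False)
    assert len(calls) == 3  # the dependency was hit exactly three times
```

Counting calls to the simulated dependency, not just reading the returned attempt count, is what proves the ceiling also bounds cost.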

Failure Ownership

Classify failures before choosing recovery. Invalid state and programmer errors should fail fast; a rate limit or transient network error may retry; denied access should route to a safe terminal response; missing user input may interrupt; a permanent backend conflict may need compensation or human review.

Attach the retry policy to the smallest idempotent node and restrict the exception types it applies to. Bound attempts and backoff, record the attempt count, and keep external writes deduplicated. Retrying an entire graph because one API timed out can repeat completed model calls and tool calls unnecessarily.

Interrupts are control flow, not ordinary errors, and should bypass retry handling. When a superstep runs nodes in parallel and one of them fails, the writes from siblings that already completed can be checkpointed as pending writes, so a resume does not rerun the successful nodes. Test crash timing and inspect state history so recovery assumptions match actual checkpoints.

Map each failure class to retry, route, interrupt, compensate, or abort.
Retry only bounded idempotent work.
Preserve the original error and attempt history in traces.
Test partial superstep failure and resume behavior.
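
The points about restricting exception types and letting control flow bypass retries can be illustrated with a hand-rolled wrapper. This is a sketch under stated assumptions, not how LangGraph implements interrupts: `PauseForHuman` is a hypothetical stand-in for an interrupt signal, and `retry_node` is a hypothetical helper.

```python
from typing import Callable, List, Tuple, Type, TypeVar

T = TypeVar("T")


class PauseForHuman(Exception):
    """Hypothetical control-flow signal; stands in for an interrupt."""


def retry_node(
    call: Callable[[], T],
    retry_on: Tuple[Type[Exception], ...],
    max_attempts: int,
    attempt_log: List[Tuple[int, str]],
) -> T:
    """Retry `call` only for the listed exception types, within a fixed budget."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except PauseForHuman:
            raise  # control flow, not a failure: never retried, never logged
        except retry_on as exc:
            attempt_log.append((attempt, repr(exc)))  # keep attempt history
            if attempt == max_attempts:
                raise  # budget spent: surface the original error unchanged
    raise ValueError("max_attempts must be at least 1")
```

Exceptions outside `retry_on` propagate on the first attempt, so a validation error never consumes the retry budget.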

Failure Recovery Examples

Beginner Example: Track Retries in State

Even before adding framework retry helpers, make retry behavior visible in the state contract.

from typing_extensions import TypedDict

class ToolState(TypedDict):
    retry_count: int
    status: str

def record_failure(state: ToolState) -> dict:
    # Return a partial state update: bump the counter and mark the run as retrying.
    return {
        "retry_count": state["retry_count"] + 1,
        "status": "retrying",
    }

A retry counter makes loop behavior inspectable.
Status fields help dashboards and tests explain what happened.
Do not hide repeated failure behind silent recursion.

Intermediate Example: Route After Retries Exhaust

Recovery needs a final destination when more attempts are not worth making.

from typing_extensions import TypedDict, Literal

class RetryState(TypedDict):
    retry_count: int
    last_error: str

def route_after_failure(state: RetryState) -> Literal["retry_tool", "manual_review"]:
    # Allow at most two retries, then hand the run to manual review.
    if state["retry_count"] < 2:
        return "retry_tool"
    return "manual_review"

The fallback path is explicit.
This pattern works for APIs, extraction, and flaky searches.
Operators can see when automation gave up and why.

Advanced Example: Attach a RetryPolicy

LangGraph supports retry policies for transient failures; pair them with logging and fallback design rather than treating them as a magic fix.

from langgraph.types import RetryPolicy

retry_policy = RetryPolicy(
    max_attempts=3,
)

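# Attach the policy where the node is registered. In recent LangGraph releases
# this is a keyword argument on StateGraph.add_node (named `retry_policy`;
# older releases used `retry`). Check the signature in your installed version.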

Automatic retries should focus on transient failures.
A retry helper does not replace business-aware fallback routes.
Pair policy-driven retries with clear observability.

Before you move on