Production graph design is not about pretending failures will not happen. It is about deciding which failures deserve automatic retry, which deserve alternate routes, and which should stop the workflow immediately.
LangGraph makes those choices visible because failures can be handled at the node, route, and persistence layers instead of disappearing inside a generic agent loop.
This page focuses on the recovery mindset: classify failures first, then choose the smallest safe retry or fallback behavior that fits each class.
A timeout from a remote API is different from invalid state, which is different from a rejected human approval, which is different from a tool permission violation. Retrying all of them the same way wastes money and can make incidents worse.
Good recovery begins with a failure taxonomy: transient, validation, business-rule, external-system, and unknown.
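As a minimal sketch, the taxonomy can be written down as an enum plus one classification function. The exception-to-class mapping below is an assumption for illustration (`BusinessRuleError` and `ExternalSystemError` are hypothetical names, not LangGraph types); a real system would map its own client and tool exceptions.

```python
from enum import Enum


class FailureClass(Enum):
    TRANSIENT = "transient"
    VALIDATION = "validation"
    BUSINESS_RULE = "business_rule"
    EXTERNAL_SYSTEM = "external_system"
    UNKNOWN = "unknown"


class BusinessRuleError(Exception):
    """Hypothetical: a deterministic policy rejection."""


class ExternalSystemError(Exception):
    """Hypothetical: a dependency answered, but reported a failure."""


def classify_failure(exc: BaseException) -> FailureClass:
    """Map an exception to a failure class (illustrative mapping only)."""
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return FailureClass.TRANSIENT
    if isinstance(exc, (ValueError, KeyError)):
        return FailureClass.VALIDATION
    if isinstance(exc, BusinessRuleError):
        return FailureClass.BUSINESS_RULE
    if isinstance(exc, ExternalSystemError):
        return FailureClass.EXTERNAL_SYSTEM
    # Anything unrecognized stays UNKNOWN so it is never retried by accident.
    return FailureClass.UNKNOWN


def is_retryable(failure: FailureClass) -> bool:
    # Assumption: only transient failures qualify for automatic retry.
    return failure is FailureClass.TRANSIENT
```

Keeping `UNKNOWN` as an explicit class means a new kind of error defaults to escalation rather than to another attempt.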
A retry is justified when the next attempt has a meaningful chance of success and the side effects are safe. It is not justified when the error comes from invalid user data, a permanent permission issue, or a deterministic policy rejection.
This sounds obvious, but many agent systems quietly retry everything because it is easier to wire one generic loop than to think about failure semantics.
Current LangGraph docs cover timeout and error-handling primitives such as `RetryPolicy`. The important architectural point is not merely that a retry helper exists. It is that retries should align with state, observability, and eventual fallback routing.
A node-level retry can smooth over flaky infrastructure, but the graph should still record what happened so operators can distinguish a clean run from a self-healing one.
Retries are safe only when repeating the action is safe. Read-only lookups are usually fine. Charge creation, email sending, or refund issuance can duplicate side effects if you retry blindly.
That is why risky write actions need idempotency keys, confirmation checks, or human approval paths before automated retry enters the picture.
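One way to picture an idempotency key is as a stable hash of the operation and its payload, which the receiving system uses to return the earlier result instead of repeating the side effect. The sketch below uses a hypothetical in-memory `FakePaymentClient` in place of a real payment API; real providers define their own key format and retention rules.

```python
import hashlib
import json


def idempotency_key(operation: str, payload: dict) -> str:
    """Derive a stable key: same operation and payload give the same key."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{operation}:{canonical}".encode("utf-8")).hexdigest()


class FakePaymentClient:
    """Hypothetical in-memory stand-in for an external payment API."""

    def __init__(self) -> None:
        self._results = {}  # idempotency key -> stored result
        self.charges_created = 0

    def create_charge(self, key: str, payload: dict) -> dict:
        # A repeated key returns the stored result without a second charge.
        if key in self._results:
            return self._results[key]
        self.charges_created += 1
        result = {"charge_id": f"ch_{self.charges_created}", **payload}
        self._results[key] = result
        return result


client = FakePaymentClient()
payload = {"order_id": "A-1", "amount_cents": 500}
key = idempotency_key("create_charge", payload)
first = client.create_charge(key, payload)
retried = client.create_charge(key, payload)  # simulated automatic retry
```

With the key in place, the retried call returns the first result and `client.charges_created` stays at 1.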
A healthy recovery flow looks like this: a node fails, the error is classified, a limited retry occurs if appropriate, state records the attempt count, and the graph either succeeds, routes to fallback, or escalates. Nothing is hidden.
This structure gives you post-incident answers. You know not only that the run recovered, but also how much effort recovery cost and which boundary failed first.
Retries are useful only when a later attempt may succeed. A network timeout, rate limit, or temporary backend outage may be retryable. A validation error, permission denial, missing required field, or unsafe action should not be retried automatically. Retrying the wrong failure wastes money and can repeat side effects.
Use different error classes for different behavior. Validation errors should route to clarification or developer fixes. Authorization errors should route to denial or permission request. Tool timeouts may route to retry, fallback, or deferred work. Model formatting errors may route to structured-output repair with a strict limit.
Retries need budgets. Configure maximum attempts, backoff, and a timeout, and record a stop reason when the budget runs out. If a node keeps failing, the graph should preserve the error context and move to a safe state rather than hiding the failure inside an infinite loop.
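A retry budget can be sketched in plain Python, independent of any framework helper. The names `RetryBudget`, `RetryOutcome`, and `run_with_budget` are hypothetical, and per-attempt timeouts are assumed to be enforced inside the callable itself.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable, Optional, Tuple, Type


@dataclass
class RetryBudget:
    max_attempts: int = 3
    base_delay_s: float = 0.5
    max_delay_s: float = 8.0
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError)


@dataclass
class RetryOutcome:
    ok: bool
    attempts: int
    stop_reason: str  # "succeeded", "non_retryable", or "budget_exhausted"
    value: object = None
    last_error: Optional[str] = None


def run_with_budget(
    fn: Callable[[], object],
    budget: RetryBudget,
    sleep: Callable[[float], None] = time.sleep,
) -> RetryOutcome:
    last_error = None
    for attempt in range(1, budget.max_attempts + 1):
        try:
            return RetryOutcome(True, attempt, "succeeded", fn())
        except budget.retry_on as exc:
            last_error = repr(exc)
            if attempt < budget.max_attempts:
                # Exponential backoff, capped, with a little jitter.
                delay = min(budget.max_delay_s, budget.base_delay_s * 2 ** (attempt - 1))
                sleep(delay + random.uniform(0, delay / 10))
        except Exception as exc:
            # Anything outside retry_on stops immediately with its context.
            return RetryOutcome(False, attempt, "non_retryable", last_error=repr(exc))
    return RetryOutcome(False, budget.max_attempts, "budget_exhausted", last_error=last_error)
```

The outcome object carries the attempt count, stop reason, and last error, so a node can copy them into graph state instead of losing them when the loop ends.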
For write actions, retry only with idempotency. If the graph cannot prove that retrying is safe, ask for human review or reconcile the external system first.
A graph should not simply crash when a node fails. It should route to recovery when possible: ask the user for missing input, call an alternate tool, use cached evidence, pause for review, or return a partial result with a clear limitation. Recovery behavior should be explicit in the graph design.
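The alternate-tool and cached-evidence options can be sketched as an ordered list of sources tried in turn. The function name `first_available` and the result shape are assumptions for illustration; only timeouts and connection errors fall through to the next source, so validation or permission errors still surface immediately.

```python
from typing import Callable, Sequence, Tuple

# Assumption: only these failures justify moving on to the next source.
FALLBACK_ELIGIBLE = (TimeoutError, ConnectionError)


def first_available(sources: Sequence[Tuple[str, Callable[[], str]]]) -> dict:
    """Try each (name, fetch) pair in order and report which one answered."""
    errors = []
    for name, fetch in sources:
        try:
            value = fetch()
        except FALLBACK_ELIGIBLE as exc:
            errors.append(f"{name}: {exc}")
            continue
        # "degraded" tells downstream nodes that an earlier source failed.
        status = "degraded" if errors else "ok"
        return {"status": status, "source": name, "value": value, "errors": errors}
    # Every source failed: return a partial result with the limitation spelled out.
    return {"status": "partial", "source": None, "value": None, "errors": errors}
```

Returning the collected errors alongside the value keeps the recovery path visible in state rather than only in logs.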
User-facing errors should be truthful and useful. "The workflow could not verify the invoice because the purchase-order system timed out" is better than "Something went wrong." The message should say what completed, what failed, and what the user can do next.
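A small formatter can enforce that three-part shape. This is only a sketch; `build_user_message` and its parameters are hypothetical names.

```python
from typing import Sequence


def build_user_message(
    completed: Sequence[str], failed: str, reason: str, next_step: str
) -> str:
    """Say what completed, what failed and why, and what to do next."""
    done = ", ".join(completed) if completed else "nothing yet"
    return (
        f"Completed: {done}. "
        f"The workflow could not {failed} because {reason}. "
        f"Next step: {next_step}."
    )


message = build_user_message(
    completed=["invoice parsed"],
    failed="verify the invoice",
    reason="the purchase-order system timed out",
    next_step="retry in a few minutes",
)
```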
Use checkpoints to make failures resume-safe. If a process crashes after a node fails but before the handler finishes, the graph should be able to reconstruct the failure context and continue the recovery path. This is where durable execution becomes more than a convenience.
Evaluate failure handling with tests. Simulate tool timeouts, malformed model output, denied authorization, missing approval, and interrupted workers. A graph that only works when every dependency behaves is not production-ready.
Create a table of graph failures and decide the policy for each one: retry, clarify, deny, fallback, pause, or fail. Include rate limits, network timeouts, validation errors, authorization denials, malformed model output, missing user input, and external write uncertainty.
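Such a table can live in code as plain data. The policy assigned to each failure below is an illustrative assumption, not a recommendation from the LangGraph docs; the point is that each decision is written down and that unlisted failures default to `fail`.

```python
from typing import Dict, Literal

Policy = Literal["retry", "clarify", "deny", "fallback", "pause", "fail"]

# Illustrative assignments; adjust each row to your own system.
FAILURE_POLICIES: Dict[str, Policy] = {
    "rate_limit": "retry",
    "network_timeout": "retry",
    "validation_error": "clarify",
    "authorization_denied": "deny",
    "malformed_model_output": "fallback",
    "missing_user_input": "clarify",
    "external_write_uncertain": "pause",
}


def policy_for(failure: str) -> Policy:
    # Unknown failures stop the workflow instead of being retried blindly.
    return FAILURE_POLICIES.get(failure, "fail")
```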
Then write tests for the policies. A retry policy is not real until a test proves it stops after the right number of attempts. A recovery route is not real until a test proves state is preserved and the user receives a useful result.
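A test of that kind can be very small. The sketch below uses a hypothetical `call_with_retries` helper and a counting fake tool that always times out; in a real project the same assertion would target your own retry wrapper or node.

```python
def call_with_retries(fn, max_attempts: int):
    """Hypothetical helper: retry on TimeoutError, re-raise on the last attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TimeoutError:
            if attempt == max_attempts:
                raise


class CountingTool:
    """Fake tool that always times out and counts how often it was called."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self) -> str:
        self.calls += 1
        raise TimeoutError("simulated timeout")


def test_retry_stops_at_budget() -> None:
    tool = CountingTool()
    try:
        call_with_retries(tool, max_attempts=3)
    except TimeoutError:
        pass
    else:
        raise AssertionError("expected the final timeout to propagate")
    # The tool must be called exactly max_attempts times, no more.
    assert tool.calls == 3
```

The function follows the `test_` naming convention, so a runner such as pytest would collect it without extra setup.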
Finally, check cost and safety. Automatic retries can multiply model calls and tool calls quickly. Retrying a write operation without idempotency can create duplicate external effects.
Even before adding framework retry helpers, make retry behavior visible in the state contract.
```python
from typing_extensions import TypedDict


class ToolState(TypedDict):
    retry_count: int
    status: str


def record_failure(state: ToolState) -> dict:
    # Return a partial state update: bump the counter and mark the run as retrying.
    return {
        "retry_count": state["retry_count"] + 1,
        "status": "retrying",
    }
```
Recovery needs a final destination when more attempts are not worth making.
```python
from typing_extensions import Literal, TypedDict


class RetryState(TypedDict):
    retry_count: int
    last_error: str


def route_after_failure(state: RetryState) -> Literal["retry_tool", "manual_review"]:
    # Allow up to two retries, then hand off to a human.
    if state["retry_count"] < 2:
        return "retry_tool"
    return "manual_review"
```
LangGraph supports retry policies for transient failures; pair them with logging and fallback design rather than treating them as a magic fix.
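```python
from langgraph.types import RetryPolicy

# Cap the number of attempts so a flaky dependency cannot loop forever.
retry_policy = RetryPolicy(
    max_attempts=3,
)

# Attach the policy when you add the node to the graph builder.
# The exact keyword argument depends on the LangGraph version you are using,
# so check the API reference for your release.
```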
Automatic retries are sometimes appropriate, especially for transient provider issues, but not without limits and observability.
Fallback behavior belongs in explicit graph logic so operators can see the recovery path rather than infer it from logs alone.
Retry design has gone wrong when you cannot explain which failures are retriable and why the graph decided to try again.