LangGraph Memory and Checkpoints: Threads, Persistence, and Durable Runs

Short-Term State Versus Persistent Checkpoints

Memory in LangGraph is not one thing. There is the live state flowing through a single run, the persisted checkpoints that let that run pause and resume, and any longer-lived memory you store across many runs or threads.

Understanding those layers is essential because they solve different problems. Short-term state helps the graph keep context. Checkpoints help the runtime survive time, failure, and human pauses. Long-term memory helps the application remember beyond one thread.

This page will help you separate those concerns so your graphs remain durable without becoming bloated.

A graph state exists while a run is active. A checkpoint is a saved snapshot of that state at execution boundaries so the runtime can resume later. Official docs describe these snapshots as being organized under threads, which is why `thread_id` matters so much.

If you do not persist checkpoints, you can still run graphs. You just lose durable pause-resume behavior, time travel, and many human-in-the-loop patterns.

State is the current in-flight working memory.
Checkpoints are saved snapshots of that state.
Threads identify which conversation or workflow instance the runtime should load.

What Persistence Enables

Persistence turns graphs from request-bound pipelines into long-lived systems. A user can leave and return. A human reviewer can approve later. A worker can restart without forgetting the workflow position.

It also unlocks debugging capabilities such as inspecting state history or replaying from earlier checkpoints.

Human review and interrupts
Crash recovery
Conversation continuity
Time-travel style debugging

Choosing Between In-Memory and Durable Stores

In-memory persistence is excellent for local development and tests because it removes infrastructure overhead, but it is not durable across process restarts. Production deployments should use a persistent backend so checkpoints survive service restarts and remain reachable from every worker in a distributed deployment.

The practical rule is simple: if losing the process means losing business context, you need durable checkpoint storage.

In-memory saver for tutorials and tests
Database-backed checkpointing for real services
Separate local convenience from production guarantees

Long-Term Memory Is Broader Than Checkpoints

Checkpoint persistence is thread-scoped execution memory. Long-term memory is application memory that can outlive a thread and be reused later, such as customer preferences, prior decisions, semantic summaries, or retrieved knowledge.

Store those durable facts outside the graph state, then read them into state when a new run needs them. That keeps the graph state focused while still letting the application remember.

Use state for current run context.
Use long-term stores for reusable facts across runs.
Load only the memory needed for the current decision.

State Evolution and Backward Compatibility

Persisted threads create a schema evolution problem. If you change the state shape after deployment, older checkpoints may resume into newer graph code. That means your graph changes must be treated like a compatibility-sensitive API change.

Production teams should version important state changes, add migration logic when needed, and test resume behavior rather than assuming new code will fit old checkpoints cleanly.

Avoid casual renames of critical state fields.
Test resumed runs after schema changes.
Document which fields are safe to add, deprecate, or transform.
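
One possible shape for that migration logic is a pure function that upgrades old state values before new code reads them. The `schema_version` field and the `customer` to `customer_id` rename below are invented for the example. Where the function runs, such as in the first node after resume, is left as a design choice.

```python
from typing import Any, Dict

CURRENT_SCHEMA_VERSION = 2  # hypothetical version counter kept in state

def migrate_state(values: Dict[str, Any]) -> Dict[str, Any]:
    """Upgrade state values saved by older graph code to the current shape."""
    migrated = dict(values)  # never mutate the loaded snapshot in place
    version = migrated.get("schema_version", 1)  # v1 states carry no version

    if version < 2:
        # v1 -> v2: "customer" was renamed and "priority" was added.
        if "customer" in migrated:
            migrated["customer_id"] = migrated.pop("customer")
        migrated.setdefault("priority", "normal")
        version = 2

    migrated["schema_version"] = version
    return migrated

old_values = {"customer": "c-1", "draft": "hi"}
new_values = migrate_state(old_values)
print(new_values)
```

Running the function twice yields the same result, so it is safe to apply on every resume.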

Checkpoints Are Execution History

A LangGraph checkpoint is more than saved memory. It is a durable snapshot of execution state that makes resume, replay, time travel, and debugging possible. Treat checkpoints as part of the application data model whenever users depend on long-running or interruptible workflows.

Thread identifiers should be stable, tenant-safe, and meaningful enough for operations. If the same user has multiple workflows, each should have a distinct thread or namespace strategy. Accidentally reusing thread IDs can mix state between tasks, which is both confusing and dangerous.
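
A thread ID strategy along these lines could be sketched as follows. The four-part layout and the helper names are assumptions, not a LangGraph convention. The ID only selects a thread, so access checks still have to happen elsewhere.

```python
import re

_SAFE_PART = re.compile(r"[A-Za-z0-9_-]+")

def build_thread_id(tenant_id: str, user_id: str, workflow: str, instance_id: str) -> str:
    """Compose a thread ID that keeps tenants, users, and workflows apart."""
    parts = (tenant_id, user_id, workflow, instance_id)
    for part in parts:
        # Reject empty parts and the separator so IDs cannot collide by accident.
        if not _SAFE_PART.fullmatch(part):
            raise ValueError(f"unsafe thread ID part: {part!r}")
    return ":".join(parts)

def thread_config(tenant_id: str, user_id: str, workflow: str, instance_id: str) -> dict:
    """Build the config dict that selects this thread on invoke."""
    thread_id = build_thread_id(tenant_id, user_id, workflow, instance_id)
    return {"configurable": {"thread_id": thread_id}}

refund_cfg = thread_config("acme", "u-17", "refund", "0001")
onboarding_cfg = thread_config("acme", "u-17", "onboarding", "0001")
print(refund_cfg, onboarding_cfg)
```

The same user gets two distinct threads because the workflow name is part of the ID.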

Checkpoint data needs retention rules. Some workflows need short-lived state; others need audit history. Sensitive state should be minimized, encrypted where appropriate, and deleted according to policy. Durable execution does not mean keeping everything forever.

Use time travel and replay for debugging, but understand side effects. Replaying a node that sends email or updates a ticket can be unsafe unless side effects are isolated, idempotent, or mocked during replay.

Treat checkpoints as durable workflow state.
Use safe thread ID strategies.
Define retention and deletion policies.
Protect sensitive state in checkpoint storage.
Isolate side effects from replay-sensitive nodes.
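
To make the replay warning concrete, here is one way to guard a side effect with an idempotency key. `EmailOutbox` is a hypothetical class, and its dictionary stands in for a durable table that would survive restarts.

```python
import hashlib
import json

class EmailOutbox:
    """Toy outbox that refuses to send the same logical email twice."""

    def __init__(self) -> None:
        self.sent = {}  # stand-in for a durable table keyed by idempotency key

    def send_once(self, thread_id: str, step: str, payload: dict) -> bool:
        # The key is derived from the thread, the step, and the content,
        # so a replay of the same node produces the same key.
        raw = json.dumps([thread_id, step, payload], sort_keys=True)
        key = hashlib.sha256(raw.encode("utf-8")).hexdigest()
        if key in self.sent:
            return False  # already delivered, skip the side effect
        self.sent[key] = payload  # a real implementation would send here
        return True

outbox = EmailOutbox()
payload = {"to": "user@example.com", "subject": "Ticket updated"}
first = outbox.send_once("thread-1", "notify", payload)
replayed = outbox.send_once("thread-1", "notify", payload)
print(first, replayed, len(outbox.sent))
```

A node that calls `send_once` can be re-executed during replay without delivering a duplicate.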

Memory Versus Checkpoint Persistence

Checkpoint persistence and user memory solve different problems. Checkpoints let a run continue. Memory helps future runs use relevant facts or preferences. Do not confuse a checkpoint with a long-term memory store. A checkpoint may contain temporary observations that should not influence unrelated future tasks.

When you add long-term memory to LangGraph, make memory retrieval an explicit node or task. That node should apply relevance, permission, recency, and confidence filters. It should also record which memories entered state so later debugging can explain the answer.

Memory updates should happen at controlled points, often after a successful run or explicit user confirmation. Storing memory mid-run can preserve false assumptions from incomplete work.

Schema evolution affects both checkpoints and memory. If state or memory structures change, write migration logic or compatibility adapters. Otherwise, old runs and old memories can fail or be misread under new code.

Use checkpoints for current run continuity.
Use memory stores for cross-run reusable facts.
Retrieve memory through explicit, traceable nodes.
Update memory conservatively after confirmation or success.
Plan migrations for state and memory schema changes.
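
The retrieval step described above might look like the sketch below. `MemoryRecord`, its fields, and the default thresholds are illustrative assumptions. The function returns a state update that a graph node could emit, including a trace of which memories were used.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List

@dataclass(frozen=True)
class MemoryRecord:
    memory_id: str
    owner_id: str
    topic: str
    text: str
    confidence: float
    updated_at: datetime

def select_memories(records: List[MemoryRecord], *, user_id: str, topic: str,
                    now: datetime, max_age: timedelta = timedelta(days=90),
                    min_confidence: float = 0.7, limit: int = 3) -> dict:
    """Apply permission, relevance, recency, and confidence filters."""
    eligible = [
        r for r in records
        if r.owner_id == user_id              # permission
        and r.topic == topic                  # relevance
        and now - r.updated_at <= max_age     # recency
        and r.confidence >= min_confidence    # confidence
    ]
    eligible.sort(key=lambda r: r.updated_at, reverse=True)
    chosen = eligible[:limit]
    return {
        "memories": [r.text for r in chosen],
        "memory_trace": [r.memory_id for r in chosen],  # for later debugging
    }

now = datetime(2025, 1, 1, tzinfo=timezone.utc)
records = [
    MemoryRecord("m1", "u1", "billing", "Prefers invoices by email", 0.9, now - timedelta(days=10)),
    MemoryRecord("m2", "u2", "billing", "Another user's note", 0.9, now - timedelta(days=1)),
    MemoryRecord("m3", "u1", "billing", "Unverified guess", 0.4, now - timedelta(days=1)),
    MemoryRecord("m4", "u1", "billing", "Stale address", 0.8, now - timedelta(days=400)),
    MemoryRecord("m5", "u1", "billing", "Pays annually", 0.95, now - timedelta(days=2)),
]
update = select_memories(records, user_id="u1", topic="billing", now=now)
print(update)
```

Because `memory_trace` lands in state, it is checkpointed with the run and can explain an answer later.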

Checkpoint Safety Exercise

Create a run, pause it, inspect the checkpoint, resume it, then replay from an earlier state. This hands-on exercise makes durable execution concrete. It also reveals whether state fields are understandable outside the code path that produced them.

Next, simulate a process crash after a node completes and before the user sees the result. The graph should resume or report failure without losing important state. If it repeats a side effect, the node boundary needs redesign.

Finally, review retention. Checkpoints may contain sensitive user input, retrieved context, and tool outputs. Decide what must be retained, what can expire, and what should be redacted.

Practice pause, resume, and replay.
Crash-test side-effect boundaries.
Inspect checkpoint readability.
Define retention and redaction rules.

Thread Persistence

A compiled graph with a checkpointer saves a state snapshot at each superstep under a thread ID. The latest snapshot exposes channel values, next nodes, configuration, metadata, parent checkpoint, and task or interrupt information. Use `get_state` for the current snapshot and `get_state_history` when diagnosing how the thread reached it.

Checkpoints are thread-scoped working history. The Store interface serves information that must be shared across threads, such as an explicitly saved user preference. Keep those lifecycles separate and authorize both thread access and store namespaces. Do not use a predictable thread ID as the only access control.

Replay from a prior checkpoint skips work before that point and re-executes later nodes, including external calls and interrupts. `update_state` creates a new checkpoint and applies normal reducer behavior; it does not rewrite history. Design side effects to survive replay, encrypt sensitive persistence, cap retention, and test serializer compatibility before deploying new state types.

Monitor checkpoint count and serialized size per thread. Archive or delete completed threads according to policy, compact unbounded message channels, and verify that removal reaches indexes, caches, and backups as required. Persistence without a lifecycle becomes a cost and privacy problem.

Scope every checkpoint operation to an authorized thread.
Use Store only for deliberate cross-thread memory.
Expect nodes after a replay point to execute again.
Version, encrypt, retain, and delete persisted state intentionally.

Checkpoint and Memory Examples

Beginner Example: Persist a Simple Thread with InMemorySaver

This is the smallest persistence example worth learning: compile with a checkpointer and invoke using a `thread_id`.

from typing_extensions import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import InMemorySaver

class CounterState(TypedDict):
    count: int

def increment(state: CounterState) -> dict:
    return {"count": state["count"] + 1}

builder = StateGraph(CounterState)
builder.add_node("increment", increment)
builder.add_edge(START, "increment")
builder.add_edge("increment", END)

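# The checkpointer is attached at compile time; InMemorySaver keeps snapshots in process memory.
graph = builder.compile(checkpointer=InMemorySaver())
# The thread_id selects which saved execution history this call reads and writes.
config = {"configurable": {"thread_id": "counter-1"}}
print(graph.invoke({"count": 0}, config=config))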

Persistence is attached at compile time through the checkpointer.
The `thread_id` identifies which execution history to use.
This pattern is the foundation for conversation memory and interrupts.

Intermediate Example: Resume an Interrupted Review

Interrupts rely on persisted checkpoints so the graph can pause and later continue from the same thread.

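from typing_extensions import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import InMemorySaver
from langgraph.types import interrupt, Command

class ApprovalState(TypedDict):
    draft: str
    approved: bool

def human_gate(state: ApprovalState) -> dict:
    # interrupt() pauses the run; on resume it returns the supplied value.
    approved = interrupt("Approve this draft?")
    return {"approved": bool(approved)}

builder = StateGraph(ApprovalState)
builder.add_node("human_gate", human_gate)
builder.add_edge(START, "human_gate")
builder.add_edge("human_gate", END)
graph = builder.compile(checkpointer=InMemorySaver())

config = {"configurable": {"thread_id": "review-1"}}
graph.invoke({"draft": "Hello", "approved": False}, config=config)  # pauses at interrupt()
print(graph.invoke(Command(resume=True), config=config))  # resumes the same thread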

Interrupts require a checkpointer: without one, the runtime has nowhere to save the paused state, so the graph cannot pause and resume.
Resuming uses the same thread identity.
The resumed value becomes the return value of `interrupt()` inside the node.

Advanced Example: Separate Long-Term Memory From Run State

Keep persistent application facts outside the graph state and inject only what the current run needs.

from typing_extensions import TypedDict

class SupportState(TypedDict):
    customer_id: str
    preferences: dict
    latest_request: str
    reply: str

def load_customer_profile(state: SupportState) -> dict:
    # Pretend this reads from a durable store or database, keyed by customer_id.
    profile = {"language": "en", "refund_tier": "gold"}
    # Return only the fields this run needs; everything else stays in the store.
    return {"preferences": profile}

This is long-term memory usage, not checkpointing.
The graph pulls durable facts into state when needed.
That keeps thread state lean while still enabling personalization.

Before you move on