Elena's AI Blog

Idempotency Run Isolation And Crash Safety

12 Mar 2026 / 7 minutes to read

Elena Daehnhardt

Image credit: Illustration created with Midjourney, prompt by the author.
Image prompt

“An illustration representing cloud computing”


Post 7 is where we stop thinking like tutorial writers and start thinking like operators.

We’re going to make your newsletter workflow:

  • crash-safe
  • idempotent
  • per-run isolated
  • deterministic
  • resumable

And then we’ll generalise it so readers can apply it to any AI workflow.

Let’s go.



Production Hardening: Idempotency, Run Isolation, and Crash Safety

So far, your system works.

But production systems are not tested by success.

They are tested by failure.

Let’s ask uncomfortable questions:

  • What if the container crashes mid-run?
  • What if Slack sends the same approval twice?
  • What if the user clicks Approve twice?
  • What if we restart Docker during an interrupt?
  • What if two runs share the same output folder?

Right now, your system might survive these.

After this post, it will handle them deliberately.


Step 1 — Per-Run Artifact Isolation

Currently, we write to:

out/
  newsletter.md
  report.json

That’s fine for a tutorial.

It’s dangerous in production.

Instead, we isolate per thread_id:

out/
  newsletter-001/
    newsletter.md
    report.json
    report.md
  newsletter-002/
    ...

Update Artifact Path Logic

In app/server.py, and anywhere else that writes files:

Replace:

ARTIFACT_DIR = Path("out")

With:

from pathlib import Path

BASE_ARTIFACT_DIR = Path("out")

def get_run_dir(thread_id: str) -> Path:
    # One folder per run: created on first use, reused when the run resumes.
    run_dir = BASE_ARTIFACT_DIR / thread_id
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir

Then when writing files:

run_dir = get_run_dir(thread_id)
(run_dir / "newsletter.md").write_text(...)
(run_dir / "report.json").write_text(...)
(run_dir / "report.md").write_text(...)

Now each run writes only inside its own folder, so runs with different thread_ids cannot overwrite each other's files.
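Isolation protects runs from each other, but a crash in the middle of a write can still leave a half-written file inside a run folder. Here is a minimal sketch of one way to handle that, assuming the artifacts are small text files: write to a temporary file in the same directory, then rename it into place with os.replace. The helper name write_atomic is hypothetical and not part of the series code.

```python
import os
import tempfile
from pathlib import Path


def write_atomic(path: Path, text: str) -> None:
    """Write text so readers see either the old file or the complete new one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_name, path)  # swap the finished file into place
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

With a helper like this, the write becomes write_atomic(run_dir / "newsletter.md", text). Calling it again with the same content leaves the same file behind, which also fits the idempotency goal.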


Step 2 — Idempotency Protection

Idempotency means:

Repeating the same action any number of times leaves the system in the same state as doing it once.

We apply it to:

  • Slack approval resume
  • Finalization
  • File writes

Add a Finalized Flag

In your state:

class EditorialState(TypedDict, total=False):
    ...
    finalized: bool

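In node_finalize_report():

# Guard: a repeated call (for example, a duplicate Slack resume) is a no-op.
if state.get("finalized"):
    return state

# Set the flag once the report is written, and return it from the node so it is checkpointed.
state["finalized"] = True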

Now if Slack triggers resume twice:

  • The report is not regenerated or overwritten.
  • Side effects (file writes, notifications) are not repeated.
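To see the guard in action without running the whole graph, here is a self-contained sketch. It uses a plain dict in place of the full EditorialState and a hypothetical side_effects list standing in for file writes and notifications. The real node_finalize_report will do more than this.

```python
from typing import List, TypedDict


class EditorialState(TypedDict, total=False):
    finalized: bool


side_effects: List[str] = []  # hypothetical stand-in for file writes, Slack posts, etc.


def node_finalize_report(state: EditorialState) -> EditorialState:
    # Guard: a repeated call returns the state untouched and skips side effects.
    if state.get("finalized"):
        return state
    side_effects.append("report written")
    # Return a new dict with the flag set rather than mutating the input.
    return {**state, "finalized": True}


state: EditorialState = {}
state = node_finalize_report(state)  # first call does the work
state = node_finalize_report(state)  # duplicate call is a no-op
print(side_effects)  # ['report written']
```

The side effect runs once even though the node runs twice.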

Step 3 — Deterministic Thread IDs

Avoid random IDs for runs you may need to resume: a restarted process cannot recompute them.

Instead, derive thread_id from:

  • newsletter date
  • slug
  • a hash of the scheduled timestamp (not the current time, which changes on every retry)

Example:

import hashlib

def generate_thread_id(intro: str) -> str:
    # Only the first 50 characters are hashed, so identical openings give identical IDs.
    base = intro[:50]
    digest = hashlib.sha256(base.encode("utf-8")).hexdigest()[:8]
    return f"newsletter-{digest}"

Now:

  • Same intro → same thread_id
  • Restart-safe: the ID can be recomputed instead of remembered
  • Resume-safe: a resume request finds the same checkpoint
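Hashing only the opening of the intro means two issues that begin with the same sentence would share a thread_id. Here is a sketch of the date-plus-slug option from the list above, assuming each issue has a publication date and a title. The function name and the slug rules are illustrative, not part of the series code.

```python
import hashlib
import re
from datetime import date


def thread_id_for_issue(issue_date: date, title: str) -> str:
    """Derive a stable, filesystem-safe thread_id from the issue date and title."""
    # Lowercase, then collapse anything that is not a letter or digit into single hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    key = f"{issue_date.isoformat()}:{slug}"
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:8]
    return f"newsletter-{issue_date.isoformat()}-{digest}"


first = thread_id_for_issue(date(2026, 3, 12), "Production Hardening!")
again = thread_id_for_issue(date(2026, 3, 12), "production hardening")
other = thread_id_for_issue(date(2026, 3, 19), "Production Hardening!")
print(first == again, first == other)  # True False
```

Putting the date in the ID also makes the run folders sort chronologically.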

Step 4 — Crash Recovery Test

Let’s simulate:

  1. Start run
  2. It reaches Slack interrupt
  3. Kill container
  4. Restart Docker
  5. Click Approve in Slack

Because we use:

  • SQLite checkpointer
  • Stable thread_id
  • Persistent checkpoints.db

LangGraph can load the last checkpoint for that thread_id and continue from the interrupt.

That is production-level behaviour.
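The reason this works can be shown without LangGraph. The sketch below is not LangGraph's checkpointer. It is a toy stand-in built on Python's sqlite3 module, with a hypothetical checkpoints table, and it illustrates one idea: state written to a file on disk under a stable thread_id can be read back by a new process after the old one dies.

```python
import json
import sqlite3
import tempfile
from pathlib import Path
from typing import Optional

# A throwaway location for the demo; in practice this is your persistent checkpoints.db.
DB_PATH = Path(tempfile.mkdtemp()) / "checkpoints.db"


def connect() -> sqlite3.Connection:
    # Every call opens a fresh connection, the way a restarted process would.
    conn = sqlite3.connect(str(DB_PATH))
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints "
        "(thread_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
    )
    return conn


def save_checkpoint(thread_id: str, state: dict) -> None:
    conn = connect()
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT OR REPLACE INTO checkpoints (thread_id, state) VALUES (?, ?)",
                (thread_id, json.dumps(state)),
            )
    finally:
        conn.close()


def load_checkpoint(thread_id: str) -> Optional[dict]:
    conn = connect()
    try:
        row = conn.execute(
            "SELECT state FROM checkpoints WHERE thread_id = ?", (thread_id,)
        ).fetchone()
    finally:
        conn.close()
    return json.loads(row[0]) if row else None


save_checkpoint("newsletter-9f3a2c1b", {"step": "awaiting_approval"})
# ... imagine the container is killed and restarted here ...
resumed = load_checkpoint("newsletter-9f3a2c1b")
print(resumed)  # {'step': 'awaiting_approval'}
```

In Docker, the database file needs to live on a volume or bind mount. Otherwise removing the container removes the checkpoints with it.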


Step 5 — Prevent Double Resume

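In the /slack/actions endpoint, add a guard before resuming:

# Load the saved state for this thread first (`graph` is assumed to be the compiled LangGraph graph).
state = graph.get_state({"configurable": {"thread_id": thread_id}}).values
if state.get("finalized"):
    return {"ok": True, "message": "Run already finalized."}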

This makes Slack button clicks safe to repeat.
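The finalized check covers repeats that arrive after the first resume has completed, but two clicks that land close together can both read the flag before either one sets it. One possible way to close that window is to let a database primary key decide which request wins, sketched here under the assumption that the server can keep a small SQLite table of handled actions. The table name, key format, and claim_action helper are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory for the demo; use a file in practice
conn.execute("CREATE TABLE handled_actions (action_key TEXT PRIMARY KEY)")


def claim_action(action_key: str) -> bool:
    """Return True for the first caller with this key, False for every repeat."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO handled_actions (action_key) VALUES (?)", (action_key,)
            )
        return True
    except sqlite3.IntegrityError:
        # The primary key already exists: this action was claimed before.
        return False


key = "newsletter-9f3a2c1b:approve"
print(claim_action(key), claim_action(key))  # True False
```

The endpoint would resume the graph only when claim_action returns True, and answer every repeat with the "already handled" response.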


Step 6 — Failure Classification

Update report:

state["report"]["status"] = (
    "approved"
    if state.get("human_approved")
    else "rejected"
)

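The report should end with exactly one of three statuses:

  • approved
  • rejected
  • max_revisions_exceeded

The snippet above only distinguishes the first two; max_revisions_exceeded needs its own branch, driven by the revision counter from the retry loop.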

That clarity matters.
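Here is a sketch of a three-way classification, assuming the state carries a revision counter. The key name revision_count, the default limit of 3, and the classify_status helper are assumptions for illustration; substitute whatever your retry loop already tracks.

```python
from typing import Any, Dict


def classify_status(state: Dict[str, Any], max_revisions: int = 3) -> str:
    """Map the final state to one of the three report statuses."""
    if state.get("human_approved"):
        return "approved"
    # revision_count is an assumed key; use the counter your retry loop keeps.
    if state.get("revision_count", 0) >= max_revisions:
        return "max_revisions_exceeded"
    return "rejected"


print(classify_status({"human_approved": True}))  # approved
print(classify_status({"revision_count": 3}))     # max_revisions_exceeded
print(classify_status({"revision_count": 1}))     # rejected
```

Approval is checked first, so a run approved on its final allowed revision is still reported as approved.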


Newsletter Example: What Changed

Before:

newsletter.md (shared)

After:

out/
  newsletter-9f3a2c1b/
    newsletter.md
    report.json
    report.md

Now you can:

  • rerun safely
  • compare runs
  • archive by date
  • audit history

Generalising This Pattern

Everything we just did applies to:

  • AI code generation workflows
  • Customer support triage agents
  • Document summarisation pipelines
  • Security scanning workflows
  • Data enrichment pipelines

Production rules for any AI workflow:

  1. Every run has a stable ID
  2. Every run has isolated artifacts
  3. Every external action is idempotent
  4. Resume must be safe
  5. Crashes must not corrupt state
  6. Logs must survive restarts

If you implement these, you are no longer experimenting.

You are operating.


What You Have Now

Your system now includes:

  ✔ Worker
  ✔ Supervisor
  ✔ Retry loop
  ✔ Human approval
  ✔ Interrupt & resume
  ✔ MCP tool isolation
  ✔ Per-run artifact folders
  ✔ Idempotent finalize
  ✔ Crash safety

This is professional-level orchestration.


What’s Next

Now that your system is stable, we need visibility.

Next post:

Observability and Structured Logging

We will add:

  • Structured JSON logs
  • Correlation IDs
  • Execution timing
  • Error categories
  • Run summaries

Because production systems without observability are blind.

And blind systems fail quietly.


You’re not just building agents anymore.

You’re building operational AI infrastructure.



About Elena

Elena, a PhD in Computer Science, simplifies AI concepts and helps you use machine learning.

Citation
Elena Daehnhardt. (2026) 'Idempotency Run Isolation And Crash Safety', daehnhardt.com, 12 March 2026. Available at: https://daehnhardt.com/blog/2026/03/12/idempotency-run-isolation-and-crash-safety/