---
layout: post
title: "Production Hardening: Idempotency, Run Isolation, and Crash Safety"
date: 2026-02-18
lastmod: 2026-02-18
published: false
image: "https://daehnhardt.com/images/ai_art/flux/langgraph-production-hardening.jpg"
image_title: "Editorial illustration of a workflow system with checkpoints, folders per run, and recovery arrows, modern clean design, calm but structured, box format"
thumb_image: "https://daehnhardt.com/images/thumbnails/langgraph-production-hardening.jpg"
tags:
- AI
- Python
- Automation
- Infrastructure
- Security
- Series
keywords: "LangGraph production hardening, idempotency in AI workflows, crash recovery AI systems, thread isolation LangGraph"
---
So far, your system works.
But production systems are not tested by success.
They are tested by failure.
Let’s ask uncomfortable questions:
- What if the container crashes mid-run?
- What if Slack sends the same approval twice?
- What if the user clicks Approve twice?
- What if we restart Docker during an interrupt?
- What if two runs share the same output folder?
Right now, your system might survive these.
After this post, it will handle them deliberately.
## Step 1 — Per-Run Artifact Isolation

Currently, we write to:

```text
out/
  newsletter.md
  report.json
```
That’s fine for a tutorial.
In production it’s dangerous: concurrent or repeated runs silently overwrite each other’s output.
Instead, we isolate per `thread_id`:

```text
out/
  newsletter-001/
    newsletter.md
    report.json
  newsletter-002/
    ...
```
### Update Artifact Path Logic

In `app/server.py` and the file-writing sections, replace:

```python
ARTIFACT_DIR = Path("out")
```

with:

```python
from pathlib import Path

BASE_ARTIFACT_DIR = Path("out")


def get_run_dir(thread_id: str) -> Path:
    """Return the artifact folder for one run, creating it if needed."""
    run_dir = BASE_ARTIFACT_DIR / thread_id
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir
```

Then, when writing files:

```python
run_dir = get_run_dir(thread_id)
(run_dir / "newsletter.md").write_text(...)
(run_dir / "report.json").write_text(...)
(run_dir / "report.md").write_text(...)
```
Now every run is isolated.
## Step 2 — Idempotency Protection

Idempotency means that repeating the same action produces the same outcome, with no additional side effects.
We apply it to:
- Slack approval resume
- Finalization
- File writes
### Add a Finalized Flag

In your state:

```python
class EditorialState(TypedDict, total=False):
    ...
    finalized: bool
```

In `node_finalize_report()`, guard at the top and set the flag once the work is done:

```python
if state.get("finalized"):
    return state

# ... finalization work ...

state["finalized"] = True
```
Now if Slack triggers the resume twice:
- The node returns immediately instead of regenerating or overwriting output.
- Side effects don’t run a second time.
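For context, here is a minimal sketch of the whole guarded node. The report write is an assumed side effect, and pulling `thread_id` out of the node config is one way to reach the Step 1 helper; adapt both to your actual node body:

```python
import json

from langchain_core.runnables import RunnableConfig


def node_finalize_report(state: EditorialState, config: RunnableConfig) -> EditorialState:
    # Idempotency guard: a second resume becomes a no-op.
    if state.get("finalized"):
        return state

    # Hypothetical side effect; replace with your real finalization work.
    run_dir = get_run_dir(config["configurable"]["thread_id"])
    (run_dir / "report.json").write_text(json.dumps(state.get("report", {}), indent=2))

    # Set the flag only after the side effects succeed, so a crash
    # mid-finalize leaves the run resumable rather than half-done.
    return {**state, "finalized": True}
```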
## Step 3 — Deterministic Thread IDs

Never rely on random IDs in production: a restarted process can’t reconstruct a random ID, so it can never find the checkpoint it needs to resume.
Instead, derive `thread_id` from stable inputs, such as:
- the newsletter date
- a slug
- a hash of the content
Example:

```python
import hashlib


def generate_thread_id(intro: str) -> str:
    # Hash the first 50 characters of the intro into a short, stable suffix.
    base = intro[:50]
    digest = hashlib.sha256(base.encode()).hexdigest()[:8]
    return f"newsletter-{digest}"
```
Now:
- Same intro → same thread_id
- Restart safe
- Resume safe
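In LangGraph, that ID is the checkpoint key. Wiring it in looks like this, assuming `graph` is your compiled app from the earlier posts (the initial-state shape is an assumption):

```python
thread_id = generate_thread_id(intro)
config = {"configurable": {"thread_id": thread_id}}

# The same intro always maps to the same checkpoint thread,
# so a restarted process resumes instead of starting over.
result = graph.invoke({"intro": intro}, config=config)
```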
## Step 4 — Crash Recovery Test

Let’s simulate a crash:
1. Start a run.
2. Let it reach the Slack interrupt.
3. Kill the container.
4. Restart Docker.
5. Click Approve in Slack.
Because we use:
- the SQLite checkpointer
- a stable `thread_id`
- a persistent `checkpoints.db`

LangGraph resumes correctly.
That is production-level behaviour.
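For reference, the setup this depends on looks roughly like the following; `builder` stands in for the graph builder from the earlier posts, and the database path must sit on a Docker volume:

```python
import sqlite3

from langgraph.checkpoint.sqlite import SqliteSaver

# checkpoints.db must live on a volume that survives container restarts.
conn = sqlite3.connect("checkpoints.db", check_same_thread=False)
checkpointer = SqliteSaver(conn)

graph = builder.compile(checkpointer=checkpointer)
```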
## Step 5 — Prevent Double Resume

In the `/slack/actions` endpoint, add protection:

```python
if state.get("finalized"):
    return {"ok": True, "message": "Run already finalized."}
```
This makes Slack button clicks safe to repeat.
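In context, the guard might sit in the endpoint like this. The sketch assumes a FastAPI app and an `{"approved": ...}` resume payload, both carried over from the interrupt post; real Slack requests also need signature verification and payload parsing first:

```python
from langgraph.types import Command


@app.post("/slack/actions")
async def slack_actions(thread_id: str, approved: bool):
    config = {"configurable": {"thread_id": thread_id}}

    # Read the checkpointed state before resuming.
    snapshot = graph.get_state(config)
    if snapshot.values.get("finalized"):
        return {"ok": True, "message": "Run already finalized."}

    # Resume the interrupted run exactly once.
    graph.invoke(Command(resume={"approved": approved}), config=config)
    return {"ok": True}
```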
## Step 6 — Failure Classification

Update the report so the status covers every outcome, not just approval. The `revision_count` and `MAX_REVISIONS` names below assume the retry loop from the earlier posts; adjust them to match your state:

```python
if state.get("human_approved"):
    status = "approved"
elif state.get("revision_count", 0) >= MAX_REVISIONS:
    status = "max_revisions_exceeded"
else:
    status = "rejected"

state["report"]["status"] = status
```

Now your report clearly states one of:
- `approved`
- `rejected`
- `max_revisions_exceeded`
That clarity matters.
## Newsletter Example: What Changed

Before:

```text
out/
  newsletter.md   (shared across runs)
```

After:

```text
out/
  newsletter-9f3a2c1b/
    newsletter.md
    report.json
    report.md
```
Now you can:
- rerun safely
- compare runs
- archive by date
- audit history
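A few throwaway lines are enough to audit the archive, reusing `BASE_ARTIFACT_DIR` from Step 1:

```python
# List every archived run and the artifacts it produced.
for run_dir in sorted(BASE_ARTIFACT_DIR.iterdir()):
    if run_dir.is_dir():
        artifacts = sorted(path.name for path in run_dir.iterdir())
        print(f"{run_dir.name}: {artifacts}")
```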
## Generalising This Pattern
Everything we just did applies to:
- AI code generation workflows
- Customer support triage agents
- Document summarisation pipelines
- Security scanning workflows
- Data enrichment pipelines
Production rules for any AI workflow:
- Every run has a stable ID
- Every run has isolated artifacts
- Every external action is idempotent
- Resume must be safe
- Crashes must not corrupt state
- Logs must survive restarts
If you implement these, you are no longer experimenting.
You are operating.
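As a closing sketch, the rules collapse into a small harness around any compiled graph. It reuses `generate_thread_id` and `get_run_dir` from above, assumes the payload carries an `intro` field, and assumes the run completes without a pending interrupt:

```python
import json
from pathlib import Path


def run_workflow(graph, payload: dict) -> Path:
    thread_id = generate_thread_id(payload["intro"])  # stable run ID
    run_dir = get_run_dir(thread_id)                  # isolated artifacts
    config = {"configurable": {"thread_id": thread_id}}

    # Crash-safe: the checkpointer persists progress, so calling invoke
    # again with the same thread_id resumes rather than restarts.
    final_state = graph.invoke(payload, config=config)

    # Idempotent external action: write once, skip on repeats.
    report_path = run_dir / "report.json"
    if not report_path.exists():
        report_path.write_text(json.dumps(final_state.get("report", {}), indent=2))
    return run_dir
```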
## What You Have Now
Your system now includes:
- ✔ Worker
- ✔ Supervisor
- ✔ Retry loop
- ✔ Human approval
- ✔ Interrupt & resume
- ✔ MCP tool isolation
- ✔ Per-run artifact folders
- ✔ Idempotent finalize
- ✔ Crash safety
This is professional-level orchestration.
## What’s Next
Now that your system is stable, we need visibility.
Next post: **Observability and Structured Logging**.
We will add:
- Structured JSON logs
- Correlation IDs
- Execution timing
- Error categories
- Run summaries
Because production systems without observability are blind.
And blind systems fail quietly.
You’re not just building agents anymore.
You’re building operational AI infrastructure.