Crash Recovery
Crash recovery is the core durability guarantee. Reflow resumes workflows from the last completed step: if your process dies after step 2 of 5, a later engine instance reclaims the run and continues at step 3. Steps whose output has already been persisted are never re-executed.
// Process crashes here after 'charge' completed but before 'fulfill'.
// On restart, the engine claims the run and skips 'charge'.
await engine.start()
How it works
- Per-step checkpointing. Each step's output is persisted before the next step runs. Completed steps are reused, not re-run.
- Leases. Claiming a run takes a lease valid for runLeaseDurationMs. While executing, the engine heartbeats the lease.
- Stale reclamation. If a worker crashes, it stops heartbeating. Once the lease expires, any engine's next tick() can reclaim the still-running run and resume it.
This is why a stopped or crashed engine leaves in-flight runs in the running state rather than failing them — that state is what makes them reclaimable.
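The interplay of checkpoints, leases, and stale reclamation can be sketched with a small in-memory model. This is an illustration under stated assumptions, not Reflow's implementation: RunRecord, tryClaim, heartbeat, execute, and the LEASE_MS value are all hypothetical names, time is passed in explicitly to keep the example deterministic, and the heartbeat happens once per step here, whereas a real engine would more likely heartbeat on a timer.

```typescript
// Hypothetical in-memory model; none of these names come from Reflow's API.
interface RunRecord {
  status: "running" | "completed";
  outputs: Map<string, unknown>; // checkpoints: one persisted output per finished step
  leaseOwner: string | null;
  leaseExpiresAt: number; // epoch milliseconds
}

interface Step {
  name: string;
  run: () => unknown;
}

const LEASE_MS = 30_000; // stands in for runLeaseDurationMs; the value is made up

// A claim succeeds when the run is unowned or its lease has expired (stale reclamation).
function tryClaim(run: RunRecord, worker: string, now: number): boolean {
  if (run.status !== "running") return false;
  const free = run.leaseOwner === null || run.leaseExpiresAt <= now;
  if (!free && run.leaseOwner !== worker) return false;
  run.leaseOwner = worker;
  run.leaseExpiresAt = now + LEASE_MS;
  return true;
}

// Extend the lease, but only while this worker still holds a live one.
function heartbeat(run: RunRecord, worker: string, now: number): boolean {
  if (run.leaseOwner !== worker || run.leaseExpiresAt <= now) return false;
  run.leaseExpiresAt = now + LEASE_MS;
  return true;
}

// Run steps in order, skipping checkpointed ones. `crashAfter` simulates the
// process dying right after that step's output was persisted.
function execute(run: RunRecord, steps: Step[], worker: string, now: number, crashAfter?: string): void {
  for (const step of steps) {
    if (run.outputs.has(step.name)) continue; // checkpointed: reuse, do not re-run
    if (!heartbeat(run, worker, now)) return; // lease lost: stop working on this run
    run.outputs.set(step.name, step.run()); // persist before the next step starts
    if (step.name === crashAfter) return; // simulated crash: status stays "running"
  }
  run.status = "completed";
  run.leaseOwner = null;
}

const calls: string[] = [];
const steps: Step[] = ["validate", "charge", "fulfill"].map((name) => ({
  name,
  run: () => {
    calls.push(name);
    return `${name}-ok`;
  },
}));

const run: RunRecord = { status: "running", outputs: new Map(), leaseOwner: null, leaseExpiresAt: 0 };

// Worker A claims the run at t=0 and "crashes" after 'charge'.
tryClaim(run, "worker-a", 0);
execute(run, steps, "worker-a", 0, "charge");

// Worker B is refused while A's lease is live, then reclaims once it has expired.
const claimedEarly = tryClaim(run, "worker-b", 10_000);
const claimedLate = tryClaim(run, "worker-b", 30_000);
execute(run, steps, "worker-b", 30_000);
```

In this model the run stays in the running state after the simulated crash, which is exactly what lets the second worker pick it up once the lease lapses. The second worker executes only 'fulfill', because 'validate' and 'charge' already have persisted outputs.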
Make steps idempotent
A step may run more than once across recoveries — for example, if the process dies after a side effect but before its result is persisted. Design steps so a repeat is safe:
- Use idempotency keys on external calls (Stripe, your own APIs).
- Prefer upserts over blind inserts.
- Check-then-act against external state when you can.
Parallel branches follow the same rule, and recovery is per-branch: already-completed branches are skipped on resume.
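The idempotency guidance above can be illustrated with a short sketch. It assumes a hypothetical FakePayments class standing in for an external API that honors idempotency keys, and a hypothetical chargeStep and recordOrder; the key format (run ID plus step name) is one reasonable choice, not something Reflow prescribes.

```typescript
// Hypothetical stand-in for an external payment API that honors idempotency keys.
interface Charge {
  id: string;
  amount: number;
}

class FakePayments {
  private byKey = new Map<string, Charge>();
  chargesCreated = 0;

  charge(amount: number, idempotencyKey: string): Charge {
    const existing = this.byKey.get(idempotencyKey);
    if (existing) return existing; // replayed request: return the original result
    this.chargesCreated += 1;
    const created: Charge = { id: `ch_${this.chargesCreated}`, amount };
    this.byKey.set(idempotencyKey, created);
    return created;
  }
}

// The key is derived from stable identifiers, so a re-run after a crash
// sends the same key and the provider deduplicates the call.
function chargeStep(payments: FakePayments, runId: string, amount: number): Charge {
  return payments.charge(amount, `${runId}:charge`);
}

// Upsert instead of blind insert: writing the same order twice leaves one row.
function recordOrder(orders: Map<string, string>, orderId: string, status: string): void {
  orders.set(orderId, status);
}

const payments = new FakePayments();
const orders = new Map<string, string>();

// First attempt: the side effects happen, then the process "dies" before the
// step result is persisted, so the step runs again on recovery.
const firstAttempt = chargeStep(payments, "run-123", 4200);
recordOrder(orders, "order-9", "paid");

const secondAttempt = chargeStep(payments, "run-123", 4200);
recordOrder(orders, "order-9", "paid");
```

Both attempts yield the same charge, and only one charge and one order row exist afterwards. A key built from a random value generated inside the step would defeat this, since each re-run would produce a new key.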
Graceful shutdown
engine.stop() aborts in-flight steps (via their signal) and leaves their runs running so another instance — or the same one on restart — can reclaim them. It does not mark them failed.
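For an abort to take effect, a step has to observe its signal. The sketch below assumes the signal is a standard AbortSignal and that the step receives it as an argument; the exact way Reflow hands the signal to a step is not shown in this section, so fulfillStep, shipOne, and their parameters are hypothetical.

```typescript
// Hypothetical unit of work; a real step would make a network call here.
async function shipOne(item: string, shipped: string[]): Promise<void> {
  shipped.push(item);
}

// Hypothetical step body: checks the signal between units of work so a
// shutdown interrupts it at a clean boundary instead of mid-item.
async function fulfillStep(items: string[], signal: AbortSignal, shipped: string[]): Promise<void> {
  for (const item of items) {
    if (signal.aborted) {
      const err = new Error("step aborted during shutdown");
      err.name = "StepAborted";
      throw err;
    }
    await shipOne(item, shipped);
  }
}

const controller = new AbortController();
const shipped: string[] = [];

// Start the step, then abort while the first item is in flight.
// In this sketch, controller.abort() plays the role of a shutdown.
const pending = fulfillStep(["a", "b", "c"], controller.signal, shipped);
controller.abort();

const outcome: Promise<string> = pending.then(
  () => "finished",
  (err) => (err && err.name === "StepAborted" ? "aborted" : "failed"),
);
```

Here the step stops before the second item. Because the interrupted step has no persisted result, it runs again from its beginning when the run is reclaimed, so the first item would be shipped a second time unless the shipping call is idempotent.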