Built to fail.
Designed to heal.

The platform expects servers to die. It just doesn't notice when they do.

Most operations stacks fall over when something crashes. Ours runs hundreds of jobs a minute, survives the kind of failures that would silently bleed money elsewhere, and finishes the bookkeeping by the time you reload the page.

Live · self-healing · zero on-call

Live demo · auto loop iNormal day

Mean recovery

4.2s

From server crash to a fully restored audit trail. Faster than a page refresh.

Money leaked to crashes

€0

No silent ad-spend gaps, no missed pauses, no orphaned receipts. Verifiable in the audit log.

Manual interventions

No on-call rotation. No 3 a.m. pages. The system fixes itself before anyone notices.

Decisions per hour

12k+

Per active brand. Pipeline runs, ad evaluations, scrape rounds, AI calls — all surviving the same chaos.

"We treat crashes the way an airline treats turbulence: expected, accounted for, invisible to passengers."

— DLFT engineering principle № 02

What this means in practice

A normal stack, on a bad day

A scrape spikes memory. Server dies.
An ad campaign is paused on Facebook — but the database never recorded it.
Stuck jobs block the next schedule from firing.
Someone gets paged at 3 a.m. to clean up.
Money leaks for hours before anyone notices.

DLFT, on the same bad day

Server dies. A fresh one boots in seconds.
The half-finished pause is finished automatically — receipt stamped recovered.
Stuck jobs are cleared on boot, schedule fires on time.
Nobody is paged. The next dashboard refresh shows green.
You read about it later — if you read about it at all.

For engineers · how it actually works

Job claim: Postgres-native queue. Workers atomically claim rows with SELECT … FOR UPDATE SKIP LOCKED. No external broker, no Redis. Two workers can never claim the same job.
Boot amnesty: On every container boot, jobs left in running for > 4h are marked failed. Eliminates ghost rows from SIGKILL'd previous containers without manual cleanup.
Three-phase kill: Every campaign pause is a write-ahead log: intent → API call → commit, all in one transaction at commit time. If the worker dies between steps 2 and 3, recoverPendingKills() finishes the bookkeeping on the next boot. Meta's graphPost(PAUSED) is idempotent — re-pausing is safe.
Phase-level resume: Multi-step pipelines store per-phase status. A new container picks up at the exact phase that was interrupted, not from the start. Resume counter caps at 3 — a deterministic crash auto-disables the schedule instead of looping forever.
Worker isolation: Heavy work (headless Chrome scrapes) runs in dedicated worker services per kind, sharing one Postgres but isolated V8 heaps. The web service stays responsive while a scrape spikes memory.
Observability: Every state transition is one row in BackgroundJobLog, PipelineStepLog, or MetaAutoKillLog. Replayable. Queryable. No lost incidents.

Resilience is the unglamorous half of autonomy. We built both.

← back to dlft.ai

Built to fail.Designed to heal.

Built to fail.
Designed to heal.