Field study 03
DLFT.AI / RESILIENCE 2026 — Q2

Built to fail.
Designed to heal.

The platform expects servers to die. It just doesn't notice when they do.

Most operations stacks fall over when something crashes. Ours runs hundreds of jobs a minute, survives the kind of failures that would silently bleed money elsewhere, and finishes the bookkeeping by the time you reload the page.

Live · self-healing · zero on-call
[Live demo, auto loop. Panels: a job queue (jobs 4711 to 4714) with a server reboot; a five-stage pipeline run (Search, Filter, Track, Audit, Report) resumed from stage iii; campaign linen-suit-eu auto-killed after €42 spent and 0 sales, receipt stamped recovered; 47 stuck jobs cleared. Dashboard at dlft.ai/dashboard: all systems nominal, pipeline 208/208 with 0 dropped, crash auto-recovered in 4.2s.]
Nothing broken. Nothing duplicated. Nothing leaked.
Mean recovery
4.2s
From server crash to a fully restored audit trail. Faster than a page refresh.
Money leaked to crashes
€0
No silent ad-spend gaps, no missed pauses, no orphaned receipts. Verifiable in the audit log.
Manual interventions
0
No on-call rotation. No 3 a.m. pages. The system fixes itself before anyone notices.
Decisions per hour
12k+
Per active brand. Pipeline runs, ad evaluations, scrape rounds, AI calls — all surviving the same chaos.

"We treat crashes the way an airline treats turbulence: expected, accounted for, invisible to passengers."

— DLFT engineering principle № 02
What this means in practice
A normal stack, on a bad day
  • A scrape spikes memory. Server dies.
  • An ad campaign is paused on Facebook — but the database never recorded it.
  • Stuck jobs block the next schedule from firing.
  • Someone gets paged at 3 a.m. to clean up.
  • Money leaks for hours before anyone notices.
DLFT, on the same bad day
  • Server dies. A fresh one boots in seconds.
  • The half-finished pause is finished automatically — receipt stamped recovered.
  • Stuck jobs are cleared on boot, schedule fires on time.
  • Nobody is paged. The next dashboard refresh shows green.
  • You read about it later — if you read about it at all.
For engineers · how it actually works
Job claim
Postgres-native queue. Workers atomically claim rows with SELECT … FOR UPDATE SKIP LOCKED. No external broker, no Redis. Two workers can never claim the same job.
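A minimal sketch of the claim step, not DLFT's actual code. It assumes a hypothetical jobs table with id, status, created_at and started_at columns, and a hypothetical synchronous Db interface standing in for a real (asynchronous) Postgres client. Only the FOR UPDATE SKIP LOCKED clause itself comes from the text above.

```typescript
// Hypothetical schema: jobs(id, status, created_at, started_at).
// The inner SELECT locks one queued row; SKIP LOCKED makes concurrent
// workers pass over rows that another transaction has already locked.
const CLAIM_SQL = `
  UPDATE jobs
     SET status = 'running', started_at = now()
   WHERE id = (
           SELECT id
             FROM jobs
            WHERE status = 'queued'
            ORDER BY created_at
            LIMIT 1
            FOR UPDATE SKIP LOCKED
         )
  RETURNING id`;

// Assumed stand-in for a database client (synchronous for brevity).
interface Db {
  query(sql: string): { rows: { id: number }[] };
}

// Returns the claimed job id, or null when the queue is empty
// or every queued row is currently locked by another worker.
function claimNextJob(db: Db): number | null {
  const result = db.query(CLAIM_SQL);
  const row = result.rows[0];
  return row ? row.id : null;
}
```

Because the lock and the status change happen in one statement, a worker that loses the race simply gets the next unlocked row instead of waiting.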
Boot amnesty
On every container boot, jobs left in the running state for more than 4h are marked failed. This eliminates ghost rows left behind by SIGKILL'd containers, with no manual cleanup.
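The boot sweep can be sketched as one UPDATE plus an in-memory equivalent for unit tests. The table name, column names, and the applyBootAmnesty helper are assumptions for illustration; only the 4h threshold and the running-to-failed transition come from the text.

```typescript
// The "> 4h" threshold from the text, in milliseconds.
const STALE_AFTER_MS = 4 * 60 * 60 * 1000;

// Hypothetical SQL form of the sweep (assumed schema: jobs.status, jobs.started_at).
const BOOT_AMNESTY_SQL = `
  UPDATE jobs
     SET status = 'failed'
   WHERE status = 'running'
     AND started_at < now() - interval '4 hours'`;

type JobStatus = "queued" | "running" | "done" | "failed";

interface JobRow {
  id: number;
  status: JobStatus;
  startedAt: Date | null;
}

// In-memory equivalent of the SQL above: marks stale running jobs as
// failed (mutating the rows) and returns the ids that were cleared.
function applyBootAmnesty(jobs: JobRow[], now: Date): number[] {
  const cleared: number[] = [];
  for (const job of jobs) {
    const stale =
      job.status === "running" &&
      job.startedAt !== null &&
      now.getTime() - job.startedAt.getTime() > STALE_AFTER_MS;
    if (stale) {
      job.status = "failed";
      cleared.push(job.id);
    }
  }
  return cleared;
}
```

Jobs that have been running for less than the threshold are left alone, which avoids failing work that another live container may still be doing.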
Three-phase kill
Every campaign pause follows a write-ahead pattern: record the intent, make the API call, then commit, with the commit-time writes made in one transaction. If the worker dies between steps 2 and 3, recoverPendingKills() finishes the bookkeeping on the next boot. Meta's graphPost(PAUSED) is idempotent, so re-pausing is safe.
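A minimal in-memory sketch of the intent, API call, commit sequence and its recovery path. The name recoverPendingKills appears in the text, but its signature does not; the KillLog class, the row shape, the status strings, and pauseCampaign are all assumptions, and the API call is modeled as a plain synchronous function for brevity.

```typescript
type Receipt = "normal" | "recovered";

interface KillRow {
  campaignId: string;
  status: "intent" | "committed";
  receipt: Receipt | null;
}

// Stand-in for the external pause call. The text states that
// re-pausing is idempotent, which is what makes recovery safe.
type PauseFn = (campaignId: string) => void;

// Assumed in-memory stand-in for the durable kill log table.
class KillLog {
  rows: KillRow[] = [];

  recordIntent(campaignId: string): KillRow {
    const row: KillRow = { campaignId: campaignId, status: "intent", receipt: null };
    this.rows.push(row);
    return row;
  }

  commit(row: KillRow, receipt: Receipt): void {
    row.status = "committed";
    row.receipt = receipt;
  }

  pending(): KillRow[] {
    return this.rows.filter((r) => r.status === "intent");
  }
}

function pauseCampaign(log: KillLog, pause: PauseFn, campaignId: string): void {
  const row = log.recordIntent(campaignId); // step 1: write the intent first
  pause(campaignId);                        // step 2: external API call
  log.commit(row, "normal");                // step 3: commit the outcome
}

// Run on boot: any row still at "intent" belongs to a worker that died
// before committing. Repeat the (idempotent) pause, then commit.
function recoverPendingKills(log: KillLog, pause: PauseFn): number {
  const pending = log.pending();
  for (const row of pending) {
    pause(row.campaignId);
    log.commit(row, "recovered");
  }
  return pending.length;
}
```

The recovery path does not need to know whether the original API call went through; repeating an idempotent pause gives the same end state either way.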
Phase-level resume
Multi-step pipelines store per-phase status. A new container picks up at the exact phase that was interrupted, not from the start. Resume counter caps at 3 — a deterministic crash auto-disables the schedule instead of looping forever.
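One way to picture the resume decision, as a sketch under assumptions: the PipelineRun shape and resumePoint helper are hypothetical, and the cap is read here as "at most three resumes, then disable the schedule", which is one plausible interpretation of the text.

```typescript
// Resume cap from the text.
const MAX_RESUMES = 3;

type PhaseStatus = "pending" | "running" | "done";

interface PipelineRun {
  phases: { name: string; status: PhaseStatus }[];
  resumeCount: number;
  scheduleEnabled: boolean;
}

// Called on boot for an interrupted run. Returns the index of the phase
// to execute next, or null when the run is complete or the resume cap
// has been reached (in which case the schedule is disabled).
function resumePoint(run: PipelineRun): number | null {
  let next = -1;
  for (let i = 0; i < run.phases.length; i++) {
    const phase = run.phases[i];
    if (phase && phase.status !== "done") {
      next = i;
      break;
    }
  }
  if (next === -1) return null; // every phase already finished

  if (run.resumeCount >= MAX_RESUMES) {
    // The same run keeps dying: stop retrying instead of looping forever.
    run.scheduleEnabled = false;
    return null;
  }
  run.resumeCount += 1;
  return next;
}
```

With the five stages from the demo (Search, Filter, Track, Audit, Report) and a crash during Track, the function returns index 2, so Search and Filter are not repeated.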
Worker isolation
Heavy work (headless Chrome scrapes) runs in dedicated worker services, one per job kind. They share one Postgres but have isolated V8 heaps, so the web service stays responsive while a scrape spikes memory.
Observability
Every state transition is one row in BackgroundJobLog, PipelineStepLog, or MetaAutoKillLog. Replayable. Queryable. No lost incidents.
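To illustrate what "replayable" can mean for a one-row-per-transition log, here is a small sketch. BackgroundJobLog is named in the text, but its columns are not; the row shape below (jobId, to, at) and the replay helper are assumptions.

```typescript
// Assumed shape of one transition row; "at" is an ISO 8601 UTC timestamp.
interface JobLogRow {
  jobId: number;
  to: string; // state the job moved into, e.g. "running" or "failed"
  at: string;
}

// Rebuilds the latest known state of every job by applying the
// transitions in timestamp order. The input array is not modified.
function replay(rows: JobLogRow[]): Record<number, string> {
  const ordered = rows.slice().sort((a, b) => (a.at < b.at ? -1 : a.at > b.at ? 1 : 0));
  const state: Record<number, string> = {};
  for (const row of ordered) {
    state[row.jobId] = row.to;
  }
  return state;
}
```

Since the log is append-only, the same replay run up to an earlier timestamp would reconstruct what the system believed at that moment.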

Resilience is the unglamorous half of autonomy. We built both.
