The platform expects servers to die. It just doesn't notice when they do.
Most operations stacks fall over when something crashes. Ours runs hundreds of jobs a minute, survives the kind of failures that would silently bleed money elsewhere, and finishes the bookkeeping by the time you reload the page.
"We treat crashes the way an airline treats turbulence: expected, accounted for, invisible to passengers."
Jobs are claimed with SELECT … FOR UPDATE SKIP LOCKED. No external broker, no Redis. Two workers can never claim the same job.

Jobs running for > 4h are marked failed. This eliminates ghost rows from SIGKILL'd previous containers without manual cleanup.

Intent → API call → commit, all in one transaction at commit time. If the worker dies between steps 2 and 3, recoverPendingKills() finishes the bookkeeping on the next boot. Meta's graphPost(PAUSED) is idempotent, so re-pausing is safe.

Every event is written to BackgroundJobLog, PipelineStepLog, or MetaAutoKillLog. Replayable. Queryable. No lost incidents.

Resilience is the unglamorous half of autonomy. We built both.
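
For readers who want to see the shape of two of these mechanisms, here is a minimal in-memory sketch of the stale-job sweep and the intent → API call → commit recovery. It is an illustration under stated assumptions, not the platform's actual code: only the 4h threshold, the step order, the idempotent pause, and the name recoverPendingKills() come from the text above. The types, fields, and the killAd, failStaleJobs, and PauseFn names are hypothetical, the database transaction is not modelled, and the pause call is synchronous here where a real API call would be asynchronous.

```typescript
// Hypothetical shapes: the page names recoverPendingKills() and the
// intent -> API call -> commit order, but not these types or fields.
interface KillIntent {
  adId: string;
  status: "pending" | "committed";
}

interface Job {
  id: number;
  status: "queued" | "running" | "done" | "failed";
  startedAt: number | null; // epoch milliseconds, null until claimed
}

// Stand-in for the remote pause call. Assumed idempotent, as the page
// says of graphPost(PAUSED): pausing an already-paused ad changes nothing.
type PauseFn = (adId: string) => void;

// The 4h threshold is taken from the page.
const STALE_AFTER_MS = 4 * 60 * 60 * 1000;

// Normal path: record the intent, call the API, then commit.
function killAd(log: KillIntent[], adId: string, pause: PauseFn): void {
  const intent: KillIntent = { adId, status: "pending" }; // step 1: intent
  log.push(intent);
  pause(adId); // step 2: API call
  intent.status = "committed"; // step 3: commit
}

// Boot-time recovery: a pending intent may or may not have reached the API
// before the crash, so re-issue the idempotent pause and then commit.
function recoverPendingKills(log: KillIntent[], pause: PauseFn): number {
  let recovered = 0;
  for (const intent of log) {
    if (intent.status !== "pending") continue;
    pause(intent.adId);
    intent.status = "committed";
    recovered++;
  }
  return recovered;
}

// Stale sweep: any job still "running" past the threshold is marked failed.
function failStaleJobs(jobs: Job[], now: number): number {
  let failed = 0;
  for (const job of jobs) {
    const stale =
      job.status === "running" &&
      job.startedAt !== null &&
      now - job.startedAt > STALE_AFTER_MS;
    if (stale) {
      job.status = "failed";
      failed++;
    }
  }
  return failed;
}
```

Because the pause is idempotent, recovery does not need to know whether the crash happened before or after the API call: replaying it is safe either way, and a second recovery pass finds nothing left to do.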