D Field study 04

Thesis Loop Resilience Trust Work Team

DLFT.AI / TRUST 2026 — Q2

When to trust AI.

Six principles, hard-won in production. Anyone can call an LLM. Putting one in production for a paying customer requires rules.

Most AI demos are interesting. Most AI products are not. The difference is everything that surrounds the model — the cost ledger, the safety gates, the prompt cache, the retry semantics, the fallback chain.

This is what we've learned running thirty-plus model integrations across one platform: ad evaluation, customer service, image generation, classification, translation, brand DNA, supplier sourcing. The model is the easy part. These rules are the rest.

The six

i.The AI writes. Code decides.trust → safety ii.Cheap models, often. Smart models, rarely.cost → economics iii.One-shot beats agent loops.reliability → ops iv.Cache the prompt, not the answer.cost → caching v.Isolate spend by purpose.cost → control vi.Treat tokens like cash.cost → ledger

Principle 01 · trust

The AI writes. Code decides.

Models are excellent at language. They are unreliable at irreversible actions.

We let the AI draft refunds, write apologies, classify intents, summarise documents, propose ad copy. Around every action that costs money or moves something we cannot un-move, we wrap a deterministic gate. The AI proposes; the rules dispose.

This is not a guard against future hallucination. It is a precondition for shipping today. A model that's right 99 percent of the time is wrong 1,200 times across 120,000 monthly customer emails. You design for the 1,200.

In our stackA multilingual fraud-pattern scorer above the customer-service AI caps refund amounts — sometimes to zero — regardless of what the model offered. The AI never decides what to spend. Where it bitesCross-store dispute history. One shopper hitting four brands with the same playbook is invisible to a single-store gate. We see all four.

"The model that's right 99 percent of the time is wrong 1,200 times a month. You design for the 1,200."

ii.

Principle 02 · cost

Cheap models, often. Smart models, rarely.

Most teams reach for the strongest model available. Token economics turn that habit into a tax that compounds.

We use Haiku for the high-volume, structured work — keyword extraction, classification, translation, draft replies, cache key fills. Sonnet handles the richer reasoning where it earns its margin. Opus is reserved for once-per-brand work where its depth is the product, not a luxury.

The discipline isn't only financial. Cheap models are faster, which means tighter feedback loops, which means more iterations, which means a better product.

Distribution of LLM calls in production last 30 days

Haiku 4.5

~78%

Sonnet 4.6

~19%

Opus 4.6

~3%

Workhorse, specialist, surgeon. Three roles, three tiers, deliberately picked per task.

"Token economics aren't a footnote. They're the design."

iii.

Principle 03 · reliability

One-shot beats agent loops.

Agentic frameworks make for great demos and unreliable production.

We don't run multi-turn agent loops or chained tool-use. Every model call has one job: take this input, return this structured output, log the cost. If a task needs five steps, it is five calls — each independently retryable, each independently observable, each independently priced.

This is the unfashionable answer. It is also why our AI bill is predictable, our latency is bounded, and our error budget belongs to us, not to a framework's internal planner.

Across the stack30+ scoped Claude integrations. Zero multi-turn agentic loops. Zero tool-use chains. Every call is one prompt, one structured response. When we do orchestrateIt happens in our code, not the model's. Brand DNA → concept → image → ad copy is a five-step pipeline of single calls — checkpointed, resumable, costed.

iv.

Principle 04 · cost

Cache the prompt, not the answer.

The system prompt is large and stable across thousands of calls. The answer is small and unique to each call.

Most teams instinctively cache the answer — and then watch their cache hit rate hover around two percent because every customer message is different. We do the inverse. The per-brand strategy text, the few-shot examples, the schema specification — all of that lives in a cached prefix. The model pays for the heavy reading once per brand, not once per customer.

The result is a reply that is both cheaper and faster than the same model called naively, with no trade-off in quality.

What we cachePer-store strategy, brand voice, few-shot replies, fraud-pattern lexicons. The prompt body — large, slow-moving, expensive to re-read. What we don'tThe customer's actual message. The model's actual reply. Variable, single-use, cheap.

"Cache what doesn't change. Pay attention to what does."

Principle 05 · control

Isolate spend by purpose.

One credit card with one limit and one shared bill is how AI budgets quietly explode.

We run separate API keys per category — one for the ads pipeline, one for customer service, one for creative generation. A runaway loop in ad-creative cannot touch the customer-service budget. A misconfigured prompt cannot starve the production pipeline. Each domain has its own wallet, its own ceiling, and its own alarms.

This is wallet-level discipline, not after-the-fact analytics. The category line is the budget line.

In practiceTwo Anthropic keys (ads, everything-else). Per-vendor keys for each image provider. Per-domain limits enforced at the wallet, not the application. Why it mattersA new feature shipping with a 10× cost regression should hit a ceiling, not a credit-card limit at 3 a.m.

vi.

Principle 06 · ledger

Treat tokens like cash.

Every model call writes a row into our cost ledger: model, input tokens, output tokens, cents, brand, feature, timestamp.

We know which features cost what, per brand, per hour. We know which models earn their keep and which don't. We know which prompt change made the bill quietly double last Tuesday. The reason we can run AI cheaply is we never stopped looking at the bill.

If you cannot answer "what did this customer cost us today" in one query, you do not have an AI product. You have an experiment.

Per call we logmodel · input tok · output tok · cached tok · cents · brand · feature · ms Reports we runCost per brand · cost per ticket · cost per conversion · cost per dropped call. Daily, in one query.

"If you cannot answer 'what did this cost us today' in one query, you have an experiment, not a product."

The AI is not the product.
The discipline around the AI is.