Why do AI pilots break in production?

Three reasons, in order of frequency. First, the production input distribution is wider than the pilot: users do things the pilot did not test for, and the model handles them badly. Second, cost and latency surprises hit at scale that did not show at pilot volume. Third, the non-AI fallback was never built, so when the model breaks the workflow has no graceful degradation. The pilot-to-production engineering work is what handles all three.

How long should the pilot-to-production process take?

Six to ten weeks for a focused AI feature with a clear input, output, and success metric. Faster usually means the eval harness or the fallback got skipped, which compresses the timeline now and costs more later. Slower usually means the original pilot scope is being expanded under the cover of 'productionization'; call that out and re-scope explicitly.

Do we need a full eval harness for every AI feature?

Yes, scaled to the workflow's criticality. A low-stakes internal tool needs a small eval set (50 to 100 examples) and a manual quality check. A customer-facing or revenue-touching feature needs a richer eval (300 to 1,000 examples), automated metrics, and human review of a sample on every model change. The principle is the same; what changes is the budget you spend building it.

What is the most common production failure mode we should expect?

Quality drift in months four to seven. The model performs at launch, the prompt or retrieval index slowly diverges from the workflow's evolving needs, nobody is closely monitoring, and by month nine the feature has become unreliable enough that the team stops using it. The countermeasure is monthly eval-against-baseline, surfaced on a dashboard the team actually looks at. It is unglamorous and it is the difference between AI features that live and ones that die.

Who should own the AI feature after launch?

A named engineer on the client team, supported by the original product owner. Not the AI vendor, not the consulting firm that built it, not 'the team'. Unowned AI is the same risk class as unowned production code: it works until it does not, and when it does not, nobody knows. The handoff document names the owner explicitly, and the on-call rotation includes them from day one.

From ChatGPT pilot to production AI: the engineering steps founders skip

The premise

Moving an AI feature from pilot to production is the work of turning a demo that impressed leadership into a system that survives a year of real users, real edge cases, and real cost pressure. It is where most AI projects quietly fail, not because the model is wrong, but because the seven engineering steps between a working prototype and a deployed feature get skipped or compressed.

The pilot worked. The founder ran it on their laptop, asked it five questions, got five good answers, and showed it to the board. Funding follows. Three months later, the team has a Slack channel full of complaints, a vendor bill that is three times the projection, and a feature that the support team has started routing around. The gap between the pilot and the production deployment is where the project went off the rails, and the gap is predictable.

This piece walks through the seven steps, in order, with the failure modes at each one. It is written for founders and engineering leads who have a working AI prototype and want it to land in production as a feature their team can defend. The framing is opinionated; the steps are not optional.

The seven steps

From demo to deployment, in the order they need to happen

Each step exists because skipping it is what makes the deployment fail.

Step one: define the production failure modes. A demo only has to work; a production feature has to fail well. What does the feature do when the model is slow, when the model is wrong, when the input is malformed, when the user behaves adversarially? Most pilots have no answer; production features need an answer for each. Step two: build the evaluation harness. A frozen dataset of 100 to 500 representative inputs, the metrics that matter, the threshold below which the feature is disabled. Until the eval exists, the model can change but you cannot tell whether the change was an improvement.

Step three: cost and latency budgets. What is the per-request cost ceiling, the p95 latency budget, the monthly spend cap? If these are not specified, the feature will silently exceed all three by month two. Step four: guardrails at the boundary. PII redaction on input, prompt-injection detection, output filtering for the policy categories that apply, refusal taxonomy for the cases the model should not handle. The pilot did none of this and got away with it because the only user was the founder. Step five: the non-AI fallback. Every AI-assisted workflow needs a non-AI path the business can return to within minutes when the model breaks, drifts, or gets priced out of reach. The fallback is not a UX dialog; it is a working manual process.

Step six: observability. Per-request logging of inputs, outputs, latency, cost, and the eval score where applicable. Without these, the team is debugging blind. Step seven: the handoff. Documentation, runbooks, the eval set, the dashboard, the on-call rotation. The feature is not in production until the team that will operate it can do so without the team that built it. Most of the cost overruns we see come from skipping step seven: the build team becomes the permanent operations team, and the unit economics shift accordingly.

Fig.: From demo to deployment, in the order they need to happen

The steps founders skip

Eval, fallback, and observability, every time

Across the projects we have rescued, three of the seven steps are skipped almost every time: the evaluation harness, the non-AI fallback, and the observability layer. The eval gets skipped because it feels like overhead: the model works on the inputs the team has tried, and a frozen test set 'is for later'. Then the team needs to change the prompt, or swap the model, or add a context source, and they have no way to know whether the change made the feature better or worse. Most prompt-engineering disasters in production AI are eval-discipline disasters, not prompt disasters.

The non-AI fallback gets skipped because it feels pessimistic: the team has just built the AI feature, the last thing they want to think about is the world where it does not work. Then six months in, the model provider has a partial outage, or the cost has tripled, or the model has been deprecated, or the regulatory environment changes, and the business has no fallback. The cost of the outage is what the fallback would have cost to build, three times over.

Observability gets skipped because the pilot did not need it. The only user was the founder; the founder remembered what they typed. In production, the team will need to debug a complaint that came in three days ago, about a workflow that touched eight inputs, none of which were logged. The team will spend a week trying to reproduce the bug from memory and fail. The retrofit of observability is more expensive than building it in.

Fig.: Eval, fallback, and observability, every time

What production-ready actually looks like

The shipping checklist, line by line

A production-ready AI feature has, at minimum, the following: a frozen evaluation dataset committed to the repo; the eval harness running on every prompt or model change; latency, cost, and quality dashboards reviewed weekly; PII redaction and prompt-injection detection at the boundary; a documented non-AI fallback with a tested switch-over procedure; per-request logging with retention sized to the longest expected debug window; a runbook for the on-call engineer; and a documented owner who is responsible for the feature's metrics at month twelve.

Each item exists because we have seen the failure that happens when it is missing. Each item also costs less to build than the cost of the failure. The economics are not subtle: the team that ships these seven things spends a few additional weeks at launch and saves a few additional quarters of debugging and rebuilding.

We refuse to deploy AI without the checklist. Not because we want to look thorough, but because the alternative is a deployment that the client cannot maintain once we leave, which is not a deliverable. The handoff includes every item on the checklist, version-controlled, with the documentation written for the engineer who will inherit it.

Fig.: The shipping checklist, line by line

Before / after

What the seven steps actually change

Four representative shifts we have seen when the gap between pilot and production is closed deliberately. Each one is from a real engagement; the numbers are the engagement's, not benchmarks.

Before

Pilot AI feature works for the founder, breaks for the support team in three different ways the founder did not anticipate. Team loses confidence in two weeks.

After

Production deployment ships with documented failure modes, an eval harness, and a non-AI fallback. The support team uses it daily by month two; the fallback is tested quarterly and never needed in production. Pattern from Beauty's reminder engine.

Takeaway · Production AI fails for users in ways the pilot did not, on a regular schedule. Plan for it explicitly.

Before

Monthly model bill at the pilot stage is small. Production deployment doubles user count, retrieval calls quadruple, the bill goes from $200 to $4,800 in a quarter. Finance is unhappy.

After

Cost ceiling per request set at deployment. Caching layer caps repeated retrievals. Monthly review tracks cost-per-feature against the budget. Bill grows linearly with usage, not super-linearly. Predictable economics replace surprise overages.

Takeaway · Cost is an engineering output, not a finance surprise. Specify the budget before the deploy, not after.

Before

A bug report says the AI feature gave a wrong answer to a customer three days ago. Nobody logged the input, output, or model version. The team spends a week trying to reproduce.

After

Per-request logging captures inputs, outputs, latency, and model version with 30-day retention. The bug is reproduced in fifteen minutes. The fix lands the same day.

Takeaway · Observability is the difference between debugging and guessing. Build it in.

Before

Model provider deprecates the version the deployment runs on with 60 days notice. The team has no eval, no abstraction layer, and no time. Quality drops when they swap models under pressure.

After

Model interface is abstracted at deploy. Eval harness runs against the new model before the swap. Quality is verified before traffic moves. The deprecation is a configuration change, not a fire.

Takeaway · Models are commodities. Treat them as commodities at the architecture stage and the swaps become routine.

Fig.: What the seven steps actually change

How SDEN ships pilots to production

Three commitments on every deployment

The pilot-to-production gap is where AI projects fail. The commitments below are how we close it.

Eval before deploy

A frozen evaluation dataset, the metrics that matter, and the threshold below which the feature is disabled. Committed to the client's repo. The eval is the contract between the model and the workflow.

Fallback that actually works

Every AI-assisted workflow has a non-AI fallback the business can return to within minutes. We test it quarterly. It exists for the day the model breaks, and that day always comes.

Handoff is the deliverable

Documentation, runbooks, dashboards, on-call rotation. The feature is not in production until your team can operate it without ours. An AI feature you cannot maintain without us is a dependency, not a deliverable.

What good looks like

A year after the deploy

The right test of a production AI deployment is what it looks like twelve months later, not on launch day.

The deployments that age well share three properties. The eval set has been updated at least twice as the workflow evolved, not abandoned. The fallback has been tested at least once for real, even if the model never broke, confirming that the path still works. The team that operates the feature is not the team that built it; the handoff actually transferred ownership.

The deployments that fail share the opposite three. The eval set is six months out of date, because nobody owns it. The fallback exists in documentation only, untested. And the original engineering team is still the de facto support team, because the documentation never made it possible for anyone else to take over.

When SDEN finishes an AI engagement, the handoff is what the deliverable is judged on. The feature works on day one; that is table stakes. The feature is still working on day three hundred and sixty-five, owned by your team: that is the engagement landing.

Fig.: A year after the deploy

FAQ

AI engineering:
questions we get asked.

Direct answers to the questions we get asked the most. If yours isn't covered, write to the team.

Contact the team

From ChatGPT pilot to production AI: the engineering steps founders skip

From demo to deployment, in the order they need to happen

Eval, fallback, and observability, every time

The shipping checklist, line by line

What the seven steps actually change

Three commitments on every deployment

Eval before deploy

Fallback that actually works

Handoff is the deliverable

A year after the deploy

AI engineering:
questions we get asked.

Related on SDEN

Custom AI workflows vs off-the-shelf tools: when each one wins

AI ROI for founders: measuring what AI is actually worth

AI & Machine Learning expertise

Got a project worth building?

From demo to deployment, in the order they need to happen

Eval, fallback, and observability, every time

The shipping checklist, line by line

What the seven steps actually change

Three commitments on every deployment

Eval before deploy

Fallback that actually works

Handoff is the deliverable

A year after the deploy

AI engineering:questions we get asked.

Why do AI pilots break in production?

How long should the pilot-to-production process take?

Do we need a full eval harness for every AI feature?

What is the most common production failure mode we should expect?

Who should own the AI feature after launch?

Related on SDEN

Custom AI workflows vs off-the-shelf tools: when each one wins

AI ROI for founders: measuring what AI is actually worth

AI & Machine Learning expertise

Got a project worth building?

AI engineering:
questions we get asked.