Why AI teams with great engineers still can't ship

I’ve sat in delivery reviews with teams full of genuinely strong engineers — people who’d clear any interview bar you set — and watched the same AI feature slip for the third month running. The instinct is to hunt for a weak link: a bad hire, a hard model, a missing tool. It’s almost never that. The engineers are fine. The system they’re shipping through isn’t.

The most expensive sentence in software

“We just need more time.”

It sounds reasonable, so nobody challenges it. But it hides the only question that matters: time to do what, exactly? On most stalled AI projects, nobody can answer that in a single sentence. “Done” was never defined — so it can’t be reached. The team isn’t slow. It’s aiming at a target that doesn’t exist.

Why AI makes it worse

In traditional software, “works” is roughly binary: the endpoint returns the right value or it doesn’t. You can feel when you’re finished.

In AI, “works” is a distribution. The demo works. The happy path works. The 90th-percentile input quietly doesn’t. Without an agreed bar — which inputs, what pass rate, which failures are acceptable — every review collapses into a vibe check. And vibe checks never converge. So the team keeps polishing a thing with no finish line. Great engineers, infinite loop.

The three things that actually block shipping

None of these are about talent. They’re properties of the system.

1. No definition of “shipped.” If you can’t write the acceptance criteria as a test, you don’t have a spec — you have a wish. For AI, “acceptance” means an eval set with a target pass rate on unseen inputs. No eval, no finish line, no ship.

2. Decision latency. The code gets written in hours. The decision to accept the tradeoff then sits in someone’s DMs for two weeks. The bottleneck is who-decides, not who-codes — and it’s almost always invisible on the board.

3. Nobody owns the last mile. The prototype has an owner. Production — monitoring, cost, rollback, the boring 20% — is everyone’s job and therefore no one’s. That 20% is the actual product.

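A team can be excellent at the wrong version of all three: polishing with no bar to clear, coding fast while decisions wait, prototyping with nobody on production. That’s what a strong team inside a weak system looks like: lots of motion, no shipping.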
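The first blocker is the easiest one to make concrete. Here is a minimal sketch of “shipped” written as a test, assuming exact-match grading and a made-up 90% target. The names EvalCase, pass_rate and is_shipped are hypothetical, and the toy model stands in for whatever you are actually evaluating.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    expected: str


def pass_rate(model: Callable[[str], str], cases: Iterable[EvalCase]) -> float:
    """Fraction of cases where the model output matches the expected answer."""
    cases = list(cases)
    if not cases:
        raise ValueError("eval set is empty: there is no bar to clear")
    # Assumption: exact match after stripping whitespace. Real evals often
    # need a looser grader; swap this comparison for your own.
    passed = sum(1 for c in cases if model(c.prompt).strip() == c.expected)
    return passed / len(cases)


def is_shipped(model: Callable[[str], str], held_out: Iterable[EvalCase],
               target: float) -> bool:
    """The acceptance test: clear the agreed bar on held-out inputs, or not done."""
    return pass_rate(model, held_out) >= target


# Hypothetical stand-in model and held-out set, for illustration only.
def toy_model(prompt: str) -> str:
    return prompt.upper()


held_out = [
    EvalCase("ok", "OK"),
    EvalCase("ship", "SHIP"),
    EvalCase("done", "DONE"),
    EvalCase("edge", "Edge"),  # the input the demo never exercised
]

print(pass_rate(toy_model, held_out))               # 0.75
print(is_shipped(toy_model, held_out, target=0.9))  # False
```

The point is less the code than the shape: the target is a number agreed before the work starts, so the review becomes a lookup instead of a debate.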

What to change (cheap, not easy)

Define “shipped” before a line of code is written. That means acceptance criteria written as an eval, one named owner, and a date that the decision-maker, not the coder, commits to.
Treat evals as acceptance tests, not research. If it doesn’t clear the bar on held-out inputs, it isn’t done. No debate, no vibe check.
Track decision latency like a bug. Measure the time from “blocked” to “unblocked.” It’s usually your biggest defect, and it never shows up in a standup.
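
Measuring decision latency does not need tooling beyond a log of when work got blocked and when it got unblocked. A rough sketch, assuming you record those two timestamps per item; the BlockedItem type, the item names and the dates are all invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass(frozen=True)
class BlockedItem:
    item: str
    blocked_at: datetime
    unblocked_at: Optional[datetime] = None  # None means still waiting


def latency(entry: BlockedItem, now: datetime) -> timedelta:
    """Time spent waiting on a decision; open items count up to `now`."""
    end = entry.unblocked_at or now
    return end - entry.blocked_at


# Hypothetical log: every value here is made up.
now = datetime(2024, 3, 16, 9, 0)
log = [
    BlockedItem("accept-latency-tradeoff",
                datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 15, 9, 0)),
    BlockedItem("pick-eval-threshold",
                datetime(2024, 3, 4, 9, 0), datetime(2024, 3, 5, 9, 0)),
    BlockedItem("approve-rollback-plan",
                datetime(2024, 3, 10, 9, 0)),  # still blocked
]

longest = max(log, key=lambda e: latency(e, now))
median_latency = median(latency(e, now) for e in log)

print(f"longest wait: {longest.item}, {latency(longest, now).days} days")
print(f"median wait: {median_latency.days} days")
```

Counting still-open items up to the present matters: the decisions that have not been made yet are usually the ones doing the most damage.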

The takeaway

Your engineers are probably fine. Shipping is not a property of individual talent — it’s a property of the system they work inside: the definitions, the ownership, the speed of decisions. Fix the system and the same people start shipping. Leave it, and hiring one more great engineer just adds another person to the loop.

Systems > Emotions.