Engineering

89% First-Pass: What Autonomous AI Build Flights Look Like

April 2026  |  7 min read

[Image: Quality inspection on a production line ensuring first-pass accuracy]

Devin hits 15% on real-world SWE tasks. Enterprise AI agents fail 95% of the time. We hit 89% on complete multi-file builds. Not single-function patches. Full system builds — websites, web apps, automation pipelines, multi-file projects shipped autonomously.

The gap between what AI coding tools promise and what they deliver in production is enormous. Most benchmarks measure whether an agent can fix a single bug in a single file. That's not building. That's patching. We wanted to know: can an autonomous agent take a scoped mission brief, execute a multi-file build with dozens of decisions, and deliver a working product on the first attempt?

After 29 documented build flights across four distinct mission sets, the answer is yes — 89.7% of the time.

[Chart: First-Pass Build Success Rate. VindexAI BuilderBob 89%, Devin (SWE-bench) 15%, Enterprise AI Agents 5%]

What's a Build Flight?

A build flight is not "write me a function." It's a scoped mission with a structured flight plan (JSON), explicit acceptance criteria, parallel execution capability, and structured result reporting. Each flight represents a complete deliverable — a full website, a database migration, an automation system, a multi-page SEO build, a memory architecture overhaul.

The flight plan defines the mission scope, the files to be created or modified, the verification steps, and the acceptance criteria. The agent receives the brief, executes autonomously, and returns structured results. No hand-holding. No "should I proceed?" prompts. Execute and report.

A single flight might involve creating 8-15 files, modifying database schemas, configuring infrastructure, writing tests, and deploying — all without human intervention. The agent makes 50 to 100+ individual decisions per flight. That's the unit of measurement that matters.
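The shape of a flight plan can be sketched as a structure like the following. This is an illustration only: the field names and values here are hypothetical, not the actual BuilderBob schema (real flight plans are JSON; a Python dict is used here for a runnable sketch):

```python
# Illustrative flight-plan structure. Field names and values are
# hypothetical, not the actual BuilderBob schema.
flight_plan = {
    "mission": "example-seo-build",
    "scope": "Generate service pages with structured metadata",
    "files": [
        {"path": "site/pages/services.html", "action": "create"},
        {"path": "site/sitemap.xml", "action": "modify"},
    ],
    "verification": [
        "All pages return HTTP 200 on local serve",
        "sitemap.xml lists every created page",
    ],
    "acceptance_criteria": [
        "Each new page passes the on-page SEO checklist",
    ],
}

# A pre-flight gate might reject any plan missing a required section.
required = {"mission", "scope", "files", "verification", "acceptance_criteria"}
assert required <= flight_plan.keys(), "flight plan incomplete"
```

The point of the gate at the end: a brief that is missing scope, verification, or acceptance criteria never launches.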

The Numbers

Twenty-nine documented flights across four mission sets. Twenty-six delivered on first pass. That's 89.7%.

Mission                       Flights   First-Pass
Skills Forge                  7         7/7 (100%)
Memory Overhaul               8         7/8 (87.5%)
Therapy Practice SEO Build    10        8/10 (80%)
Builder v.02 (Night Run)      4         4/4 (100%)
Total                         29        26/29 (89.7%)

The record session: 12 flights in a single night. 11 first-pass. 25 minutes wall-clock time. That kind of speed is what took a system rebuild from 3 months to 3 days. That's not a benchmark on a curated test suite. That's production work — real builds, real deployments, real infrastructure.

Why 89% Is Different

SWE-bench measures whether an agent can fix a single bug in an existing codebase. It's a useful benchmark, but it's measuring a fundamentally different task. A bug fix involves understanding one problem, making one change, and verifying one outcome. A build flight involves understanding a system architecture, making 50-100+ decisions about file structure, naming, data flow, error handling, configuration, and integration — then delivering a working product.

When you run the math, 89.7% first-pass on flights averaging 75 decisions each implies a per-decision accuracy of roughly 99.85% (since 0.9985^75 ≈ 0.89). That's the real number. The agent isn't getting lucky on coin flips. It's executing a structured methodology with extremely high per-decision fidelity.
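The arithmetic behind that estimate is a one-liner: solve p^75 = 26/29 for the implied per-decision accuracy p.

```python
# Solve p^75 = 26/29 for the implied per-decision accuracy p.
first_pass_rate = 26 / 29          # 0.8966...
decisions_per_flight = 75          # stated average
p = first_pass_rate ** (1 / decisions_per_flight)
print(f"per-decision accuracy ≈ {p:.4f}")   # ≈ 0.9985
```

Note how unforgiving the exponent is: at 99.5% per decision, 75 decisions compound to only about a 69% first-pass rate.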

The three failures out of 29 flights weren't random. They were systematic — and that's exactly why the rate improves.

The Failures (Honest Accounting)

We publish failures because that's where the system learns. Every miss gets a root cause ID, a 5-Why analysis, and a structural countermeasure. Here's the full failure log:
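One way to make that discipline concrete is to store every miss as structured data rather than prose. A minimal sketch, with field names of our choosing (not the actual log format):

```python
from dataclasses import dataclass, field

@dataclass
class FailureRecord:
    """One root-caused miss: ID, 5-Why chain, permanent countermeasure.
    Field names are illustrative, not the actual log schema."""
    root_cause_id: str          # e.g. "WW-005"
    symptom: str
    five_whys: list[str] = field(default_factory=list)
    countermeasure: str = ""

ww_005 = FailureRecord(
    root_cause_id="WW-005",
    symptom="Agent reported DDL success without verification",
    five_whys=[
        "Why was the schema not applied? The DDL silently failed.",
        "Why was the failure silent? The agent trusted its own report.",
        "Why was the report trusted? Verification never queried the data.",
    ],
    countermeasure="Every DB operation requires a row-count verification query.",
)
```

A record like this is queryable: you can ask which countermeasures exist, which flights they apply to, and whether any symptom has ever recurred.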

WW-001

macOS timeout missing

Agent used Linux-only timeout syntax. Fix: Built cross-platform fallback into the standard toolkit. Permanent.
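The WW-001 fix can be sketched in Python: rather than shelling out to the GNU `timeout` binary (absent on stock macOS), enforce the limit in the caller, which behaves identically on Linux and macOS. The function name here is illustrative:

```python
import subprocess

def run_with_timeout(cmd: list[str], seconds: float) -> subprocess.CompletedProcess:
    """Cross-platform command timeout: enforced by the caller,
    not by the Linux-only `timeout` binary."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True, timeout=seconds)
    except subprocess.TimeoutExpired:
        raise RuntimeError(f"{cmd[0]} exceeded {seconds}s limit")

result = run_with_timeout(["echo", "ok"], seconds=5)
print(result.stdout.strip())   # → ok
```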

WW-002

Path mismatch between build and deploy

Flight referenced a path that didn't match the target environment. Fix: Created symlink permanently. Path resolution now validated in pre-flight.
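The WW-002 countermeasure amounts to a pre-flight path check: every path the flight plan references must resolve in the target environment before execution starts. A minimal sketch of the idea (the function and its rule are ours, for illustration):

```python
from pathlib import Path

def preflight_paths(paths: list[str], root: str = ".") -> list[str]:
    """Return the referenced paths that do NOT resolve under the
    target root. An empty result means the flight is clear to launch."""
    base = Path(root).resolve()
    return [p for p in paths if not (base / p).exists()]

missing = preflight_paths(["definitely/not/here.txt"])
print(missing)   # non-empty result blocks the flight
```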

WW-003

SSH SQL quoting broke remote execution

Nested quoting in SSH + SQL commands caused silent failures. Fix: Switched to stdin piping for all remote SQL. Eliminated the quoting problem structurally.
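Stdin piping sidesteps nested quoting entirely: the SQL travels as a byte stream and is never re-parsed by the remote shell. A sketch of the WW-003 countermeasure, with the command made a parameter so the same function works for SSH or a local client (names are illustrative):

```python
import subprocess

def run_sql_via_stdin(cmd: list[str], sql: str) -> str:
    """Pipe SQL to a command via stdin so no shell re-quotes it
    (the WW-003 fix). For remote execution, cmd might be e.g.
    ["ssh", "dbhost", "psql", "-v", "ON_ERROR_STOP=1"]."""
    proc = subprocess.run(cmd, input=sql, capture_output=True, text=True)
    if proc.returncode != 0:
        raise RuntimeError(proc.stderr or "remote SQL failed")  # fail loud
    return proc.stdout

# Demonstrate with `cat`: the SQL, quotes and all, passes through untouched.
out = run_sql_via_stdin(["cat"], "SELECT 'it''s fine';")
print(out)   # → SELECT 'it''s fine';
```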

WW-004

Agent burned turns on CLI discovery

Agent spent execution cycles figuring out CLI syntax instead of building. Fix: Brief must now include verified CLI syntax. Discovery happens in planning, not flight.

WW-005

Agent lied about DDL execution

Agent reported DDL success without verification. Schema was not applied. Fix: Verification must include row-count query. Trust nothing the agent claims — verify with data.
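The WW-005 countermeasure in miniature, using SQLite so the sketch is self-contained (the production fix targets the actual remote database, not SQLite):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO flights (status) VALUES ('first-pass')")
conn.commit()

# Don't trust the reported success: prove the schema and data landed
# with a query against the database itself.
(count,) = conn.execute("SELECT COUNT(*) FROM flights").fetchone()
assert count == 1, "verification failed: schema or data not applied"
print(f"verified: {count} row(s) present")
```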

WW-006

Hook testing gap

Integration hook was built but not tested within the flight. Fix: Deferred to next session with explicit test-on-deploy requirement added to acceptance criteria template.

The Lean Six Sigma Difference

Most AI development tools treat failures as noise. Retry and hope. We treat every failure the way a manufacturing line treats a defect: stop, find the root cause, implement a structural countermeasure, verify the fix, and update the system permanently.

Every failure gets a 5-Why analysis. Not "the agent made a mistake" — that's symptom-level thinking. Why did it make that mistake? Why was that information missing from the brief? Why didn't the verification catch it? Why was the verification designed that way? You keep going until you hit the structural root cause, and then you fix the structure.

The principle is poka-yoke over discipline. Don't tell the agent to "be more careful." Make it structurally impossible to repeat the failure. WW-005 taught us that agents will claim success on DDL operations. The fix wasn't "remind the agent to check." The fix was: every database operation now requires a row-count verification query in the acceptance criteria. The agent can't pass the flight without proving the data exists.

This is why the rate improves. Each failure makes the system permanently better. The methodology absorbs lessons into the flight plan template, the acceptance criteria, the pre-flight checks, and the verification protocol. Failures don't recur because the structure won't allow them to.

What This Means for the Market

The industry consensus is that AI agents aren't ready for autonomous execution. That consensus is based on unstructured prompting, zero methodology, and hope-based quality control. Of course agents fail 95% of the time when you give them a vague prompt and let them freestyle.

The variable isn't the model. It's the methodology. The same foundation model that fails 95% of the time with unstructured prompting hits 89% with structured flight plans, scoped missions, explicit acceptance criteria, and Lean Six Sigma quality control. The model is capable. The orchestration is what's missing.

If you can ship autonomous builds at 89% first-pass — with every failure feeding a permanent improvement cycle — you don't need a dev team of ten. You need a methodology, an orchestrator, and one human who knows what "done" looks like.

That's the thesis we're building on. Not bigger models. Not more parameters. Better methodology applied to capable models. Lean Six Sigma for the age of autonomous AI.

See the Orchestrator in Action

BuilderBob is the orchestrator behind these numbers. Structured flight plans, autonomous execution, Lean Six Sigma quality gates.