Development Analysis · Feb 3–17, 2026
106 PRs in 14 Days. One Architect.
Output equivalent to a 3–4 person elite engineering team. 85% of builders completed fully autonomously. 20 pre-merge bugs caught by multi-agent review.
Data sources: 26 review files, 106 merged PRs, 105 closed issues, 801 commits, consult stats
Key Metrics
Team Equivalent
By PR volume alone, one architect with autonomous builders matched 11–19 individual developers against industry elite benchmarks.
Sources: LinearB 2026 (8.1M PRs), Worklytics 2025, byteiota 2026. Caveats: solo codebase, no cross-team coordination overhead, single TypeScript project.
Autonomous Builder Performance
22 of 26 builders (85%) completed fully autonomously. The 4 interventions were caused by infrastructure issues, not builder capability gaps.
The 4 Interventions
Context Recovery: 100% Success Rate
4 specs exhausted context windows (all had 40+ files or 5+ phases).
Every recovery succeeded via porch status.yaml + git history. No builder had to redo completed work.
| Spec | Files | Context Windows |
|---|---|---|
| 0104 | 74 | 4 compactions |
| 0105 | 28 (7 phases) | 2 |
| 0112 | 124 | 3 |
| 0126 | 80+ | 3 |
The remaining 22 projects completed within a single context window.
How Fast Builders Ship
Median autonomous implementation time: 57 minutes from first commit to PR creation.
Bugfix Pipeline
Issue filed → builder spawned → fix implemented → 3-way review → merged. Fully automated end-to-end.
Multi-Agent Review Catches
20 pre-merge bugs caught: 1 security-critical, 8 runtime failures, 11 quality/completeness gaps.
Security-Critical
Socket permissions gap
Codex · Spec 0104
The Shellper Unix socket was created without restrictive permissions (it should have been 0600), so any local user could connect.
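The check behind the socket permissions catch can be sketched as a bit-mask test. This is a minimal illustration, not Shellper's code: `isOwnerOnly` and `permissionBits` are hypothetical helpers, and the mode is assumed to come from a stat call such as Node's `fs.statSync(socketPath).mode`.

```typescript
// Hypothetical helpers, not taken from the Shellper source.

// Keep only the nine permission bits; a stat mode also carries file-type bits.
function permissionBits(mode: number): number {
  return mode & 0o777;
}

// True when neither group nor others have any access.
// 0o077 selects the group and other read/write/execute bits.
function isOwnerOnly(mode: number): boolean {
  return (permissionBits(mode) & 0o077) === 0;
}

const restrictive = 0o600; // owner read/write only, the mode the review asked for
const wideOpen = 0o777;    // example of a mode that lets every local user in

console.log(isOwnerOnly(restrictive), isOwnerOnly(wideOpen));
```

A startup check along these lines could refuse to serve when the socket file is accessible to anyone other than its owner.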
Runtime Failures (8)
| Catch | Spec | Reviewer | Description |
|---|---|---|---|
| Startup race condition | 0105 | Codex + Gemini | getInstances() returns [] before initInstances() completes, breaking dashboard on startup |
| body.name truthiness bug | 0107 | Codex | { name: "" } treated as reconnect instead of validation error |
| Nonce placement error | 0107 | Claude | OAuth nonce placed on authUrl instead of callback, breaking CSRF protection |
| Pong timeout not armed | 0109 | Codex | Dead WebSocket connections never cleaned up when ws.ping() throws |
| stderrClosed value-copy bug | 0113 | Claude | Boolean was a local copy, never updating the session object |
| Zero-padded spec matching | 0126 | Codex | getProjectSummary() failed to match spec files with leading zeros |
| Bugfix regex extraction | 0126 | Codex | Issue number extraction didn't handle all bugfix branch naming patterns |
| Missing workspacePath | 0126 | Claude | Critical: Dashboard would have been completely non-functional for workspace views |
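The `body.name` truthiness bug in the table is a common pattern worth spelling out. The sketch below is an assumption about its general shape, not the code from Spec 0107: `CreateBody`, `decideBuggy`, and `decideStrict` are hypothetical names.

```typescript
// Hypothetical request body and decision types for illustration only.
interface CreateBody {
  name?: string;
}

type Decision = "reconnect" | "create" | "invalid";

// Buggy shape: a truthiness test cannot tell "" apart from a missing name,
// so { name: "" } falls into the reconnect branch.
function decideBuggy(body: CreateBody): Decision {
  return body.name ? "create" : "reconnect";
}

// Stricter shape: separate "absent" from "present but empty".
function decideStrict(body: CreateBody): Decision {
  if (body.name === undefined) return "reconnect";
  if (body.name.trim() === "") return "invalid";
  return "create";
}

console.log(decideBuggy({ name: "" }), decideStrict({ name: "" }));
```

The explicit `undefined` comparison is what lets the empty string reach a validation branch instead of being mistaken for a reconnect.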
Quality & Completeness (11)
| Catch | Spec | Reviewer |
|---|---|---|
| gate-status.ts deletion prevented (would break build) | 0108 | Gemini |
| Gate transition vs re-request spam | 0108 | Codex |
| buildWorktreeLaunchScript side effect | 0105 | Claude |
| tower.html rename miss (124-file rename) | 0112 | Gemini |
| Stale StatusPanel test assertions | 0112 | Gemini |
| Pre-HELLO gating (unauthenticated socket access) | 0118 | Codex |
| Workspace scoping for sockets | 0118 | Codex |
| Unnecessary templates.test.ts change prevented | 0111 | Claude |
| ReconnectRestartOptions missing | 0104 | Codex |
| Documentation gaps (INSTALL, MIGRATION, DEPENDENCIES) | 0104 | Gemini |
| Secondary race path in /api/state | BF #274 | Codex |
Reviewer Effectiveness
No single reviewer caught all bugs. Each model's blind spot is another's strength.
Codex: Security edge cases, test completeness, exhaustive sweeps. Found the socket permissions gap, truthiness bug, and secondary race path that others missed.
Claude: Line-by-line traceability, type safety, critical runtime bugs. Caught the two most critical issues: missing workspacePath (would have broken the entire dashboard) and the stderrClosed value-copy bug.
Gemini: Architecture, documentation, build-breaking deletions. Caught the gate-status.ts deletion that would have broken the build and the tower.html rename miss in a 124-file change.
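The stderrClosed value-copy bug mentioned above follows from how primitives are copied in JavaScript and TypeScript. The sketch below assumes a simplified `Session` shape and hypothetical function names; it is not the code from Spec 0113.

```typescript
// Hypothetical session shape for illustration only.
interface Session {
  stderrClosed: boolean;
}

// Buggy shape: booleans are copied by value, so writing to the local
// variable never reaches the session object.
function markClosedBuggy(session: Session): void {
  let stderrClosed = session.stderrClosed; // copies the current value
  stderrClosed = true;                     // updates only the local copy
  void stderrClosed;
}

// Fixed shape: write through the object reference.
function markClosedFixed(session: Session): void {
  session.stderrClosed = true;
}

const a: Session = { stderrClosed: false };
const b: Session = { stderrClosed: false };
markClosedBuggy(a);
markClosedFixed(b);
console.log(a.stderrClosed, b.stderrClosed);
```

The buggy version type-checks, so catching it depends on a reviewer tracing where the value is actually stored.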
What Escaped Despite Review
16 post-merge escapes (8 code defects, 8 design gaps). Understanding the boundaries of multi-agent review.
| Issue | Description | Why Review Missed It |
|---|---|---|
| #294 | Shellper process leak | Resource lifecycle across process boundaries |
| #313 | Terminal unresponsive under backpressure | Flow control behavior under load |
| #324 | Processes don't survive restart | Pipe FD lifecycle dependency |
| #319 | Duplicate notifications | Event ordering in async pipeline |
| #335 | Notify before review completes | Race between notification and consultation |
| #336 | Worktree changes leak via CWD | Side effect invisible in diff review |
| #342 | Consult subprocesses never exit | Process cleanup in SDK teardown |
| #341 | Orphaned processes accumulate | Resource lifecycle across sessions |
Key Pattern
5 of 8 code-defect escapes were process lifecycle issues (shellper leaks, orphaned processes, pipe FD dependencies). These are fundamentally hard to catch via static code review because they only show up when process behavior is observed at runtime, over time. Spec 0126 (80+ files, 6 phases) alone produced 7 of the 16 total escapes, suggesting that review effectiveness degrades with spec complexity.
Cost & ROI
| Model | Invocations | Avg Duration | Cost | Success Rate |
|---|---|---|---|---|
| Claude | 2,291 | 8s | $96.69 | 84% |
| Codex | 613 | 21s | $70.81 | 63% |
| Gemini | 211 | 64s | $1.14* | 98% |
| Total | 3,115 | 12.2h (total) | $168.64* | 81% |
*Gemini cost tracking was broken during this period (Bug #374). Actual Gemini costs likely 10–50x higher. Corrected estimate: $180–$225 total.
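The totals row can be cross-checked from the per-model rows. The sketch below assumes the 81% figure is an invocation-weighted average, which the source does not state but which matches the numbers; the `ModelStats` type is illustrative.

```typescript
// Per-model figures copied from the table; the type name is illustrative.
interface ModelStats {
  model: string;
  invocations: number;
  costUsd: number;
  successRate: number; // fraction between 0 and 1
}

const stats: ModelStats[] = [
  { model: "Claude", invocations: 2291, costUsd: 96.69, successRate: 0.84 },
  { model: "Codex", invocations: 613, costUsd: 70.81, successRate: 0.63 },
  { model: "Gemini", invocations: 211, costUsd: 1.14, successRate: 0.98 },
];

const totalInvocations = stats.reduce((sum, s) => sum + s.invocations, 0);
const totalCostUsd = stats.reduce((sum, s) => sum + s.costUsd, 0);

// Weight each model's success rate by how often it was invoked.
const weightedSuccess =
  stats.reduce((sum, s) => sum + s.successRate * s.invocations, 0) / totalInvocations;

console.log(totalInvocations, totalCostUsd.toFixed(2), Math.round(weightedSuccess * 100));
```

Because Claude accounts for most invocations, its 84% rate dominates the combined figure.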
| Category | Hours |
|---|---|
| Savings: Security catches (1) | ~10h |
| Savings: Runtime failures (8) | ~12h |
| Savings: Quality/completeness (11) | ~11h |
| Total Savings | ~33h |
| Overhead: False positive iterations (~36 × 5 min) | ~3.0h |
| Overhead: Consultation wait time (~200 rounds × 2 min) | ~6.7h |
| Total Overhead | ~9.7h |
| Net Value | ~23.3 hours |
| ROI | ~3.4x |
Conservative floor (halving security estimate): ~28h saved, ~18.3h net, 2.9x ROI. Cost efficiency: $8.43 per catch, $5.11 per hour of engineering time saved.
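The ROI figures above can be reproduced with a few lines of arithmetic. The sketch assumes ROI is defined as hours saved divided by overhead hours, which is consistent with the reported 3.4x and 2.9x; all variable names are illustrative.

```typescript
// Hour estimates and counts copied from the tables above.
const savings = { security: 10, runtime: 12, quality: 11 };
const falsePositiveHours = (36 * 5) / 60;    // ~36 iterations at 5 min each
const consultationHours = (200 * 2) / 60;    // ~200 rounds at 2 min each

const totalSavings = savings.security + savings.runtime + savings.quality;
const totalOverhead = falsePositiveHours + consultationHours;
const netValue = totalSavings - totalOverhead;
const roi = totalSavings / totalOverhead; // assumed definition: saved / overhead

// Conservative floor: halve the security estimate only.
const floorSavings = savings.security / 2 + savings.runtime + savings.quality;
const floorNet = floorSavings - totalOverhead;
const floorRoi = floorSavings / totalOverhead;

// Cost efficiency using the tracked (uncorrected) total of $168.64.
const trackedCostUsd = 168.64;
const costPerCatch = trackedCostUsd / 20;
const costPerHourSaved = trackedCostUsd / totalSavings;

console.log(netValue.toFixed(1), roi.toFixed(1), floorNet.toFixed(1), floorRoi.toFixed(1));
console.log(costPerCatch.toFixed(2), costPerHourSaved.toFixed(2));
```

Note that the cost-efficiency figures use the tracked total; substituting the corrected $180–$225 estimate would raise both per-catch and per-hour costs.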
Raw Output
| PR Type | Count | Additions | Deletions | Files |
|---|---|---|---|---|
| SPIR (feature) | 30 | +54,049 | -22,049 | 614 |
| Bugfix | 59 | +11,896 | -7,407 | 410 |
| Other (maintenance, docs) | 17 | +15,316 | -27,310 | 447 |
| Total | 106 | +81,261 | -56,766 | 1,471 |
Test Suite Growth
What's Working
- 85% fully autonomous builder completion
- 100% context recovery via porch state
- Phase-gated review catches bugs early
- 66% of bugfixes ship in under 30 minutes
- Reviewer complementarity — no single model catches everything
What Needs Improvement
- Codex reads main instead of builder worktree (8+ wasted iterations)
- Review effectiveness degrades above ~40 files
- Process lifecycle bugs escape static review
- No supermajority override when 2/3 reviewers approve
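The missing supermajority override from the list above could take a shape like the following. This is a hypothetical sketch only: the `Verdict` type, the `supermajorityApproves` function, and the default of two approvals are assumptions, not part of the existing review pipeline.

```typescript
// Hypothetical verdict type; the real review output format is not shown in this report.
type Verdict = "approve" | "request_changes";

// Pass when at least `minApprovals` reviewers approve (2 of 3 by default).
function supermajorityApproves(verdicts: Verdict[], minApprovals = 2): boolean {
  const approvals = verdicts.filter((v) => v === "approve").length;
  return approvals >= minApprovals;
}

console.log(supermajorityApproves(["approve", "approve", "request_changes"]));
console.log(supermajorityApproves(["approve", "request_changes", "request_changes"]));
```

A real rule would likely need an exception for security-critical findings, since a lone dissenting reviewer was the only one to catch the socket permissions gap.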
Methodology
All claims are backed by specific PR numbers, commit hashes, review file citations, or consult stats output.
Data covers 26 SPIR/bugfix projects (Specs 0102–0127, 0350, 0364, bugfix-274, bugfix-324), 106 merged PRs, 105 closed issues, and 801 non-merge commits over Feb 3–17, 2026. Industry benchmarks from LinearB 2026 (8.1M PRs), Worklytics 2025, byteiota 2026.
Full source reviews are available in the Codev repository.