Development Analysis · Feb 3–17, 2026

106 PRs in 14 Days. One Architect.

Output equivalent to a 3–4 person elite engineering team. 85% of builders completed fully autonomously. 20 pre-merge bugs caught by multi-agent review.

Data sources: 26 review files, 106 merged PRs, 105 closed issues, 801 commits, consult stats

106
PRs merged
85%
Fully autonomous
57m
Median build time
20
Pre-merge bugs caught

Key Metrics

53/week
PRs merged per week
$1.59
Consultation cost per PR
3.4x
Return on investment
100%
Context recovery success rate

Team Equivalent

By PR volume alone, one architect with autonomous builders matched about 11 developers at the industry elite benchmark and about 19 at the median benchmark.

PRs merged / week (Codev) | 53
PRs / dev / week (elite benchmark) | 5.0
PRs / dev / week (median benchmark) | 2.8
Developer-equivalents (elite) | ~11
SPIR features only (conservative) | 3–4 elite devs

Sources: LinearB 2026 (8.1M PRs), Worklytics 2025, byteiota 2026. Caveats: solo codebase, no cross-team coordination overhead, single TypeScript project.
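The developer-equivalent figures follow from dividing weekly PR throughput by each per-developer benchmark. A minimal sketch of that arithmetic, using only the numbers quoted above (the function and variable names are illustrative, not part of Codev):

```typescript
// Developer-equivalents = team throughput / per-developer benchmark throughput.
function developerEquivalents(
  teamPrsPerWeek: number,
  benchmarkPrsPerDevPerWeek: number,
): number {
  if (benchmarkPrsPerDevPerWeek <= 0) {
    throw new RangeError("benchmark must be positive");
  }
  return teamPrsPerWeek / benchmarkPrsPerDevPerWeek;
}

const prsMerged = 106; // merged PRs in the period
const weeks = 2; // Feb 3–17, treated as two weeks
const prsPerWeek = prsMerged / weeks; // 53

const vsElite = developerEquivalents(prsPerWeek, 5.0); // 10.6
const vsMedian = developerEquivalents(prsPerWeek, 2.8); // about 18.9

console.log(`${Math.round(vsElite)} to ${Math.round(vsMedian)} developer-equivalents`);
```

The ratio measures PR count only; it says nothing about PR size or difficulty, which is why the caveats above matter.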

Autonomous Builder Performance

22 of 26 builders (85%) completed fully autonomously. The 4 interventions were caused by infrastructure issues, not builder capability gaps.

22/26
Fully autonomous (no intervention)
26/26
Reached PR creation (100%)

The 4 Interventions

0102, 0103 | Pre-existing broken tests blocked porch advancement despite correct implementations
0104 | Claude consultation timeouts on a 3,700-line file (motivated Spec 0105: Server Decomposition)
0106 | Merge artifacts from 40+ file rename (builder detected and self-resolved)

Context Recovery: 100% Success Rate

4 specs exhausted context windows (all had 40+ files or 5+ phases). Every recovery succeeded via porch status.yaml + git history. No builder had to redo completed work.

Spec | Files | Context Windows
0104 | 74 | 4 compactions
0105 | 28 (7 phases) | 2
0112 | 124 | 3
0126 | 80+ | 3

Remaining 22 projects completed within a single context window.
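The recovery idea can be sketched as "resume at the first phase not yet marked complete". The report does not show the real status.yaml schema, so the `PhaseState` shape and the function below are hypothetical, and the sketch assumes the file has already been parsed into objects:

```typescript
// Hypothetical shape of parsed porch state; the real schema is not shown in this report.
interface PhaseState {
  name: string;
  completed: boolean;
}

// Returns the first phase that still needs work, or undefined if every phase is done.
function nextPhaseToResume(phases: PhaseState[]): PhaseState | undefined {
  return phases.find((phase) => !phase.completed);
}

// Example state as it might look after a context window was exhausted mid-spec.
const recovered: PhaseState[] = [
  { name: "phase-1", completed: true },
  { name: "phase-2", completed: true },
  { name: "phase-3", completed: false },
  { name: "phase-4", completed: false },
];

const resumeAt = nextPhaseToResume(recovered);
console.log(resumeAt?.name); // phase-3
```

Because completed phases are recorded outside the model's context, a fresh context can skip them rather than redo them.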

How Fast Builders Ship

Median autonomous implementation time: 57 minutes from first commit to PR creation.

57m
Median (SPIR features)
38h 12m
Total autonomous time
13/24
Completed in <60 min

Bugfix Pipeline

Issue filed → builder spawned → fix implemented → 3-way review → merged. Fully automated end-to-end.

66%
Ship in under 30 min
13m
Median PR→merge time
59
Bugfix PRs merged

Multi-Agent Review Catches

20 pre-merge bugs caught: 1 security-critical, 8 runtime failures, 11 quality/completeness gaps.

Security-Critical

Socket permissions gap

Codex · Spec 0104

Shellper Unix socket created without restrictive permissions (0600). Any local user could connect.
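A minimal sketch of the kind of fix this catch implies, assuming a Node.js server on a POSIX system. The function names are illustrative and this is not the actual Shellper code:

```typescript
import * as fs from "node:fs";
import * as net from "node:net";

// True when no group/other permission bits are set (for example 0600).
function isOwnerOnly(mode: number): boolean {
  return (mode & 0o077) === 0;
}

// Listen on a Unix socket, then restrict it to the owning user.
function listenOwnerOnly(server: net.Server, socketPath: string): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    server.once("error", reject);
    server.listen(socketPath, () => {
      try {
        // Strip group/other access so other local users cannot connect.
        fs.chmodSync(socketPath, 0o600);
        if (!isOwnerOnly(fs.statSync(socketPath).mode)) {
          throw new Error(`socket ${socketPath} is not owner-only`);
        }
        resolve();
      } catch (err) {
        server.close();
        reject(err);
      }
    });
  });
}
```

There is a short window between bind and chmod in this sketch; creating the socket inside a directory that is already mode 0700 is one way to close it.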

Runtime Failures (8)

Catch | Spec | Reviewer | Description
Startup race condition | 0105 | Codex + Gemini | getInstances() returns [] before initInstances() completes, breaking dashboard on startup
body.name truthiness bug | 0107 | Codex | { name: "" } treated as reconnect instead of validation error
Nonce placement error | 0107 | Claude | OAuth nonce placed on authUrl instead of callback, breaking CSRF protection
Pong timeout not armed | 0109 | Codex | Dead WebSocket connections never cleaned up when ws.ping() throws
stderrClosed value-copy bug | 0113 | Claude | Boolean was a local copy, never updating the session object
Zero-padded spec matching | 0126 | Codex | getProjectSummary() failed to match spec files with leading zeros
Bugfix regex extraction | 0126 | Codex | Issue number extraction didn't handle all bugfix branch naming patterns
Missing workspacePath | 0126 | Claude | Critical: Dashboard would have been completely non-functional for workspace views
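The body.name truthiness bug is a common TypeScript pitfall: a truthiness test cannot tell an empty string from a missing field. The sketch below illustrates the pattern under assumed semantics (a missing name means reconnect, a non-empty name means create). The types and function names are hypothetical and not taken from the Codev source:

```typescript
interface RequestBody {
  name?: string;
}

type Outcome = "create" | "reconnect" | "validation-error";

// Buggy: "" is falsy, so an empty name falls through to the reconnect path.
function classifyBuggy(body: RequestBody): Outcome {
  return body.name ? "create" : "reconnect";
}

// Fixed: check presence explicitly, then validate the value.
function classifyFixed(body: RequestBody): Outcome {
  if (body.name === undefined) return "reconnect";
  if (body.name.trim() === "") return "validation-error";
  return "create";
}

console.log(classifyBuggy({ name: "" })); // reconnect
console.log(classifyFixed({ name: "" })); // validation-error
console.log(classifyFixed({})); // reconnect
```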

Quality & Completeness (11)

Catch | Spec | Reviewer
gate-status.ts deletion prevented (would break build) | 0108 | Gemini
Gate transition vs re-request spam | 0108 | Codex
buildWorktreeLaunchScript side effect | 0105 | Claude
tower.html rename miss (124-file rename) | 0112 | Gemini
Stale StatusPanel test assertions | 0112 | Gemini
Pre-HELLO gating (unauthenticated socket access) | 0118 | Codex
Workspace scoping for sockets | 0118 | Codex
Unnecessary templates.test.ts change prevented | 0111 | Claude
ReconnectRestartOptions missing | 0104 | Codex
Documentation gaps (INSTALL, MIGRATION, DEPENDENCIES) | 0104 | Gemini
Secondary race path in /api/state | BF #274 | Codex

Reviewer Effectiveness

No single reviewer caught all bugs. Each model's blind spot is another's strength.

Codex
11 unique catches · ~25% FP rate

Security edge cases, test completeness, exhaustive sweeps. Found the socket permissions gap, truthiness bug, and secondary race path that others missed.

Claude
5 catches · ~8% FP rate

Line-by-line traceability, type safety, critical runtime bugs. Caught the two most critical runtime issues: missing workspacePath (would have broken entire dashboard) and stderrClosed value-copy bug.

Gemini
4 catches · ~3% FP rate

Architecture, documentation, build-breaking deletions. Caught the gate-status.ts deletion that would have broken the build and the tower.html rename miss in a 124-file change.

What Escaped Despite Review

16 post-merge escapes (8 code defects, 8 design gaps) show where multi-agent review reaches its limits.

Issue | Description | Why Review Missed It
#294 | Shellper process leak | Resource lifecycle across process boundaries
#313 | Terminal unresponsive under backpressure | Flow control behavior under load
#324 | Processes don't survive restart | Pipe FD lifecycle dependency
#319 | Duplicate notifications | Event ordering in async pipeline
#335 | Notify before review completes | Race between notification and consultation
#336 | Worktree changes leak via CWD | Side effect invisible in diff review
#342 | Consult subprocesses never exit | Process cleanup in SDK teardown
#341 | Orphaned processes accumulate | Resource lifecycle across sessions

Key Pattern

5 of 8 code-defect escapes were process lifecycle issues (shellper leaks, orphaned processes, pipe FD dependencies). These are fundamentally hard to catch via static code review — they require runtime observation of process behavior over time. Spec 0126 (80+ files, 6 phases) alone produced 7 of 16 total escapes, suggesting review effectiveness degrades with spec complexity.

Cost & ROI

Model | Invocations | Avg Duration | Cost | Success Rate
Claude | 2,291 | 8s | $96.69 | 84%
Codex | 613 | 21s | $70.81 | 63%
Gemini | 211 | 64s | $1.14* | 98%
Total | 3,115 | 12.2h | $168.64* | 81%

*Gemini cost tracking was broken during this period (Bug #374). Actual Gemini costs likely 10–50x higher. Corrected estimate: $180–$225 total.
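The corrected range can be reproduced by scaling only the tracked Gemini cost and leaving the other two models unchanged. A small sketch of that arithmetic, using the tracked figures from the table (the names are illustrative):

```typescript
// Tracked consultation costs in USD, as reported for the period.
const tracked = { claude: 96.69, codex: 70.81, gemini: 1.14 };

// Total cost if actual Gemini spend was `multiplier` times the tracked value.
function correctedTotal(multiplier: number): number {
  return tracked.claude + tracked.codex + tracked.gemini * multiplier;
}

const lowEstimate = correctedTotal(10); // about 178.90
const highEstimate = correctedTotal(50); // about 224.50

console.log(`$${lowEstimate.toFixed(2)} to $${highEstimate.toFixed(2)}`);
```

Rounded, these bounds give the $180–$225 range quoted above.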

Category | Hours
Savings: Security catches (1) | ~10h
Savings: Runtime failures (8) | ~12h
Savings: Quality/completeness (11) | ~11h
Total Savings | ~33h
Overhead: False positive iterations (~36 × 5 min) | ~3.0h
Overhead: Consultation wait time (~200 rounds × 2 min) | ~6.7h
Total Overhead | ~9.7h
Net Value | ~23.3 hours
ROI | ~3.4x

Conservative floor (halving security estimate): ~28h saved, ~18.3h net, 2.9x ROI. Cost efficiency: $8.43 per catch, $5.11 per hour of engineering time saved.
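The ROI figures above are hours saved divided by hours of overhead. A minimal sketch that reproduces both the headline and the conservative numbers from the estimates in this section (function and variable names are illustrative):

```typescript
function sum(values: number[]): number {
  return values.reduce((total, v) => total + v, 0);
}

// ROI here means hours saved divided by hours of overhead.
function roi(savingsHours: number[], overheadHours: number[]) {
  const saved = sum(savingsHours);
  const overhead = sum(overheadHours);
  return { saved, overhead, net: saved - overhead, ratio: saved / overhead };
}

// Overhead: ~36 false-positive iterations at 5 min, ~200 consultation rounds at 2 min.
const overheadHours = [(36 * 5) / 60, (200 * 2) / 60];

const headline = roi([10, 12, 11], overheadHours); // security, runtime, quality
const conservative = roi([5, 12, 11], overheadHours); // security estimate halved

console.log(headline.net.toFixed(1), headline.ratio.toFixed(1)); // 23.3 3.4
console.log(conservative.net.toFixed(1), conservative.ratio.toFixed(1)); // 18.3 2.9

// Cost efficiency from the tracked $168.64 total.
const costPerCatch = 168.64 / 20; // about 8.43
const costPerHourSaved = 168.64 / headline.saved; // about 5.11
console.log(costPerCatch.toFixed(2), costPerHourSaved.toFixed(2));
```

The savings inputs are estimates, so the ratio is only as reliable as the hours assigned to each catch category.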

Raw Output

801
Non-merge commits
2,698
Files changed
+95K
Net lines of code
PR Type | Count | Additions | Deletions | Files
SPIR (feature) | 30 | +54,049 | -22,049 | 614
Bugfix | 59 | +11,896 | -7,407 | 410
Other (maintenance, docs) | 17 | +15,316 | -27,310 | 447
Total | 106 | +81,261 | -56,766 | 1,471

Test Suite Growth

~845
Tests at period start
~1,368
Tests at period end
+523
Net test growth

What's Working

  • 85% fully autonomous builder completion
  • 100% context recovery via porch state
  • Phase-gated review catches bugs early
  • 66% of bugfixes ship in under 30 minutes
  • Reviewer complementarity — no single model catches everything

What Needs Improvement

  • Codex reads main instead of builder worktree (8+ wasted iterations)
  • Review effectiveness degrades above ~40 files
  • Process lifecycle bugs escape static review
  • No supermajority override when 2/3 reviewers approve
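One way the missing supermajority override could work is sketched below. This is a hypothetical policy, not existing Codev behavior; the `Verdict` type and function name are assumptions made for illustration:

```typescript
type Verdict = "approve" | "request-changes";

// True when at least two thirds of reviewers approve.
// Integer arithmetic avoids floating-point comparison at the exact 2/3 boundary.
function hasSupermajority(verdicts: Verdict[]): boolean {
  if (verdicts.length === 0) return false;
  const approvals = verdicts.filter((v) => v === "approve").length;
  return approvals * 3 >= verdicts.length * 2;
}

const twoOfThree = hasSupermajority(["approve", "approve", "request-changes"]);
const oneOfThree = hasSupermajority(["approve", "request-changes", "request-changes"]);

console.log(twoOfThree, oneOfThree); // true false
```

Given that each reviewer caught bugs the others missed, any such override would likely need an exception for security-class findings raised by the dissenting reviewer.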

Methodology

All claims are backed by specific PR numbers, commit hashes, review file citations, or consult stats output. Data covers 26 SPIR/bugfix projects (Specs 0102–0127, 0350, 0364, bugfix-274, bugfix-324), 106 merged PRs, 105 closed issues, and 801 non-merge commits over Feb 3–17, 2026. Industry benchmarks from LinearB 2026 (8.1M PRs), Worklytics 2025, byteiota 2026. Full source reviews are available in the Codev repository.
