The AI Lab

Engineering Environment — Rung 5 (Dark Factory)

The AI Lab is a parallel operating environment targeting the highest rung of AI-driven development. It hosts greenfield projects and serves as the proving ground for practices, agents, and workflows that will later migrate to the rest of engineering.

The working model is non-interactive development: specs and scenarios drive autonomous agents that write, test, and iterate the code. The human defines. AI executes.

The Lab operates outside of engineering's standard operating procedures. It has its own rules.


Two Maturity Scales

This document uses two distinct scales. Conflating them creates ambiguity — they measure different things.

Organizational Scale (Levels 1-3)

Defined in the reference framework, it applies across the entire group — engineering, marketing, sales, finance, customer service.

  Level   Name            Summary
  1       AI-Assisted     AI is a tool that individuals choose to use
  2       AI-Integrated   AI is integrated into workflows and systems
  3       AI-Native       The organization is designed around AI as a first-class resource

Engineering Scale (Rungs 0-5)

Based on Dan Shapiro's framework, it describes the specific progression of software development — from assisted coding all the way to the dark factory.

  Rung                   Human's role                                 Who writes the code   Who reviews the code
  0 — Autocomplete       Human codes, AI suggests the next line       Human                 Human
  1 — Intern             Human assigns discrete, scoped tasks         AI                    Human (everything)
  2 — Junior developer   Human supervises multi-file changes          AI                    Human (everything)
  3 — Manager            Human directs, reviews at feature/PR level   AI                    Human (PR)
  4 — Product manager    Human writes the spec, verifies tests pass   AI                    Nobody (tests verify)
  5 — Dark factory       Spec goes in, software comes out             AI                    Nobody (scenarios verify)

Mapping Between the Two Scales

  Organizational scale      Engineering scale
  Level 1 — AI-Assisted     Rungs 0-1
  Level 2 — AI-Integrated   Rungs 2-3
  Level 3 — AI-Native       Rungs 4-5

The Lab targets Rung 5. Engineering outside the Lab aims for organizational Level 3 (Rungs 4-5).

The hardest transition is the shift from Rung 3 to Rung 4: accepting that you no longer read the code and trusting scenarios to validate the result. It's a psychological change before it's a technical one. Most engineers plateau at Rung 3 because letting go of control over the code goes against all their professional instincts.


1. Absolute Rules

Two rules define the Lab. They are not aspirations — they are conditions of admission.

Code must not be written by humans.

Code must not be reviewed by humans.

The human defines the architecture, constraints, and satisfaction scenarios. AI produces the code, runs the tests, and converges toward the solution. If you're writing or reading code line by line, you're not operating in the Lab's working mode.


2. Project Admission Criteria

Greenfield projects

The Lab's natural terrain. No legacy, no technical debt, no habits. The Lab's rules (Rung 5) apply end-to-end from day one.

Admission criteria:

  • Scope is sufficiently defined to write specs and scenarios
  • The project can tolerate a learning pace

Brownfield projects (transition to Rung 5)

The Lab also takes on existing projects being transitioned to Rung 5. This is harder than greenfield — the code exists, and so do the habits — but this is where the transformation has the most impact, because this is where the majority of engineering work lives.

Admission criteria:

  • The project has sufficient scenario coverage (or the team commits to building it first)
  • The team accepts that all new work (additions, modifications, refactoring, reviews) follows the Lab's rules — no falling back to the traditional mode
  • Existing code is treated as context for the agent, not as untouchable. The agent can refactor, rewrite, and restructure.
  • Regression risk is managed by scenarios, not by human code review

Typical sequence for a brownfield:

  1. Extract the implicit specification — The existing system IS the specification. Nobody ever documented the thousand implicit decisions accumulated over years of patches, hotfixes, and workarounds that became permanent. This extraction is the hardest and most human work in the transition. It requires the people who know why this module has that exception, why this service was split that way, why this value is configured like that. AI can help document what the system does (generate specs from code). But distinguishing intentional behaviors from historical accidents remains a human judgment.
  2. Write end-to-end scenarios that describe the current expected behavior, based on the specification extracted in step 1
  3. Verify that the scenarios pass on the existing code
  4. From that point on, all changes are made by the agent, validated by scenarios
  5. Iterate: each transitioned component increases the project's Rung 5 coverage
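Steps 2 and 3 of the sequence can be sketched in miniature. Everything here is illustrative: `legacy_discount_cents` stands in for a hypothetical brownfield function, and the odd rate for the "legacy" tier is exactly the kind of implicit decision that step 1 must surface before a scenario can pin it.

```python
def legacy_discount_cents(tier: str, amount_cents: int) -> int:
    """Existing production code, warts included (amounts in integer cents)."""
    if tier == "gold":
        return amount_cents * 10 // 100
    if tier == "legacy":  # historical exception nobody documented
        return amount_cents * 15 // 100
    return 0

def scenario_gold_checkout() -> bool:
    """Expectation written from the extracted spec, not from the code."""
    return legacy_discount_cents("gold", 20_000) == 2_000

def scenario_legacy_exception_preserved() -> bool:
    """Step 1 confirmed this exception is intentional, so a scenario pins it."""
    return legacy_discount_cents("legacy", 20_000) == 3_000

# Step 3: the scenarios must pass on the existing code before the agent takes
# over in step 4 — from then on, they are the regression net.
results = {
    "gold_checkout": scenario_gold_checkout(),
    "legacy_exception": scenario_legacy_exception_preserved(),
}
```

Note that the scenarios assert observable outcomes, not implementation details: the agent remains free to refactor the function as long as these journeys stay true.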

What Does NOT Belong in the Lab

  • Any project whose development continues using traditional practices (human writes or reviews code)
  • Projects whose delivery constraints tolerate zero learning risk

Rule: the entry condition isn't the absence of existing code — it's the commitment that all new work follows the Lab's rules.


3. Working Mode: Non-Interactive Development

The Cycle

  1. The human writes the specification (architecture, constraints, scenarios)
  2. The agent produces the code
  3. Scenarios validate the result
  4. The human evaluates satisfaction and iterates on the specification if needed

The human doesn't intervene in execution. The human intervenes in definition and evaluation.
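The cycle above can be sketched as a control loop. All names here are hypothetical stand-ins: the "agent" is simulated by a lookup over the spec, and the scenarios are plain predicates over its output.

```python
def run_agent(spec: dict) -> dict:
    """Stand-in for the coding agent: turns a spec into an artifact (step 2)."""
    return {
        "feature": spec["feature"],
        "handles_empty_input": spec.get("empty_input") == "reject",
    }

def run_scenarios(artifact: dict) -> list[bool]:
    """End-to-end scenarios judge the artifact, not its source code (step 3)."""
    return [
        artifact["feature"] == "csv-import",
        artifact["handles_empty_input"],  # fails until the spec states it
    ]

def cycle(spec: dict, max_iterations: int = 3) -> tuple[dict, int]:
    """The human iterates on the SPEC, never on the code (steps 1 and 4)."""
    for iteration in range(1, max_iterations + 1):
        artifact = run_agent(spec)
        if all(run_scenarios(artifact)):
            return artifact, iteration
        # Step 4: the failure is a gap in the definition, so fix the spec.
        spec = {**spec, "empty_input": "reject"}
    raise RuntimeError("satisfaction not reached: the spec needs rework")

artifact, iterations = cycle({"feature": "csv-import"})
```

The first pass fails because the spec never said what to do with empty input; the fix is a line in the spec, not a line in the code.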

Scenarios vs Tests

  • Tests: validations stored in the code. Vulnerable to gaming by agents — an agent can rewrite a test to make it pass. Useful but insufficient.
  • Scenarios: end-to-end user journeys that describe expected behavior from the user's perspective. Harder to circumvent. The Lab favors scenarios.
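The distinction can be made concrete with a hypothetical `signup` service. The test below pins an internal detail the agent could legitimately rewrite; the scenario pins the user-visible journey, which is what must stay true.

```python
def _hash_password(pw: str) -> str:
    """Internal detail — an agent may replace this wholesale."""
    return "h:" + pw[::-1]

USERS: dict[str, str] = {}

def signup(email: str, pw: str) -> bool:
    """Public behavior: what the user actually experiences."""
    if email in USERS:
        return False
    USERS[email] = _hash_password(pw)
    return True

def login(email: str, pw: str) -> bool:
    return USERS.get(email) == _hash_password(pw)

# Test (code-level): couples to the hash format. It breaks if the agent swaps
# the algorithm — even for a better one — so it invites rewriting the test.
test_result = _hash_password("abc") == "h:cba"

# Scenario (journey-level): sign up, log in with the same credentials, and a
# duplicate signup is refused. True under any correct implementation.
scenario_result = (
    signup("a@example.com", "s3cret")
    and login("a@example.com", "s3cret")
    and not signup("a@example.com", "other")
)
```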

Satisfaction Metric

The Lab doesn't measure success in binary (tests green / red). It measures satisfaction: "across all observed trajectories through all scenarios, what fraction satisfies the user?"

When satisfaction is insufficient, the problem is in the specification, not in the agent. Iterate on the spec, not the code.
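A minimal sketch of the metric, under illustrative assumptions (the trajectory records and the 0.9 threshold are not prescribed by the Lab):

```python
def satisfaction(trajectories: list[dict]) -> float:
    """Fraction of observed trajectories that satisfy the user."""
    if not trajectories:
        return 0.0
    satisfied = sum(1 for t in trajectories if t["satisfied"])
    return satisfied / len(trajectories)

# Observed trajectories across all scenarios, each judged by the user.
observed = [
    {"scenario": "checkout", "satisfied": True},
    {"scenario": "checkout", "satisfied": True},
    {"scenario": "refund",   "satisfied": False},
    {"scenario": "refund",   "satisfied": True},
]

score = satisfaction(observed)        # 3 of 4 trajectories satisfy -> 0.75
needs_spec_iteration = score < 0.9    # insufficient: iterate on the spec
```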

The Critical Skill: Writing Specs for an Agent

The Lab's bottleneck isn't implementation speed — it's spec quality. Writing a spec precise enough for an agent to implement correctly without human intervention is a new skill. Almost nobody has developed it.

The difficulty: when a human receives an ambiguous spec, they fill the gaps with judgment, context, or a Slack message asking "did you mean X or Y?" The agent builds exactly what you described. If the description is ambiguous, the agent fills the gaps with machine assumptions, not customer intuition.

This skill is developed through practice:

  • AI clinics should include spec reviews: "here's my spec, here's what the agent produced, here's what was missing from the spec"
  • Pair sessions should work on specification exercises, not just code exercises
  • Every failed iteration is a signal about the spec, not the agent — document what the spec didn't state clearly enough

The Lab's goal isn't just to produce software through agents. It's to develop engineers who can specify with the rigor that agents demand.


4. Deliberate Naivety

The biggest obstacle to Rung 5 isn't technical — it's habit.

Experienced engineers have deep reflexes: structuring code a certain way, reviewing line by line, writing tests themselves, refactoring manually. These reflexes were strengths in traditional development. In the Lab, they're obstacles.

Deliberate naivety means:

  • Removing traditional development conventions and seeing what holds without them
  • Systematically asking: "Why am I doing this? The model should be doing it instead."
  • Accepting that approaches that seem "naive" or "incorrect" by traditional standards may be correct in an AI-native environment
  • Treating tasks historically deemed too expensive (building full service replicas, writing thousands of scenarios) as routine when AI execution costs make it feasible

The Lab's permanent question:

Why am I doing this? The model should be doing it instead.

If the answer is "because I've always done it that way," that's exactly the reason to change.


5. Support Role

The Lab isn't isolated from the rest of engineering. It serves it.

The Lab produces:

  • Documented working patterns: how to specify for an agent, how to write scenarios, how to evaluate satisfaction
  • Reusable or adaptable agents
  • Concrete proof that Rung 5 works on real projects
  • Honest feedback — what works and what doesn't work yet

The Lab shares through:

  • AI clinics: regular sessions, short format. "Here's what we tried, here's what happened."
  • Documentation: every discovered pattern and anti-pattern is documented
  • Lab / non-Lab pairing: a Lab member temporarily works with a non-Lab engineer to transfer practices

A Lab that doesn't share is useless. Sharing is as important as production.


6. Economic Signal

A concrete signal that the Lab operates at Rung 5: the AI token budget.

If a team in the Lab isn't spending significantly on tokens per day, they probably aren't making agents do the work — they're doing it themselves.

The token budget is an indicator of AI workload, not a goal in itself. But if tokens are near zero, the human is doing the agent's work, and the Lab's absolute rules are not being followed.
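The signal can be turned into a simple monitoring check. The threshold and the spend records below are illustrative assumptions, not a prescribed budget; calibrate against your own model pricing.

```python
# Hypothetical floor: per-engineer daily token spend below this suggests the
# human is doing the agent's work rather than delegating it.
LOW_SPEND_THRESHOLD_USD = 5.0

def flag_low_token_spend(daily_spend_usd: dict[str, float]) -> list[str]:
    """Engineers whose token spend is below the floor, sorted by name."""
    return sorted(
        engineer
        for engineer, usd in daily_spend_usd.items()
        if usd < LOW_SPEND_THRESHOLD_USD
    )

flags = flag_low_token_spend({"alice": 42.0, "bob": 0.3, "carol": 18.5})
```

The output is a conversation starter, not a scorecard: a flagged engineer may simply be in a spec-writing phase, which is exactly the human's job in the cycle.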


7. Lab Culture

The Lab has a distinct culture from the rest of the organization:

  • Mandatory curiosity — the question "what if we tried..." is always welcome
  • Aggressive monitoring — Lab members stay on top of the latest AI model advancements. When a new model or tool drops, the Lab tests it quickly and evaluates whether it's a game-changer. Waiting for things to "mature" is incompatible with the Lab.
  • Boldness in methods, rigor in commitments — the Lab pushes boundaries on how we work: what tools we adopt, what workflows we reinvent, what "naive" approaches we test. But contractual, economic, legal, and security obligations to customers remain non-negotiable. Boldness applies to the means, not the guarantees.
  • High risk, low stakes — Lab projects are chosen to tolerate failure. Use that to take risks you wouldn't take elsewhere
  • Radical transparency — failures are shared with as much detail as successes. A documented failure has more value than a silent success
  • Leadership means elevating the team — in the Lab, leadership isn't measured by individual performance. Leaders are those who make the rest of the team better: who share their discoveries, document their patterns, unblock their colleagues, and turn their expertise into reproducible practices. A brilliant engineer who keeps their methods to themselves is not a Lab leader.
  • Iteration speed — the spec → agent → scenario → evaluation loop must be fast. If an iteration takes days, the cycle is too heavy

8. Pitfalls to Avoid

  • Reverting to habits — the reflex to "manually check the code just to be sure" is exactly what the Lab forbids. If the scenarios pass, the code is validated by the scenarios.
  • Insufficient specifications — when the agent produces bad code, the problem is in the spec. Iterate on the spec, not the code.
  • Isolation — a Lab that doesn't share its learnings is a hobby, not a Lab.
  • Too-critical projects too early — the Lab has high risk tolerance. Don't put in a project whose failure endangers a customer or a contract.
  • Agent perfectionism — the goal isn't a perfect agent. It's an agent that produces value. Iterate.
  • Brownfield without spec extraction — transitioning an existing project without first extracting the implicit specification and writing scenarios that protect current behavior is flying without a net. The extraction is the hardest work — don't underestimate it.
  • "Half-Lab" brownfield — if part of the work on a brownfield project is done in traditional mode "because it's faster for that part," the project isn't in the Lab. The rules are absolute, even when it's uncomfortable.

9. Lifecycle

  • Phase 1: The Lab starts with greenfield projects (the most natural terrain) and begins transitioning 1-2 selected brownfield projects. Small team. Absolute rules in effect. Output = delivered projects + documented practices + working agents + brownfield transition playbook.
  • Phase 2: Brownfield projects transitioned in the Lab become the reference cases for the rest of engineering. Engineers who went through the Lab become the pairing partners for those who haven't. More brownfield projects enter the Lab.
  • End state: The Lab has absorbed engineering. The distinction disappears. Everything is Rung 5. The Lab was never a destination — it was the transition vehicle. Both greenfield AND brownfield projects operate under the same rules.

Summary Rule

Why am I doing this? The model should be doing it instead.

If the answer is "because I've always done it that way," that's exactly the reason to change.

