What does Sam Zoloth do?

Sam Zoloth is a multi-disciplinary product leader with 10+ years of experience driving outcomes at growth-stage startups and enterprise organizations. He specializes in product strategy, growth, and organizational transformation.

What companies has Sam worked with?

Sam has worked with DreamWorks Animation, Wasabi Technologies, HP, Cengage, Form Health, GoHunt, Echo Global Logistics, and SteamShip Authority across roles in product management, growth, and analytics.

How can I contact Sam Zoloth?

You can reach Sam via email at sam@samzoloth.com or connect on LinkedIn. He's currently open to product leadership opportunities.

What industries does Sam specialize in?

Sam has deep experience in B2B SaaS, cloud storage, enterprise software, digital media/entertainment, and healthcare tech. His expertise spans growth product, product strategy, and analytics.

Does Sam do consulting or contract work?

Yes, Sam is available for product strategy consulting, fractional product leadership, and organizational transformation engagements. Visit the services page for more information.

The Software Industrialization Thesis

What happens when building software costs nothing

Seven chapters tracing what AI does to software features, teams, business models, and competitive strategy. Stress-tested against eight canonical frameworks. ~19,603 words.

~86 min read

.md

Preface

Software development is industrializing. The craft era — where skilled individuals wrote code by hand — is being replaced by a factory model where humans design production systems and autonomous agents operate them. This shift commoditizes features, restructures teams around infrastructure rather than individual productivity, and compresses the human role to judgment: what to build, why, and whether the output is good enough.

That paragraph took me about a year to write.

Not because the words were hard. Because the argument underneath them kept growing. I started in early 2025 writing what I thought would be a single LinkedIn article about how AI was changing product strategy. That article became two. Then four. Then a 14-source industry report. Then a unified theory document I kept returning to at odd hours, testing each claim against a new piece of evidence, finding that the predictions kept holding.

The ideas in this book developed the way most honest arguments develop — not as a grand vision delivered whole, but as a series of encounters with specific problems that turned out to be connected. I noticed that features were getting easier to copy and wrote about what that meant for defensibility. I tried to scale my personal AI setup to a team and discovered the structural reasons it broke. I read the engineering disclosures from Stripe, Spotify, OpenAI, and Ramp and realized they'd all converged on the same architecture independently — which meant the pattern was stable, not accidental. I watched organizations adopt AI tools and saw the same failure modes repeat across companies that had never talked to each other.

Each piece went deeper than the last. Each one pulled on threads from the previous ones. At some point I stopped thinking of them as separate articles and started thinking of them as chapters in a cumulative argument about what happens when building software costs nothing.

That is what this book tries to lay out.

Who this is for

This is written for product leaders, engineering leaders, and founders who are past the "should we use AI?" conversation and into the harder one: what does AI's impact on software production mean for our strategy, our team structure, and our business model?

If you manage a product roadmap, the first two chapters will probably be uncomfortable. They argue that the thing most roadmaps optimize for — feature delivery — is becoming the least defensible category of work. Not irrelevant, but no longer strategic in the way it used to be.

If you lead an engineering organization, the middle chapters deal with the infrastructure and organizational shifts required to move from individual AI productivity (which doesn't scale) to team-level AI systems (which do, but demand different investments than most leaders expect).

If you're a founder or executive, the later chapters address the business model pressure that arrives when your competitors can rebuild your feature set in a quarter — and what the durable alternatives look like.

What this is not

This is not a how-to guide for AI coding tools. I won't walk you through prompt engineering or tell you which model to pick. The tools are changing too fast for that kind of advice to age well, and there are better resources for it anyway.

This is not a prediction about AGI timelines. I don't know when or whether we get artificial general intelligence, and I'm skeptical of anyone who claims to. The arguments here are grounded in what's already happening — production systems running today at organizations willing to publish their numbers. The thesis doesn't require any capability breakthroughs beyond what already exists.

This is not cheerleading or dooming. I'm not here to tell you AI will save software or destroy it. The industrialization of software is a structural shift, like the ones that hit manufacturing, media, and logistics before it. Structural shifts create winners and losers. They reward people who understand the mechanism and punish people who pretend it isn't happening. I'm trying to describe the mechanism clearly enough that you can decide what to do about it.

How to read this

The book is organized in three acts. Act I diagnoses what AI is doing to software — the commoditization of features, the hidden complexity of scaling AI-assisted development, the shift in the human role, and the emergence of the factory model. Act II stress-tests the diagnosis against eight canonical strategy frameworks and finds that every one of them reaches the same conclusion. Act III turns prescriptive — what to do about it, and where the business model goes next.

The chapters build on each other. The defensibility argument in Chapter 1 sets up the team infrastructure argument in Chapter 2, which sets up the role shift in Chapter 3 and the factory model in Chapter 4. You can read them independently — they started as standalone articles, and each one still works on its own — but the cumulative case is stronger than any individual piece.

I've kept the evidence specific. When I cite Stripe merging a thousand agent-written pull requests per week, or OpenAI's three-engineer team producing a million lines of code in five months, those numbers come from public engineering disclosures that you can verify. When I describe architectural patterns, I'm drawing from seven organizations that built these systems independently and converged on the same design. The convergence matters more than any single data point.

One last thing. The thesis itself is an example of its own argument. These chapters are content — features, in the language of Chapter 1. What's harder to replicate is the analytical framework underneath them, the operational experience that shaped them, and the judgment about which evidence matters and why. I've tried to make that judgment visible throughout. You'll have to decide whether it holds up.

Chapter 1

The End of Features

For years, software companies got to count build difficulty as strategy.

If a feature took a team of 10 engineers six months to ship, that difficulty bought you time. Competitors could copy the idea, but they still had to pay the same tax in headcount, coordination, and calendar time.

AI is stripping that protection out of the equation.

Product quality still matters. What changes is the thing many companies were calling a moat. A lot of the time it was just expensive production. When the cost of building falls fast enough, the strategic question changes. It stops being "what can we ship next?" and becomes "what do we have that someone else cannot rebuild in a quarter?"

The numbers make the shift concrete. Stripe's internal agent system — Minions — merges over a thousand agent-written pull requests per week. No human writes the code; humans review only. Spotify's background agent, Honk, has merged more than 1,500 PRs across hundreds of repositories, and roughly half of all Spotify PRs are now automated. Ramp reached 30% of all merged PRs coming from its agent within a couple months of launching. OpenAI built an entire internal product — approximately a million lines of code — with three engineers, zero manually-written code, in five months.

Those are not research demos. Those are production systems at companies processing real revenue. And the trajectory points in one direction: the cost of building features is collapsing toward the cost of describing them.

The map looks like this:

                    Easy to copy    Hard to copy
                 +---------------+---------------+
   Digital       |   Features    |   Networks    |
   (bits)        |   and code    |   and data    |
                 +---------------+---------------+
   Physical      |   Hardware    |   Atoms and   |
   (atoms)       |   specs       | relationships |
                 +---------------+---------------+

Most product teams still spend most of their time in the upper left.

Why this changed

The easiest way to misread AI is to think it mainly makes software teams more productive. It does that. The bigger change is that it makes feature production cheaper, faster, and easier to replicate.

The signs are already here:

Frontier models can produce production-grade code inside real repositories.
Background agents now work in isolated environments, run tests, and open pull requests without a human sitting there.
Open-source orchestration blueprints are turning what used to require a platform team into something a motivated engineer can assemble over a weekend.

Once that happens, a feature stops looking like a strategic asset and starts looking like manufactured output. A useful one, sometimes a necessary one, but not the thing that protects the business.

Ben Thompson's Aggregation Theory helps explain the mechanism. Thompson's core insight is about what happens to value chains when a layer of the stack gets commoditized. When distribution was expensive, newspapers could charge advertisers because they controlled access to readers. When the internet made distribution free, value migrated to whoever could aggregate demand — Google, Facebook — and away from the suppliers of interchangeable content.

The same structural logic applies to software features. For decades, the "supply" side of software — the actual building — was expensive enough to function as a barrier. You needed engineers, time, and coordination. That expense meant the supply of any given feature was limited, which gave the companies that built them pricing power.

AI is doing to software features what the internet did to media content: making supply abundant. When supply gets abundant, Thompson's framework predicts that value migrates toward demand aggregation and away from supply. The companies that own user relationships, distribution, and data — the demand side — gain leverage. The companies whose value proposition is "we built this feature so you don't have to" lose it, because the cost of "building it yourself" is dropping toward zero.

This doesn't mean features become worthless. Content didn't become worthless when distribution got cheap. But content stopped being the defensible layer. It became the commodity input to someone else's aggregation model. Features are heading to the same place.

The speed of the shift matters too. Cursor's research system peaked at a thousand commits per hour running a multi-agent harness continuously for a week. OpenAI's team averaged 3.5 PRs per engineer per day — and the engineers weren't writing code, they were reviewing it. These rates mean that a startup can ship on Monday and face a functional clone by Tuesday. The calendar-time moat that used to protect complex features has been sanded down to almost nothing.

What survives

Some assets get stronger when code gets cheaper because they were never made of code in the first place.

Networks

You cannot prompt your way into a social graph. Spotify's recommendation strength is tied to hundreds of millions of listening histories. Meta's position is tied to the topology of human relationships. The product surface matters, but the compounding asset underneath it matters more.

Networks have a specific property that makes them resilient to feature commoditization: they get more valuable as more people use them, and that value is locked inside the network itself. You can clone Spotify's interface. You cannot clone the behavioral data of 600 million users making daily listening choices. The gap between "what the product looks like" and "what makes the product work" widens every day those users keep listening.

Data flywheels

Google's edge is not a single search feature. It is the loop between usage, ranking signals, and model quality. NVIDIA's position is not just CUDA as an API. It is the accumulation of libraries, workflows, and institutional habits built around it. These assets improve as they are used.

The flywheel distinction matters because data flywheels create a compounding advantage that accelerates over time. A feature, even a brilliant one, depreciates. Someone copies it or builds something better. A data flywheel does the opposite — each interaction makes the next interaction more valuable. When AI makes the feature layer cheap, the relative value of the flywheel layer increases. The gap between "we have a good product" and "we have a good product built on a decade of usage data" becomes the whole ballgame.

Relationships and embeddedness

Enterprise software survives on more than UI. Procurement history, compliance posture, workflow integration, and trust built over years all create switching costs. Microsoft is not hard to displace because its interfaces are impossible to copy. It is hard to displace because thousands of organizations have already bent themselves around it.

This is worth dwelling on because it's the moat least visible to product teams. Nobody ships "embeddedness" as a feature. It accumulates through years of integrations, training investments, compliance certifications, and organizational muscle memory. An AI agent can replicate a competitor's dashboard overnight. It cannot replicate the fact that a customer's entire procurement workflow, approval chain, and audit trail run through your system.

Physical infrastructure

Data centers, fabs, fleets, logistics networks, and the capital structures behind them do not become easy to copy because code generation improved. You can clone a dashboard. You cannot clone a fabrication plant on a weekend.

These are the assets I would bet on in a world where feature production gets cheap.

The Aggregation Theory lens

It's worth making the Thompson connection more explicit, because it clarifies which companies are in trouble and which aren't.

In Thompson's framework, pre-internet media value chains had three layers: suppliers (journalists, studios), distributors (newspapers, TV networks), and consumers. Distributors captured most of the value because they controlled access to consumers. The internet blew up distribution costs, which commoditized the distributors and shifted value to aggregators — platforms that sat between abundant supply and consumer demand.

Software is developing an analogous structure. The three layers are: builders (engineering teams), products (the software itself), and users. For decades, the building layer was expensive enough to function as a bottleneck. Products captured value because they were hard to create. AI is removing that bottleneck. Building is getting cheap. When building gets cheap, the product layer — the feature set — starts to look like content in the media analogy: abundant, interchangeable, and no longer the defensible layer.

Where does value migrate? Toward the layers that still have scarcity. User relationships. Distribution. Proprietary data. Regulatory position. Physical infrastructure. The things on the right side of the defensibility matrix.

This is why "ship features faster" is a trap. It's the equivalent of a newspaper saying "write more articles" in 2005. More supply doesn't help when supply is no longer scarce. What helps is controlling the layers that are still scarce.

Why this is uncomfortable

Most product organizations are not staffed, measured, or rewarded around those assets. They are rewarded around visible shipment.

Roadmaps make this obvious. The work that gets celebrated is new screens, integrations, automation flows, and launch announcements. The work that compounds a data advantage, deepens a customer relationship, or increases workflow embeddedness often gets treated like supporting detail.

That was always a mild problem. It becomes a strategic one when the visible work gets easier for everyone at once.

If your product strategy is mostly "we ship a better set of features than the next company," AI is attacking the part of the stack you rely on most. Faster execution helps, but it helps your competitors too. The result is not no progress. The result is faster convergence — everyone arrives at the same feature set quicker, and the window during which your feature advantage matters shrinks from years to months to weeks.

The companies most exposed are pure-feature SaaS businesses — the ones whose entire value proposition is "we built this so you don't have to." If an AI orchestration layer can build it for the customer directly, tailored to their specific workflow, the per-seat middleman loses pricing power. Not immediately. Not completely. But the pressure is structural and it only moves in one direction.

There's a flip side, though, and it's worth naming. The same force that makes features indefensible for companies makes them accessible for individuals and small teams. Purpose-built software for specific workflows becomes viable at a scale of one. The person who understands a problem deeply can now build the solution directly, without needing to convince a product team to prioritize it or a company to fund it. That democratization is real, and it matters.

Boundaries

Features are not irrelevant. Users still buy products, not abstract moats.

Taste does not disappear. Product judgment still matters. A clean workflow, a strong opinion about how work should happen, and a better interface can create real value.

Not every business needs a network effect or a fab. Some companies will survive through deep vertical specialization, regulated workflow integration, or proprietary operating data that is small in scale but rich in value.

The narrower claim is simpler: features without something compounding underneath them are weaker bets than they used to be.

I think that's right.

How to use this

If you are a founder or product leader, audit the roadmap in two passes.

First pass: what are we shipping because users need it right now?

Second pass: what are we shipping that becomes harder to replace the more it is used?

Those are different categories of work. Both matter. Only one deserves to be called strategy.

The simplest test is brutal:

If a competitor shipped this same feature in 90 days, would our position materially change?
If users could recreate a rough version with the same AI tooling we have, would we still have an advantage?
If this work succeeds, what compounds?

If the answer to the last question is "nothing," you may still need to build it. Users still need the thing. But you should stop mistaking necessity for defensibility.

That is the real shift. The upper-left quadrant does not go away. It just gets demoted. Features become the price of entry. The durable part of the business has to live somewhere else.

Chapter 2

The Hidden Complexity

My personal AI setup made me much faster.

Then I tried to share it with a team, and it made everyone slower.

That is where the category error showed up. The same ingredients that made one person fast made the group harder to coordinate. My assumptions were invisible. My shortcuts were private. My prompts encoded decisions nobody else had agreed to. The setup worked because it was mine. That was exactly why it did not scale.

This is the part of team AI adoption most organizations still miss: individual AI productivity and team AI collaboration are different problems. If you solve the first one without redesigning for the second, you usually create confusion faster than output.

The previous chapter argued that features are commoditizing — that the strategic value of building software is collapsing. This chapter deals with the organizational consequence: if you try to capture that speed at team scale using the same approach that works for individuals, you hit a wall. And the wall isn't about models or prompts. It's structural.

Three ways this breaks

The failure mode is not "people need better prompts." It is structural.

Context redundancy

Five engineers with five AI setups do not create a shared system. They create five local realities.

One person's prompts assume service boundaries that another person has never accepted. One person's local automations nudge the agent toward a style the rest of the codebase does not use. The mismatch stays hidden until AI-generated work collides in code review.

The team gets more output and less coherence.

This is a familiar problem wearing new clothes. Software teams have always struggled with alignment — that's what style guides, architecture reviews, and coding standards exist to address. But AI amplifies the divergence rate. A human engineer drifting from team conventions produces drift at human speed. An AI agent drifting from team conventions produces drift at machine speed, across more files, in less time. Five engineers with five different AI setups can generate five different interpretations of the codebase in an afternoon.

Ramp's experience is instructive here. When their internal agent reached 30% of all merged pull requests, the context redundancy problem became impossible to ignore. Their solution wasn't to standardize everyone's prompts — it was to give agents access to the same systems humans already depend on: repos, CI, data stores, internal tools. Context became part of the environment instead of something each developer re-explained by hand. The lesson: at 30% of your PRs, personal context becomes organizational liability.

Context decay

AI helps teams create decisions faster than they can validate them. Architectural notes, coding conventions, automation shortcuts, and half-proven assumptions spread through the system before anyone has time to check whether they still hold.

That is not a theory problem. It is an operating problem. Spotify's engineering team has been unusually transparent about this. Their research on background coding agents surfaced a specific pattern: architectural evidence goes stale faster than teams expect, and most teams discover the rot late — during incidents or cleanup — instead of early, when it's cheap to fix.

The decay rate matters because AI-assisted development accelerates both the creation of decisions and the creation of artifacts that encode those decisions. A team without AI might update an architecture document quarterly. A team with AI agents might produce dozens of PRs per day, each one embedding assumptions about the architecture. If the underlying architectural decisions shift — and they always do — the gap between "what the code assumes" and "what's actually true" widens faster than anyone can manually reconcile.

Spotify quantifies one dimension of this: their LLM-as-judge verification layer vetoes roughly 25% of agent sessions. A quarter of the time, the agent produced code that looked correct but didn't match the actual intent. That veto rate is the visible surface of context decay — the cases where automated checking caught the drift. The invisible surface is harder to measure: the PRs that passed review but embedded slightly stale assumptions that won't surface until something breaks in production.

OpenAI attacked this problem from a different angle. Their team used to spend every Friday — 20% of the engineering week — manually cleaning up what they called "AI slop": code that was technically correct but architecturally drifting. That didn't scale, so they automated the cleanup itself, running a recurring set of background agents that scan for deviations, update quality grades, and open targeted refactoring PRs. The context decay didn't stop. They just built machinery to counteract it continuously.

Capability silos

Every team trying to scale AI eventually hears some version of the same complaint: "Wait, we already built a tool for that?"

Someone has a great deployment helper. Someone else has a useful review prompt. A third person wired the assistant into CI. None of it propagates cleanly, because personal setups are bad distribution channels for team capability.

The irony is sharp. The tools are supposed to increase shared capability. Used carelessly, they fragment it.

This fragmentation has a compounding cost. Each siloed capability represents duplicated effort — multiple people solving the same problem independently. But worse, it represents divergent solutions to the same problem, which means the team's shared understanding of "how we do things" fractures. When three people have three different deployment helpers, the team doesn't have a deployment process. It has three personal rituals.

The wrong rollout

The default rollout pattern is easy to recognize:

Create a shared prompt doc
Dump tips into Notion
Tell everyone to write a CLAUDE.md
Maybe collect a few reusable snippets

This feels organized because it produces artifacts. It does not produce shared reality.

The underlying problem remains. Truth still lives in private setups, old docs, tribal knowledge, and whatever the most motivated person remembers to update.

OpenAI tried the centralized-document approach and documented exactly why it failed. They started with a single large AGENTS.md file meant to be the canonical source of guidance for their coding agents. The problems were predictable:

Hard to verify — drift from reality is inevitable in a monolithic document.
Rots instantly — a single manual becomes a graveyard of stale rules.
Too much guidance becomes non-guidance — when everything is "important," nothing is.
Crowds out actual task context from the agent's context window.

The failure mode is worth naming precisely because it's so tempting. A shared document feels like infrastructure. It has the shape of a solution. But a document that nobody maintains, that drifts from the codebase it describes, and that competes with task-specific context for the agent's limited attention — that document is worse than no document, because it gives the team false confidence that shared context exists.

The clearest public example I have seen is not a prompt library. It is infrastructure.

Ramp's internal agent adoption got attention because of the number: roughly 30% of merged pull requests. The deeper lesson is architectural. They did not try to make everyone copy one person's local setup. They gave agents access to the same systems humans already depend on: repos, CI, data stores, internal tools, operational surfaces. Context became part of the environment instead of something employees had to keep re-explaining by hand.

That is the right direction.

Spotify's approach offers a different but complementary lesson. Their background agent, Honk, handles fleet-wide migrations — applying the same code transformation across hundreds or thousands of repositories. The scale forced a decision that individual-level AI usage never requires: you have to make context mechanical. Spotify chose large static prompts, version-controlled and testable, with deliberately limited tool access. They restricted their agent to a verify tool (formatters, linters, tests), a Git tool (with restricted subcommands), and a strict Bash allowlist. The reasoning: "the more tools you have, the more dimensions of unpredictability you introduce."

Stripe went the opposite direction — their MCP server "Toolshed" provides nearly 500 internal tools, from documentation to tickets to Sourcegraph code search to build status and feature flags. The maximalist approach works for Stripe because their agents operate inside a massive monorepo handling diverse, complex tasks. Spotify's minimalist approach works because their agents need to produce predictable, mergeable PRs across thousands of different repos.

The difference isn't right vs. wrong. It's that both teams made the decision consciously, at the infrastructure level, rather than letting individual developers' personal setups create the answer by accident.

At team scale, the thing to share is not "how Sam likes to prompt." The thing to share is the machinery that tells both humans and agents what is true.

A four-layer model for team AI

This is the cleanest model I know for separating what should be shared from what should stay personal:

Layer 4: Personal preferences
Layer 3: Team capabilities
Layer 2: Project context
Layer 1: Live infrastructure

Layer 1: Live infrastructure

This is the ground truth layer. CI status. Monitoring. Documentation search. Deployment state. Tickets. Build outputs. Internal APIs.

This layer answers: what is true right now?

It should come from systems, not from prose whenever possible.

At Stripe, Layer 1 is extraordinarily thick. Their 500 MCP tools mean an agent can check documentation, query Sourcegraph for code patterns, read feature flag status, and look up build results — all without a human pre-loading that context into a prompt. The agent discovers what's true the same way a human engineer would: by querying the systems that hold the truth.

At Spotify, Layer 1 is deliberately thin. They keep context gathering in the prompt itself rather than in dynamic tool calls. The agent gets a large, carefully constructed static prompt and a minimal set of verification tools. The tradeoff: less flexibility, more predictability. When you're applying changes across a thousand repos, predictability wins.

At OpenAI, Layer 1 is wired into the product itself. They connected Chrome DevTools Protocol into the agent runtime and exposed logs via LogQL and metrics via PromQL, so agents could validate their own work by actually interacting with the running application. When code throughput got so high that human QA became the bottleneck, they responded by making the application legible to agents — turning the product into its own verification system.

The common thread: Layer 1 is mechanical. It's not a document someone wrote describing the system. It's the system itself, made accessible to agents through tools, APIs, or structured interfaces.

Layer 2: Project context

This is where the durable but slower-moving guidance lives. Build commands. Architecture notes. Naming conventions. Dependency relationships. Rules that are specific to this repo or project.

This layer answers: how do we do things here?

It belongs in version control.

OpenAI's solution to their failed monolithic AGENTS.md was to restructure Layer 2 as progressive disclosure. Instead of one giant document, they created a small, stable entry point (~100 lines) that functions as a table of contents, pointing to a structured docs/ directory that serves as the system of record. Agents start with the entry point and navigate deeper as needed. They enforce freshness mechanically — linters and CI jobs check for drift, and a recurring "doc-gardening" agent finds stale documentation and opens fix-up PRs.

The progressive disclosure pattern solves a real problem with context windows. An agent working on a frontend component doesn't need the backend architecture document. An agent doing a database migration doesn't need the CSS conventions. By structuring Layer 2 as a navigable tree rather than a flat dump, the agent's limited context budget gets spent on the information that actually matters for the current task.

Layer 3: Team capabilities

This is reusable know-how. Review workflows. Templates. Common automations. Shared skills. Things that travel across projects because they encode how the team works, not how one repo works.

This layer answers: what can we reliably do together?

It should be curated, not allowed to sprawl.

Spotify's fleet management system is the most mature example of Layer 3 in production. It applies the same agent-driven code transformation across thousands of software components — the same migration pattern, the same verification criteria, the same PR structure — and tracks merge status centrally. The fleet model takes an individual capability (an agent that can do a code migration) and turns it into an organizational capability (a system that can apply that migration everywhere, monitor progress, and report results).

The curation requirement is real. Layer 3 is where teams most often accumulate junk — a shared Notion page of "useful prompts," a sprawling wiki of tips, a channel of "things that worked for me." Without active curation, Layer 3 becomes a graveyard. The test is simple: can a new team member find and use this capability on day one? If it requires oral tradition to discover or tribal knowledge to operate, it hasn't graduated from Layer 4.

Layer 4: Personal preferences

This is where private speed belongs. Shortcuts. Style preferences. Memory systems. Favorite prompts. Local helpers.

This layer answers: how do I like to work?

It should stay private unless there is a real reason to promote something upward.

This is where teams usually slip. Most failed AI rollouts are really Layer 4 leakage. Someone mistakes personal optimization for team infrastructure. Their setup works beautifully — for them. They share it with good intentions. It creates confusion, because the assumptions encoded in that setup are invisible to everyone else.

The fix is not to ban personal experimentation. Teams need people pushing the edges. The fix is to make the boundary explicit. When someone discovers something genuinely useful at Layer 4, the path should be clear: propose it for Layer 3 (team capability) or Layer 2 (project context), which means making it discoverable, documented, and maintained. If it can't survive that translation, it stays personal.

How to tell whether your setup scales

A simple test works better than most maturity models:

Can a new team member become productive with your AI setup on day one?

If yes, your context probably lives in systems they can access, docs they can trust, and capabilities they can discover.

If no, you did not build team infrastructure. You built impressive personal tooling with a shared veneer.

This test also catches another problem: stale context. If onboarding requires oral tradition, side-channel explanations, or a long tour of "things that are not written down anywhere," the team has already lost control of its knowledge surface.

Ona/Gitpod calls this the "false summit" problem: you rolled out coding agents, engineers are faster, PRs flood in, yet cycle time doesn't budge, DORA metrics are flat, the backlog grows. The gains compound with the individual, not the organization. That gap — between individual speed and organizational velocity — is exactly what the four-layer model is designed to close.

Boundaries

This does not require everyone to use the same model, the same editor, or the same personal workflow.

It also does not mean more documentation solves the problem. Static docs rot too. OpenAI's experience proves this — they went from monolithic documentation to progressive disclosure enforced by automated gardening because static docs alone were worse than useless.

And it definitely does not mean personal experimentation is bad. Teams need people pushing the edges.

It means the boundary between personal optimization and team infrastructure has to be explicit. Share the layers that create common truth. Let the rest stay local.

Where to start

Do not start with a giant operating manual.

Start with one project and make Layer 2 real:

Build and test commands
Major architectural constraints
Naming and directory conventions
Links to the systems that tell the current truth

Then improve Layer 1:

Connect AI tools to CI
Connect them to logs or monitoring
Connect them to your documentation source of record

Only after those are working should you spend much time formalizing Layer 3.

The rule that survives contact with a real team is short:

Share infrastructure, not prompts.

Prompts still matter. Personal experimentation still matters. But if humans and agents are operating from different versions of reality, no prompt library is going to save you.

Chapter 3

The Steersman

The previous chapter dealt with the team — what infrastructure you need when AI stops being one person's trick and becomes an organizational capability. This chapter zooms in on the person. When the team's production system is running, what does the individual actually do?

An engineer opens the laptop and finds three pull requests waiting.

One fixes a flaky test. One rewrites a service method. One failed because the agent missed a contract edge case. The engineer does not start by typing code. They review the diffs, tighten a test, reject one approach, rewrite the constraint that produced the bad result, and kick off another run.

A few years ago, that would have sounded strange. Now it sounds increasingly normal.

The deeper change AI is creating in software is not better autocomplete. It is a shift in where the human sits relative to the work. The engineer is moving out of the middle of the execution loop and up into direction, constraint-setting, and correction.

Norbert Wiener had a word for this kind of role: steersman.

Closing the loop

Wiener helped pioneer cybernetics while working on anti-aircraft fire control during World War II. The key move was not training humans to react faster. It was closing the loop between sensing, prediction, and action so the system could correct itself continuously. The human still chose targets and intervened when needed. The mechanism handled the tracking.

Software is moving in the same direction.

At companies already using background agents in production, parts of the development loop now run without constant human supervision. Agents pick up tasks, make changes inside isolated environments, run tests, and open pull requests. Humans review the result, adjust the rules, and decide what deserves another cycle.

That is a different role than "person who writes most of the code by hand."

The shift is visible in how organizations describe their own engineers. OpenAI writes that humans now "interact with the system almost entirely through prompts: an engineer describes a task, runs the agent, and allows it to open a pull request." Stripe's engineering blog talks about agents that produce end-to-end pull requests — more than a thousand merged per week — with humans reviewing, not writing. Spotify has documented background agents handling migrations across hundreds of repositories while engineers define the transformation and verify the output.

Different orgs, same loop structure. The machine acts. The human steers.

What changed on the ground

The strongest evidence is operational, not philosophical.

I have run seven onboarding sessions across two companies and a personal network over the last three months, introducing people to AI-assisted development. The sessions ranged from individual engineers to a Center of Excellence with 65 people. Eight patterns emerged, and most of them point directly at the steersman role — or at the gap between knowing it exists and being able to occupy it.

The first pattern is that the adoption curve is real and predictable. People move through stages: individual agent use, then swarms of agents, then orchestration across agents, then system design that shapes how agents operate. Most people I worked with were between stage zero and stage one. They had the tools. They had not yet internalized the new posture.

The second pattern — and I think the most important — is what I have started calling the blank input problem. The primary blocker to effective AI use is not model quality. It is not cost. It is not access. It is that people do not know what to ask.

A PM with access to the best tools in the world sits there with a blinking cursor and freezes. An engineer who can write code fluently stares at the chat window because specifying intent — knowing what you actually want, clearly enough for a machine to do it — turns out to be a different skill than doing the work yourself. I watched a CoE with 65 people at a major enterprise where everyone had tool access and nobody knew what to hand off.

This is the blank input problem, and it is the steersman's first real challenge. You cannot steer if you cannot articulate where you want to go. The craft model let you think with your hands — you discovered the solution in the act of building it. The steersman model requires you to know (or at least roughly frame) what you want before the machine starts working. That is a harder cognitive task than most people expect.

The third pattern reinforces the steersman framing from a different angle: constraints enable exploration, they do not slow it. The sessions where people adopted fastest were the ones where I set up guardrails first — version control, sandboxed environments, reversible actions. Once people trusted that mistakes were cheap to undo, they tried more things. The steersman needs a system that is safe to steer aggressively. Without that, people default to the craft model because it feels more controllable.

What the steersman keeps

The human still matters. The question is where.

I think the remaining work clusters in three areas. This clustering is already visible in how the most advanced organizations describe what their engineers actually do, and it maps cleanly onto what Wiener was describing: the human retains the parts of the loop that require judgment about the world outside the system.

Problem selection

Which problems deserve attention? Which ones are symptoms and which ones are causes? Which tradeoff matters right now?

Agents can surface issues. They can summarize logs. They can rank failing tests. They still struggle with the social and product judgment wrapped around priority. Someone has to decide what is worth aiming at.

OpenAI's description of their engineers' work after adopting background agents is instructive. Their people spend time "breaking down goals into building blocks" and "identifying missing capabilities when agents struggle." The question they ask is not "how do I implement this?" but "what capability is missing, and how do I make it legible and enforceable for the agent?" That is problem selection operating at a different altitude.

Constraint definition

The agent needs a shape for good work before it can produce good work consistently.

What architecture should this preserve? What performance budget matters? What should never be touched? What is acceptable debt and what is not?

A lot of engineering judgment lives here already. AI just makes that surface explicit. Things senior engineers used to carry tacitly now have to be encoded in docs, tests, linters, review standards, and prompts that survive contact with the rest of the team.

Spotify's approach to background agents illustrates this well. They deliberately limit tool access — a verify tool, a git tool with restricted subcommands, and a strict bash allowlist. They prefer large static prompts that are version-controlled and testable. The constraint surface is narrow on purpose, because "the more tools you have, the more dimensions of unpredictability you introduce." The steersman at Spotify is not giving the agent maximum freedom. They are giving it a tight channel that produces predictable, mergeable pull requests across thousands of repositories.

Stripe takes the opposite approach — nearly 500 internal tools exposed to the agent through a single MCP server — but the principle is the same. Someone designed that tool surface. Someone decided what the agent could reach and what it could not. The constraint definition is just wider.

OpenAI goes further. They describe encoding "human taste" into the system continuously: review comments become documentation updates, documentation updates become linter rules, linter rules become architectural enforcement. "When documentation falls short, we promote the rule into code." That is constraint definition as a ratchet — judgment gets sanded into the system one correction at a time, and the system gets better at producing work that matches the team's standards without being told each time.

Drift correction

This is the part people underestimate.

An agent can produce code that compiles, passes tests, and is still wrong. It can solve the local problem while harming the system. It can obey the letter of the request while missing the intent.

The steersman's job is noticing the gap between output and intention, then correcting the course. Sometimes that means editing code. More often it means editing the environment that produced the code: the spec, the tests, the docs, the guardrails.

Spotify's verification system is built around this insight. They identify three failure modes in order of severity. An agent that fails to produce a PR is a minor annoyance — easily retried. An agent that produces a PR failing CI is frustrating but visible. The most dangerous failure is an agent that produces a PR that passes CI but is functionally incorrect. It looks right. It compiles. It merges. And it breaks something downstream that nobody catches until production.

Their response is layered verification: deterministic verifiers that activate automatically based on codebase content, then an LLM-as-judge that evaluates whether the agent stayed within scope. The judge vetoes about 25% of sessions. When vetoed, agents self-correct roughly half the time.

That 25% veto rate is the steersman in action — not writing code, but catching drift before it compounds.

OpenAI's team experienced this at scale. Their code throughput became so high that the bottleneck shifted to human QA capacity. They used to spend every Friday — 20% of the week — cleaning up what they called "AI slop" manually. That did not scale. So they automated the correction too: recurring background agents that scan for deviations, update quality grades, and open targeted refactoring PRs. Drift correction feeding back into the system that produces the drift.

What to practice now

If this role shift is real, the skill stack changes with it.

The engineers who get stronger here are not the ones who merely type faster with AI. They are the ones who get better at:

framing a problem crisply enough that a machine can act on it
writing constraints that survive execution across diverse codebases
reviewing generated work for correctness and fit, not just compilation
deciding what needs human judgment and what does not
turning repeated review comments into reusable system rules

That last point matters a lot. In the craft model, taste often stayed inside the person — an accumulated intuition about what "good" looked like, never fully externalized. In the steersman model, good teams gradually push taste into the system. Every correction becomes a potential rule. Every rule that sticks reduces the need for that correction next time.

The blank input problem suggests another skill worth developing: the ability to decompose intent. People who freeze at the empty prompt often have the judgment — they know what good work looks like — but lack the habit of articulating it in a form the machine can use. That is a trainable skill. It is also, I think, the main thing onboarding programs should focus on. Not "how to use the tool" but "how to know what to hand the tool."

Boundaries

Engineers do not disappear.

Implementation does not stop mattering. Someone still has to know whether the implementation is sound. The steersman who cannot read code is not steering — they are guessing.

And this is not the same thing as turning engineering into management. Steering is not generic supervision. It is technical judgment applied at a different layer. OpenAI's engineers are not managing agents the way a project manager manages people. They are identifying missing capabilities, encoding architectural invariants, and designing the verification systems that catch bad output. That requires deep technical knowledge. It just deploys that knowledge differently.

The boundary will not hold forever either. Some of what I am calling "human work" today will automate further. That is already happening. Problem selection is getting more instrumented — agents can surface signals from monitoring and user feedback. Constraint definition is getting more codified — linters and architectural tests encode what used to be judgment calls. Drift correction is getting better tooling — LLM-as-judge systems already catch a quarter of bad output before a human sees it.

Each of those three areas is partially automating. The layers keep collapsing. The steersman model is useful because it describes the current transition honestly. It should not be mistaken for a permanent settlement.

The real adjustment

The hardest part is psychological.

For a long time, being a strong engineer meant being deep inside the loop. You knew the code because you wrote it. You trusted the path because you walked it yourself. AI breaks that identity before it breaks the job.

That is why some of the resistance feels emotional even when people use technical language. The discomfort includes quality concerns, but proximity is the deeper issue. People are being pushed one layer up the stack and are not sure whether that is advancement, loss, or both.

I think it is both.

You lose some intimacy with implementation. You gain influence over a larger system. The question is whether you can operate comfortably at that new level without pretending the old one still defines the job.

I saw this in every onboarding session. The people who adopted fastest were not the most technically skilled. They were the ones most willing to let go of the feeling that they should be the one typing. The ones who stalled were often excellent engineers who could not shake the sense that delegating to a machine meant losing something important about their work. They were not wrong about the loss. They were wrong about what it meant.

Tomorrow morning there will still be pull requests waiting. The difference is that more engineers will meet them the way a steersman meets a current: by setting direction, correcting drift, and deciding what deserves force in the first place.

Chapter 4

The Software Industrial Complex

The steersman is one person at one helm. But what happens when the production system grows beyond what any individual can steer?

People still talk about AI and software as if it were a craftsperson with better tools. That is no longer the whole picture.

At several leading engineering organizations, agents now pick up tasks, work inside isolated environments, run verification, and hand humans pull requests for review. Stripe has written about more than a thousand merged agent-written pull requests per week. Spotify has documented large-scale migration work handled by background agents. OpenAI has described internal systems where humans increasingly specify intent and review output instead of writing every line by hand.

Those are not better chisels for the same craft. They are early software factories.

What makes this different from copilots

Copilots help while you watch. Factories run while you are elsewhere.

That difference sounds small until you look at the operating model.

An interactive assistant depends on your laptop, your attention, and your session. A background agent depends on none of them. It runs on remote infrastructure. It picks up work from a trigger. It uses a full toolchain. It returns an artifact later, usually as a pull request or a failure report.

Ramp captures the distinction cleanly: "A coding agent needs your machine and your attention. A background agent needs neither. It runs in its own development environment in the cloud: full toolchain, test suite, everything. Completely decoupled from your device and your session."

The shift is not mainly about model intelligence. It is about the surrounding system — the sandboxes, the triggers, the verification gates, the context injection, the output channels. A copilot is a better chisel. A factory is a production line with the chisel embedded inside it.

The factory pattern

Across the public examples, the architecture is remarkably consistent. Six organizations built these systems independently — different companies, different codebases, different engineering cultures — and converged on the same skeleton:

A trigger creates work.
The work runs inside an isolated environment.
Context is injected from docs, tools, and repository knowledge.
An agent executes the task.
Verification gates test the result.
A human reviews the output or the system retries.

That is a production line.

Each stage deserves a closer look, because the details reveal how much engineering goes into making the factory run — and how much of that engineering has nothing to do with the AI model itself.

Triggers

Removing the human from the invocation loop is what turns a coding agent into a background agent. The trigger is what makes the factory autonomous.

Stripe, Spotify, and Ramp all use Slack as a primary invocation surface — an engineer drops a message, and an agent picks it up. But triggers go well beyond chat. GitHub events (a PR opened, a CI failure, a review comment) can kick off agent runs. Scheduled cron jobs handle dependency updates and lint sweeps. Ticket assignment routes Jira or Linear issues directly to agents.

OpenAI's Symphony project takes this furthest. Its orchestrator continuously polls Linear on a configurable cadence — default thirty seconds — and dispatches eligible issues automatically based on state, priority, and concurrency limits. Issues move through a state machine: unclaimed, claimed, running, retry-queued, released. Blocked issues with unresolved upstream dependencies are held automatically. The human creates the ticket. The factory does the rest.

Paperclip goes further still, with agents activating on heartbeat schedules, checking their own work queues, and acting autonomously within set parameters. The human operates as a "board of directors" — approving strategy, not dispatching tasks.

Sandboxes

Every organization runs agents in isolated environments. Never on developer laptops.

Stripe uses pre-warmed "devboxes" on AWS EC2 that spin up in ten seconds — identical to human engineer environments but walled off from production and the internet. Ramp runs each session on Modal sandboxes with filesystem snapshots, images rebuilt every thirty minutes so repos are never more than half an hour out of date. Spotify runs agents in containers with limited permissions, few binaries, and virtually no access to surrounding systems. OpenAI makes the app bootable per git worktree so each agent gets a fully isolated instance.

The sandbox is not a nice-to-have. It is the factory floor. Without it, an agent writing code on a developer's machine is just a faster version of the developer. With it, you can run dozens of agents in parallel, each working on a separate task, each unable to interfere with the others or with production. The parallelism is what makes the factory a factory.

Context injection

Every organization identifies context engineering — telling agents what to do and giving them the information to do it well — as the single most important factor in agent success.

OpenAI tried a single large AGENTS.md file and watched it fail. It was hard to verify, rotted instantly, and crowded real task context out of the context window. Their solution: treat AGENTS.md as a table of contents — roughly a hundred lines — pointing to a structured docs directory that serves as the system of record. Progressive disclosure. Agents start with a small, stable entry point and navigate deeper as needed. They enforce freshness mechanically with linters and CI jobs, plus a recurring "doc-gardening" agent that finds stale documentation and opens fix-up PRs.

Spotify takes a different path. They prefer large static prompts — version-controlled, testable, evaluable — with context baked in rather than fetched dynamically from tools. Their reasoning: "the more tools you have, the more dimensions of unpredictability you introduce." The prompt is the context.

Both approaches work. The common principle is that context cannot be informal. It has to be versioned, maintained, and mechanically sound. The factory runs on context the way a physical factory runs on raw materials. Bad inputs produce bad outputs regardless of how good the machinery is.

Verification

Producing code is the easy part. Knowing whether that code is correct — that is where the real engineering investment goes.

Spotify built a layered verification system. Deterministic verifiers activate automatically based on what they find in the codebase (a Maven verifier triggers when it sees pom.xml). The agent does not know what the verifier does — it just calls a "verify" tool. After all deterministic checks pass, an LLM-as-judge takes the diff and the original prompt and evaluates whether the agent stayed within scope. The judge vetoes about 25% of sessions. A stop hook blocks any PR that fails any verifier.

OpenAI encodes correctness into repository structure itself. Each business domain follows fixed architectural layers with strictly validated dependency directions. Custom linters enforce naming conventions, file size limits, and reliability requirements. Error messages are written to inject remediation instructions directly into agent context — the linter does not just say "wrong," it says "here is how to fix it" in language the agent can act on.

Stripe runs local linting in under five seconds, then allows a maximum of two CI rounds with auto-applied fixes. With over three million tests available, they use targeted subsets for fast agent-side feedback before pushing to full CI.

Cursor made an explicit design decision to accept some error rate in exchange for throughput: "When we required 100% correctness before every single commit, it caused major serialization and slowdowns." Their fix was a "green branch" with periodic reconciliation passes — agents trust that other agents will clean up small errors soon.

The right tradeoff depends on context. Spotify applies changes across thousands of repositories where a bad merge can break production at scale — correctness wins. Cursor was running a research project where speed of exploration mattered more — throughput wins. But every organization built verification infrastructure. Nobody ships agent output without checking it.

Single-agent vs. multi-agent

The organizations split on whether to use one agent per task or coordinate multiple agents on larger goals.

Stripe, Spotify, and Ramp all run single-agent systems. One agent, one task, one PR. Parallelism comes from running many independent single-agent sessions simultaneously. This is simpler and more predictable. Spotify's fleet management system applies the same agent-driven transformation across hundreds of software components, opening PRs automatically and tracking merge status — but each individual agent works alone.

Cursor and OpenAI run multi-agent systems with hierarchical coordination. Cursor's final architecture uses a recursive structure: a root planner that owns the entire scope and does no coding, subplanners that take narrower slices and can delegate further, and workers that pick up tasks and drive them to completion. Workers are unaware of the larger system. They work on their own copy of the repo and produce a handoff when done.

This mirrors how software teams operate — a staff engineer sets direction, leads break it into workstreams, individual contributors execute. Cursor notes the resemblance is emergent. Models were not explicitly trained for this structure; they converged on it because it works.

OpenAI uses a similar pattern: agents drive PRs to completion, request additional agent reviews, respond to feedback, and iterate in a loop until all reviewers are satisfied.

Paperclip takes the organizational metaphor literally. Agents are assigned titles, reporting lines, and job descriptions in a formal org chart. Work cascades from company mission to project goals to individual agent tasks — every task traces back to the company mission so agents know not just what to do but why. The human operates as a board of directors. This represents the furthest end of the autonomy spectrum: not just multi-agent coordination, but multi-agent organizations.

The single-agent systems are more mature in production. The multi-agent systems are more ambitious in scope. I think both patterns will coexist for a while — single agents for well-bounded tasks (migrations, bug fixes, test additions), multi-agent systems for larger coordinated efforts (new features, architectural changes, cross-cutting refactors).

Why this matters strategically

If a factory can produce working product changes from specs, tickets, and context, then features start to look less like rare assets and more like manufactured output. Useful output, still worth doing, but less defensible than before.

A factory also changes what teams need. It cannot run on personal preference. It needs versioned context, stable environments, explicit constraints, and verification that catches drift. This is why so many teams discover that individual AI productivity does not scale cleanly. Personal setups can be messy. Shared production systems cannot.

It changes what humans do too. When execution becomes easier to automate, responsibility shifts upward. Humans spend more time on problem selection, constraint definition, verification design, and deciding whether the output matters.

So this is not only an engineering tooling story. It is a product and operating model story too. And the operating model implications are the ones most organizations are slowest to absorb.

The blueprint is getting cheaper

A year ago, building this kind of system looked like a company-specific advantage. The orchestration layer required a dedicated platform team at Stripe or Spotify scale — serious infrastructure investment, maintained by full-time engineers who understood both the agent technology and the specific codebase it operated on.

Now the blueprint itself is getting standardized.

OpenAI's Symphony provides a complete specification — language-agnostic, with an Elixir reference implementation — for turning an issue tracker into an autonomous agent dispatch system. A daemon that continuously polls Linear for work, creates isolated workspaces, and runs coding agents against each ticket without human invocation. The README literally suggests: "Implement Symphony according to this spec." You can paste the spec into a coding agent and ask it to build you a factory. The factory building itself.

Paperclip goes further. It models entire autonomous companies with hierarchical org charts, goal cascading, budget controls, and governance. A single npx paperclipai onboard command. One deployment can run multiple autonomous organizations with complete data isolation between them. Per-agent monthly budget limits with automatic enforcement — when they hit the limit, they stop. An immutable append-only audit log where every tool call and decision is recorded.

Both projects are open-source. Apache 2.0, MIT licensed. They lower the barrier from "build an orchestration platform from scratch" to "configure and deploy an existing one."

That matters because it means the factory is not the moat.

This point is easy to miss. People see a background agent system and assume the advantage lies in having built one. For a short period, maybe it does. Stripe's Minions system has a dedicated "Leverage team" maintaining it. Spotify has a full-time team focused on their agent infrastructure. OpenAI allocated three engineers initially, growing to seven. Those investments produced real advantages — Stripe's thousand-plus PRs per week, Spotify's fleet migrations, OpenAI's million lines of code in five months.

But when the blueprint is a spec you can paste into an agent, the infrastructure barrier drops toward zero. First movers get a temporary edge. The advantage shifts elsewhere:

better judgment about what to build
better verification that catches what automated checks miss
better non-code assets — data, relationships, regulatory position
better operational knowledge about running the system well

That last one — operational knowledge, what some strategy frameworks call "process power" — is subtler than it sounds. The factory blueprint is open-source, but the judgment about how to operate it is not. Spotify's decision to limit tool access for predictability, Stripe's decision to expose 500 tools for flexibility, OpenAI's progressive disclosure pattern for repository knowledge — those choices reflect hard-won understanding of what works in their specific context. The spec alone does not transfer that.

The false summit

Here is the pattern that catches most organizations.

An organization rolls out agents. Pull requests increase. Demos look exciting. Everyone says the company is now AI-native. Then the deeper metrics barely move, because the team bolted a faster production method onto an unchanged operating model.

Ona/Gitpod identifies this directly: "You rolled out coding agents. Engineers are faster. PRs flood in. Yet, cycle time doesn't budge. DORA metrics are flat. The backlog grows. Because gains are compounding with the individual, not the organization."

That usually shows up in familiar ways:

docs nobody trusts, because they were never maintained for agents (or humans)
prompts carrying architectural rules that should live in linters and tests
humans still doing quality control by heroics — reviewing every PR manually because the verification infrastructure does not exist
PMs generating more work instead of better-directed work, because throughput is up and nobody recalibrated the intake process
engineering metrics improving while strategic position stays flat, because the team is shipping features faster in a world where features are commoditizing

This is the false summit. More output, same bottlenecks.

The real summit requires redesigning the system around the factory, not just plugging agents into the existing process. That means investing in verification infrastructure, context management, sandboxed execution, and — hardest of all — changing what humans spend their time on. The people who used to write code now need to design the environment in which agents write code. That is a different job with different skills, and most organizations have not made that transition explicit.

OpenAI puts it directly: "Building software still demands discipline, but the discipline shows up more in the scaffolding rather than the code." The scaffolding is the factory. The code is the factory's output. Confusing the two is how organizations reach the false summit and stay there.

I have seen a version of this in onboarding sessions. An engineer gets excited about agent output, ships a few impressive demos, and then stalls — because the surrounding system (docs, tests, architecture, review process) was not built to absorb that volume of output. The bottleneck was never writing code. It was everything around the code. The factory makes that bottleneck visible by flooding the downstream process with more volume than it was designed to handle.

Boundaries

Fully autonomous software companies are not already here.

Not every codebase should move to background agents tomorrow. Some environments are too risky, too regulated, or too poorly instrumented for that to be wise. The organizations profiled in this chapter are well-resourced engineering teams with existing developer experience infrastructure, strong CI pipelines, and institutional tolerance for experimentation. They are not typical.

Human review does not disappear. If anything, the value of high-quality review rises because humans are no longer spending as much of their scarce attention on raw implementation. Spotify's LLM-as-judge catches a quarter of bad output, but someone designed that judge, calibrated its thresholds, and decided what "within scope" means. The factory does not eliminate judgment. It concentrates it.

And the factory does not solve product judgment. A bad roadmap with better throughput is still a bad roadmap. Undirected production creates inventory, not value. The PM who responds to cheaper features by requesting more features has missed the point entirely.

A practical diagnostic

To tell whether your organization is still in the craft era or has started moving toward the factory era, ask five questions:

Can agents work meaningfully when no human is watching?
Do they run in isolated environments with the real toolchain?
Is the context they use versioned and reviewable?
Does verification block bad output before it reaches humans?
Are your best people spending more time on direction and review than on routine implementation?

If most answers are no, you are still improving craft tooling.

That is fine. It is just a different thing.

Do not confuse the two. Better autocomplete is not industrialization. Background production systems are.

The assembly line has arrived in software. The question is no longer whether the tools are impressive. The question is whether your organization is still arranged for the craft era while the factory is already running.

Chapter 5

Eight frameworks, one conclusion

The previous chapters described what's happening: features are commoditizing, the factory model is replacing traditional development, the human role is compressing to judgment. This chapter asks whether any of that holds up under scrutiny.

Hamilton Helmer spent a career cataloging what makes companies durable. He identified seven sources of power. I ran the software industrialization thesis against all seven, expecting to find that most of them crumble when AI collapses the cost of building software. Five of them survive. Two weaken. And the two that weaken are exactly where most product teams spend their roadmap.

That gap between where defensibility lives and where companies invest is the core of what I've been writing about for the past several months. But saying "features are commoditizing" is an assertion. Running it through Helmer, Porter, Wardley, Christensen, Thompson, Perez, McGrath, and Grove is a stress test. Eight canonical strategy frameworks, built over decades by people who've never heard of background agents or CLAUDE.md files. If the thesis only survived one of them, you could chalk it up to cherry-picking. It survived all eight.

Here's what that means.

The convergence

Each framework approaches the same question from a different angle. Thompson's Aggregation Theory explains how value migrates when supply gets commoditized. Wardley Mapping tracks how components evolve from novel to commodity along a predictable axis. Christensen's Disruption Theory describes what happens when a new production method enters an industry. Perez's Technological Revolutions places the current moment in a 250-year pattern of how economies absorb radical technological change. Porter's Five Forces maps the structural pressures reshaping industry profitability. Helmer's 7 Powers catalogs what makes a moat a moat. McGrath's Transient Advantage argues that durable moats are the exception, not the rule. Grove's Strategic Inflection Points identifies the signals that distinguish an incremental shift from a structural one.

Different thinkers, different decades, different intellectual traditions. They converge on the same conclusion: when the cost of building software approaches the cost of describing it, feature-based competition becomes structurally unwinnable. Not "harder" or "less effective." Structurally unwinnable. The way competing on typesetting speed became structurally unwinnable after desktop publishing.

What Wardley sees

The most direct connection is Wardley Mapping, because the thesis is a Wardley argument whether or not you draw the map.

Wardley's central claim is that all components in a value chain evolve through four stages: Genesis (novel, hand-built), Custom Built (growing, expensive), Product (stable, multi-provider), and Commodity (ubiquitous, utility). Supply and demand competition pushes everything rightward. You can slow the movement. You can't stop it.

Software features have been moving rightward for years. What AI did was stomp on the accelerator. When Stripe merges a thousand agent-written PRs a week, features aren't in the Product stage anymore. They're approaching Commodity. The cost structure collapsed, the replication timeline compressed from quarters to hours, and the differentiation window is closing behind it.

Wardley also identifies a pattern called "efficiency enables innovation": when a component commoditizes, it creates a platform for new higher-order activity. Commodity computing enabled SaaS. Commodity SaaS is now enabling something above it. The thesis calls that something "factory infrastructure" -- the orchestration layers, verification systems, and context engineering that direct agents toward the right problems. In Wardley's terms, factory infrastructure sits at the Genesis end of the evolution axis while feature production slides toward Commodity. The value moved up the stack.

If you drew the map -- features, code generation, orchestration, verification, context engineering, judgment, each placed on the evolution axis -- you'd see all five predictions as one coherent rightward movement. That map might be the single most useful visual artifact for the thesis, because it replaces assertion with spatial logic. Where a verbal argument can feel slippery ("everything commoditizes eventually, so what?"), a Wardley map pins each component to a specific position and forces you to confront the relationships between them. Factory infrastructure depends on commoditized code generation the same way SaaS depended on commoditized cloud compute. The dependency chain makes the claim falsifiable: if code generation stalls at the Product stage instead of reaching Commodity, the factory model doesn't fully arrive. But everything we can observe -- pricing trends, capability curves, competitive behavior among model providers -- points rightward.

What Helmer reveals

Helmer's 7 Powers is a taxonomy of defensibility. Each power has a benefit (how it creates superior economics) and a barrier (what prevents competitors from neutralizing it). Walk all seven against the thesis and the picture is specific:

Scale Economies collapse. When a three-person team with factory infrastructure matches the output of fifty engineers, volume-based cost advantages in development evaporate. The barrier side breaks too -- the factory blueprint is open-source.

Network Economies survive intact. You can't point a factory at a social graph. Spotify's 600 million listening histories, Meta's two billion daily active users -- these compound in ways code replication can't touch.

Counter-Positioning becomes the critical transitional power. AI-native companies building with factory infrastructure while incumbents optimize per-seat SaaS models -- that's textbook Helmer. The incumbent can't adopt the factory model without cannibalizing seat-based revenue. But the window is narrow. Once the factory blueprint commoditizes, counter-positioning evaporates because everyone can adopt the model.

Switching Costs bifurcate. Deep enterprise integration survives. Ripping out Salesforce costs more in organizational disruption than any AI-native replacement could save. But surface-level product switching costs vanish when an agent can rebuild an app's functionality in a weekend.

Branding weakens at the feature layer. You can't build brand loyalty to a feature that six competitors clone by Friday. Branding survives at the entity level -- trust in Apple's privacy posture, Stripe's developer experience philosophy -- but it detaches from features and reattaches to values.

Cornered Resource stays potent. Proprietary data, regulatory positions, physical infrastructure. Google's search ranking corpus. TSMC's fabs. The defensibility matrix maps cleanly here.

Process Power gets interesting. The factory makes process explicit -- CLAUDE.md files, verification gates, orchestration specs. In theory, this should weaken Process Power by making it copyable. In practice, the organizations running sophisticated factories first are accumulating operational knowledge that the open-source spec doesn't transfer. The blueprint is commodity. The judgment about how to run it is not.

Five of seven powers survive or strengthen when production costs collapse. Most companies invest their roadmaps in the two that weaken. Helmer gives that strategic error a taxonomy. The thesis gives it a mechanism.

Where Christensen breaks

Not everything maps cleanly. Disruption Theory is the framework with the most productive tension, and I think being honest about where a framework doesn't fit matters more than forcing a clean narrative.

Christensen's core insight is that incumbents fail because a cheaper, simpler product enters at the low end of the market and improves along an S-curve until it's good enough for the mainstream. The incumbent ignores the threat because the disruptor's initial market is unattractive.

Background agents don't follow this path. They didn't enter at the low end with an inferior product for underserved customers. Copilot shipped inside VS Code. Claude Code runs in the same terminal senior engineers use. The entry point was the mainstream market, not a segment below it. There's no period of rational incumbent blindness -- everyone can see what's happening.

What's actually happening is incumbent paralysis, not incumbent blindness. Organizations see the shift and can't reorganize fast enough. Their hiring, incentives, culture, and career ladders assume features are hard to build. The value network is optimized for a world that's disappearing. That part of Christensen maps perfectly. The market-entry mechanics don't.

The zero-marginal-cost dynamic also breaks the framework's economic logic. Classic disruption assumes the entrant has a different cost structure that lets it profit in markets the incumbent finds unattractive. Background agents don't have a "different cost structure." They approach zero marginal cost for production. When building a feature costs about as much as describing it, the supply-and-demand segmentation that disruption theory models stops applying. There's no price gradient left to exploit -- no low-end market where the disruptor can grow undisturbed, because the cost floor for everyone dropped to roughly the same level simultaneously.

Where Christensen still cuts deep is the value network concept. Companies measuring feature velocity, sprint throughput, and DORA metrics while the competitive axis shifts to non-code defensibility. That's measuring success by the criteria of the world being displaced. Every organization celebrating its shipping speed while ignoring its moat position is living the innovator's dilemma in real time. They just got there through a different door than Christensen's model predicted.

The turning point

Carlota Perez's framework provides historical grounding that the other frameworks can't. She's identified five great technological revolutions since the 1770s, and each one followed the same two-act structure. Installation: financial capital floods in, speculative infrastructure gets overbuilt, a bubble forms and pops. Deployment: production capital takes over, the infrastructure becomes cheap and ubiquitous, and the economy reorganizes around the new paradigm.

Between the two acts sits a turning point. Financial crash, institutional recomposition, and the beginning of broad-based adoption.

The thesis prediction that the factory blueprint commoditizes maps directly to Perez's turning point. When Symphony and Paperclip make the orchestration layer an npx command, that's the moment infrastructure built during the frenzy becomes cheap enough for production capital to absorb. The parallel to overbuilt railway track is structural: speculative investment in AI (the models, the tooling, the orchestration experiments) is creating the substrate on which a reorganized software economy will run.

Perez also raises a question the thesis currently underplays: is AI-driven software industrialization a new revolution, or the deployment phase of the ICT revolution that began in 1971? If it's the latter, the implication shifts. The dot-com crash was the frenzy peak. The current AI buildout is the beginning of deployment, where fifty-five years of installation infrastructure finally reorganizes the real economy. And if that's right, the venture-funded AI startup model is an installation-era artifact. The winners in deployment eras aren't the startups -- they're the incumbents who integrate the now-cheap infrastructure into existing operations.

Either way, Perez adds something the other frameworks miss: the institutional dimension. The thesis describes the factory. But factories need labor relations, quality standards, liability frameworks, and governance structures. If agents write the code, who's liable for defects? How do employment contracts change? What happens to open-source norms when a factory can consume and replicate any public codebase in hours? Perez's framework insists that deployment requires institutional recomposition, and that this takes a decade even after the technology is ready.

The value chain inverts

Porter's Five Forces and Value Chain analysis produced the most operationally useful insight.

In the traditional software value chain, operations -- writing code -- is the dominant primary activity. It absorbs the most resources and creates the most differentiation. The factory model demotes it to commodity logistics. What were support activities become primary. Context engineering moves to the center of value creation. Verification design becomes the activity that separates defensible output from copyable output. Human resource management transforms from "hire engineers who write code" to "hire people who exercise judgment about what agents should build and whether the output is good enough."

This inversion is a diagnostic. Look at where a company invests. If they're optimizing developer velocity and code output, they're optimizing an activity that's already sliding toward commodity. If they're investing in context engineering, verification infrastructure, and judgment capacity, they're building the new primary activities. The first group is running a better factory for the wrong product. The second group is designing the factory itself.

Porter's Five Forces are all moving at once. Barriers to entry are collapsing (a motivated engineer with a weekend). Buyer power is increasing (your customer's alternative is now "build it ourselves with the same tools"). The threat of substitutes has a new category -- the customer builds their own replacement. Supplier power is concentrating upstream in the model providers (few alternatives, high switching costs, no credible backward integration). Rivalry is reshaping around who can direct the factory at the right problems fastest, not who can build faster.

The supplier power concentration is probably the most underappreciated force in the set. When three or four model providers control the substrate on which every software factory runs, and switching between them requires re-engineering your context layer, those providers hold structural power that the rest of the industry hasn't fully priced in. Porter would recognize the shape immediately: a concentrated supplier base selling to a fragmented buyer market with high switching costs and no credible threat of backward integration. The model providers are the new Intel, and most of the software industry is in the position of PC OEMs circa 1995 -- dependent on a component they can't build themselves.

The acceleration and the transient edge

Two frameworks remain that I haven't given dedicated sections: Grove's Strategic Inflection Points and McGrath's Transient Advantage. They both address the same question from opposite sides -- Grove asks how to detect the shift, McGrath asks what happens after you've detected it.

Grove's 10X force test is the simplest diagnostic in the set. When a single factor in your competitive environment changes by an order of magnitude, you're at a strategic inflection point. The cost of producing a working software feature has dropped by more than 10X for organizations that have adopted factory infrastructure. By Grove's own standard, this qualifies. But Grove also assumes that navigating the inflection point creates a new period of stability -- that you cross the valley, find solid ground on the other side, and build from there.

The thesis suggests something less comfortable. The factory blueprint commoditizes fast enough that there may not be a stable post-inflection position. First movers get a temporary edge, then the infrastructure becomes available to everyone, and the advantage dissipates. This is where McGrath picks up what Grove puts down. Her framework was built for exactly this pattern: there is no stable endpoint, only a portfolio of advantages managed through continuous reconfiguration. The companies that survive aren't the ones that find the right position and defend it. They're the ones that treat every position as temporary and invest in the organizational capacity to move.

McGrath's framework also explains something the other seven don't: why smart companies fail to respond even when they can see the shift coming. It's not ignorance. It's that their exploitation machinery -- the teams, processes, metrics, and incentive structures optimized for the current advantage -- actively resists the transition. The organization is optimized for where value was, not where value is going. Grove tells you to detect the inflection point. McGrath tells you that detection is the easy part.

What's missing from the thesis

Running eight frameworks against the predictions didn't just validate them. It surfaced three gaps -- places where the thesis, as stated in the previous chapters, leaves important ground uncovered.

Process Power needs to be named. The defensibility matrix identifies four durable categories: networks, data flywheels, relationships, and physical infrastructure. But the framework analyses revealed a fifth: the accumulated organizational knowledge of how to run the factory well. Stripe's operational sophistication isn't captured in the open-source spec. This is Helmer's Process Power applied to the production system itself, and it deserves explicit recognition even though it's more fragile than the other four.

Why more fragile? Because Process Power in traditional industries -- Toyota's production system, TSMC's yield engineering -- is embedded in decades of tacit knowledge, supplier relationships, and organizational muscle memory. Factory-era Process Power is younger and more legible. The specs are written down. The orchestration patterns are open-source. What isn't transferable is the judgment layer: knowing which verification gates matter for your domain, knowing when to override the factory's defaults, knowing how to structure context so agents produce output that fits your product's specific quality bar. That judgment accumulates through operation, not through reading the blueprint. But it accumulates faster than traditional Process Power did, and it's probably easier to replicate. Calling it a moat is generous. Calling it a transitional advantage is more honest -- significant during the period when most organizations haven't built factories at all, diminishing as factory operation becomes a standard organizational competency.

Healthy disengagement is the missing prescription. The thesis tells companies what to do: invest in non-code moats. It doesn't address why they can't. Rita McGrath's Transient Advantage framework names the organizational problem -- teams, incentives, roadmaps, and identities are organized around feature delivery. Letting go of that is harder than adopting the factory model.

The diagnosis is "features commoditize." The prescription is "invest elsewhere." The bridge between them -- the organizational discipline of recognizing when a position is eroding and redeploying resources before it becomes a liability -- is what McGrath calls healthy disengagement, and the thesis needs it. This isn't just a strategy problem. It's an identity problem. Product managers who've built careers on shipping features, engineers whose status comes from code contributions, design teams whose value is measured in screens shipped -- all of them face a world where those activities generate less defensible value every quarter. The organizational challenge isn't knowing what to do. It's letting go of what you were. McGrath argues that companies capable of healthy disengagement share a common trait: they treat advantages as having lifecycles, not as permanent positions. They budget for exit from current activities the way they budget for entry into new ones. Most software companies, built during an era when features were hard and shipping speed was a competitive advantage, have no muscle for this. They'll need to develop it, and the ones that develop it first will reallocate resources to non-code moats while their competitors are still optimizing sprint velocity.

The speed of evolution may break traditional strategy timing. Wardley's evolution axis unfolds over decades. McGrath's advantage lifecycles are measured in years. Grove's strategic inflection points give companies months to respond. The thesis describes feature commoditization happening in days. If evolution speed is itself accelerating, the window for strategic response may be too narrow for the deliberate processes these frameworks prescribe.

This is probably the most uncomfortable implication of the analysis. Every framework here assumes that organizations have some window -- however narrow -- to detect a shift, deliberate about it, and respond. Grove gives you months. McGrath gives you quarters. Wardley gives you years to watch a component slide rightward before it reaches commodity. But when a competitor can replicate your latest feature release over a weekend using the same tools available to everyone, the detection-to-response window collapses toward zero. The frameworks still describe the dynamics correctly -- they just may not leave enough time for the responses they prescribe. Strategy itself might need to operate at a different clock speed, closer to operational tempo than annual planning cycles. None of the eight frameworks fully accounts for this, and I'm not sure anyone has a good answer yet.

So what

The convergence across eight frameworks eliminates one category of objection entirely. "Features are commoditizing" is not an assertion anymore. It's a structural claim supported by demand-side economics (Thompson), evolutionary dynamics (Wardley), moat taxonomy (Helmer), production method analysis (Christensen), historical pattern matching (Perez), industry structure analysis (Porter), advantage lifecycle theory (McGrath), and inflection point detection (Grove).

The strategic question sharpens: what do you have that can't be rebuilt in a quarter?

If you're a product leader, the audit is specific. Map your assets to Helmer's seven powers -- which ones do you actually have? Run your value chain through Porter's inversion -- are your investments in the old primary activities or the new ones? Place your key components on Wardley's evolution axis -- are you optimizing something that's already moved to commodity?

If you're an individual contributor, the question is where your judgment creates the most value. The factory doesn't eliminate human work. It compresses it to the parts machines can't do yet: problem selection, constraint definition, drift correction, and taste. The engineers thriving in this model think about the production system, not the code.

If you're a founder, the counter-positioning window is real but closing. The moment the factory blueprint commoditizes -- and it's commoditizing now -- the advantage of having built one evaporates. What remains is the non-code moat. Networks, data, relationships, operational knowledge, atoms. Everything that isn't a prompt.

Eight frameworks built over six decades by people who studied railroads, steel mills, semiconductor fabs, and internet platforms. All of them looking at what AI is doing to software and reaching the same conclusion. Features are the least defensible investment a software company can make. The interesting question is what you're going to do about it.

Chapter 6

The Defensibility Inversion

The previous chapters diagnosed a shift and then stress-tested it. We looked at what happens when features commoditize, when teams reorganize around factory infrastructure, when the human role compresses to judgment. We ran those claims against eight canonical strategy frameworks — Thompson, Wardley, Helmer, Christensen, Perez, Porter, McGrath, Grove — and none of them broke. Five independent intellectual traditions converged on the same predictions.

This chapter turns prescriptive. The question stops being "is this happening?" and becomes "what do you actually do about it?"

Stripe merges a thousand agent-written pull requests a week. Ramp attributes 30% of all merged PRs to its background agent. OpenAI built an entire internal product — three engineers, a million lines of code, five months, zero manually-written code. Spotify reports 60-90% time savings on migrations across hundreds of repos.

If you're an engineering or product leader at a software company, you've probably presented similar numbers to your own board. Maybe not at that scale, but the shape is the same: we adopted AI tooling, our teams are shipping faster, here's the graph going up. The board is excited. Your CEO wants to double down.

Here's the problem. Every one of your competitors has access to the same tooling. The orchestration infrastructure that took Stripe a dedicated platform team to build can now be scaffolded from a spec file or a single terminal command. What required months of platform engineering is becoming an afternoon project. Your 40% improvement in time-to-merge is impressive — and reproducible by anyone with a weekend.

You're getting dramatically faster at producing the thing that matters least.

The wrong question

The question most leadership teams are asking is "how do we use AI to ship faster?" It's the obvious question. It's also the wrong one. Shipping faster only matters if what you're shipping is defensible. When the cost of building features approaches the cost of describing them, speed of production stops being a moat. Everyone has it.

The right question — the one worth bringing to your next board meeting — is: what do we have that can't be rebuilt in a quarter?

The previous chapter ran that claim through eight canonical strategy frameworks. They all agreed. This chapter translates the convergence into specific changes — to budgets, metrics, hiring, and what you stop counting as strategic investment.

Implication 1: your value chain is upside down

In the traditional software value chain, writing code is the primary activity. It's where the budget goes, where the headcount lives, where the differentiation supposedly happens. Everything else — documentation, testing infrastructure, architectural standards — is a support function. Important but secondary.

The factory model flips this. Code production is becoming commodity logistics. The activities that were support functions are becoming primary. The quality of your documentation determines whether agents produce value or garbage. Your verification infrastructure separates defensible output from copyable output. Problem selection — deciding what to aim the factory at — is now the highest-leverage activity in the entire chain.

This has immediate consequences for your next budget cycle.

Look at where your engineering spend goes. If 80% is allocated to developers writing code and 20% to everything around it, you have the ratio backwards. The organizations getting this right are investing in three areas: context infrastructure (versioned documentation that agents can navigate, enforced mechanically rather than by hope), verification systems (quality gates that catch whether the output is actually useful, not just whether it compiles), and judgment capacity (people whose job is to decide what to build and whether the result is good enough).

The practical test: take your current engineering org chart and draw a line between "people who produce code" and "people who decide what code to produce and whether it's any good." If the first group is five times larger than the second, you're staffed for a craft era that's ending. The organizations I've watched navigate this transition successfully didn't hire fewer engineers — they shifted existing engineers toward context engineering, verification design, and problem specification. The code still gets written. It just isn't the bottleneck anymore, and staffing it like a bottleneck is an expensive mistake.

The diagnostic you can run today: what would your engineering org look like if code production were free? Whatever you'd keep is where the value lives. Whatever you'd cut is where you're currently over-invested.

Implication 2: the counter-positioning window is real and closing

Hamilton Helmer, who wrote the canonical taxonomy of business moats, describes counter-positioning as a specific competitive dynamic: a newcomer adopts a model the incumbent can't match without damaging its existing business. If you run a per-seat SaaS product, this is about you. AI-native companies building with factory infrastructure can deliver comparable functionality at a fraction of your cost. You can't adopt their model and pass the efficiency gains to customers without cannibalizing your own seat-based revenue. You're structurally trapped.

This is the best window for new entrants in a decade. A small team with domain expertise and factory tooling can build a competitive product in weeks, not years. Incumbents see it happening and can't respond without restructuring their entire business model.

But the window has an expiration date. Once agent orchestration infrastructure finishes commoditizing — and it's well underway, with open-source frameworks turning the whole stack into an afternoon project — counter-positioning evaporates. Everyone can adopt the model. The advantage of being AI-native disappears when being AI-native is the default.

If you're building a new company, the clock is running. Your structural advantage isn't the factory itself. It's whatever you build with it that doesn't depend on the factory being rare. Network effects you accumulate during the window. Proprietary data your early users generate. Workflow embeddedness that creates switching costs. The factory gets you in the door. What you do once you're inside is what keeps you there.

The specific playbook looks something like this: launch fast, acquire users while incumbents are paralyzed by the cannibal math, and use the usage data from those early customers to build a data flywheel that late entrants can't bootstrap. Crosby, the AI-native NDA company, didn't just build a contract tool — they built a system that gets smarter about contract quality with every document it processes. By the time a competitor scaffolds a similar product with the same open-source factory tools, Crosby has thousands of real-world contracts' worth of learned judgment. The factory was the entry mechanism. The accumulated data is the moat.

If you're an incumbent, the trap has a specific escape route: don't try to preserve the old model. Cannibalize yourself before someone else does. Netflix killed its own DVD business. Apple cannibalized the iPod with the iPhone. The per-seat SaaS companies that survive will be the ones that restructure pricing around value delivered rather than humans served, even though that means near-term revenue compression. If this sounds painful, consider the alternative.

Implication 3: you need a disengagement plan for features

This is the implication organizations find hardest to act on. Their entire operating structure resists it.

Rita McGrath, who studies how competitive advantages erode, calls it healthy disengagement: the organizational discipline of recognizing when a position is weakening and redeploying resources before it becomes a liability. Most companies can't do this with feature work because their teams, incentives, roadmaps, career ladders, and identities are organized around feature delivery. Your PM's title is "product manager," and the product is features. Your engineer's performance review measures code shipped. Your roadmap presentation to the board is a list of features.

Telling your organization to invest in non-code moats is like telling someone to sell their house while they're standing in the living room. They understand the logic. They can't do it because they live there.

The practical version of disengagement isn't "stop building features." It's shifting the ratio. Start tracking what percentage of engineering investment goes to feature work versus context infrastructure, verification systems, data flywheel acceleration, and relationship depth. Set a target for shifting that ratio over the next four quarters. Make it visible at the leadership level — on the same slide where you show your shipping velocity.

What does this look like in a real planning cycle? Say you're allocating Q3 engineering capacity. Historically, 85% goes to feature delivery and 15% to infrastructure, tooling, and documentation. The disengagement move is committing to 70/30 this quarter, 60/40 next quarter, and tracking the ratio publicly. The features still get built — the factory handles that with fewer people. The freed capacity goes to verification systems that catch quality issues before users do, context infrastructure that makes agent output reliable instead of hopeful, and data pipelines that turn user behavior into compounding product intelligence.

The specific test you can apply to your own roadmap: for every feature, ask whether a competitor could replicate it in a sprint with factory tooling. If yes, you're investing in a depreciating asset. That doesn't mean don't build it — users still need features. It means don't count it as a strategic investment. Count it as maintenance. Reserve the word "strategic" for things that compound or resist replication.

Implication 4: the measurement stack is wrong

Engineering dashboards still measure what was valuable in the craft era: developer velocity, sprint throughput, cycle time, deployment frequency. All of them answer the question "how efficiently are our humans producing code?" — which was the right question when humans producing code was the bottleneck.

The factory model makes code production the solved problem. The bottleneck moves to judgment: are we aimed at the right problems, and is the output good enough?

The metrics that matter now are different, and most organizations aren't tracking any of them:

Verification coverage: what percentage of agent output gets meaningful quality checks? Not test coverage in the traditional sense — coverage of the judgment surface. Are humans reviewing the things that require human judgment and letting automation handle the things that don't? The operational version of this metric: track the ratio of agent PRs that get rubber-stamped versus genuinely reviewed. If 95% are rubber-stamped, your verification system is either excellent (everything below the judgment threshold is handled automatically) or nonexistent (nobody's actually checking). You need to know which one.

Context freshness: how stale is your organizational knowledge? Spotify's internal engineering research found that architectural decisions in large codebases decay at roughly 23% every two months. What's your system for catching that decay before it causes incidents? The leading indicator here is agent failure rate on tasks that used to succeed — when agents start producing worse output in areas where they previously worked fine, your context has probably drifted.

Moat investment ratio: what percentage of your engineering budget goes to defensible assets versus replicable features? If you charted this number quarterly, which direction is the line going? If it's flat or declining, your AI adoption is making you faster at the wrong work.

Factory direction accuracy: of the things your agents produce, what percentage solves a problem users actually have? Undirected production creates inventory, not value. This is the PM's core metric in the factory model — and most PMs aren't measuring it. The specific failure mode I've seen repeatedly: teams celebrate shipping velocity while customer satisfaction stays flat or declines. The factory is producing at record speed. It's just producing the wrong things.

None of these show up in standard engineering dashboards. Building them is itself a strategic investment.

Implication 5: the hiring profile changes

The craft model hired for execution ability. Can this person write code? How fast? How clean? The factory model hires for judgment ability. Can this person decide what to build? Can they review agent output and know whether it's right? Can they design verification criteria that maintain quality at scale?

Your current interview process probably selects for the wrong thing. A take-home coding challenge tells you whether someone can write code. It tells you nothing about whether they can evaluate code they didn't write, specify intent precisely enough that an agent produces the right thing, or recognize when a technically correct solution solves the wrong problem.

The new interview should look more like: here's a set of agent-generated PRs. Which ones would you merge? Why? Here's a product problem and a factory that can build anything. What do you aim it at? How do you know if the output is valuable?

For PMs specifically, the shift is from "can you prioritize a backlog?" to "can you design a factory's objectives?" The PM who thinks in features is working in the least defensible quadrant. The PM who thinks in problems, verification criteria, and compounding advantages is designing the production system. When you're filling your next PM role, the second person is who you want.

There's a subtler point here about team composition. In the craft era, you wanted a mix of senior and junior engineers — seniors for architecture and mentorship, juniors for execution volume. In the factory model, the execution volume comes from agents. What you need from humans shifts toward judgment density: people who can look at agent output and quickly determine whether it's correct, whether it solves the right problem, and whether it introduces risks the agent can't see. That's predominantly a function of experience and domain knowledge, not raw coding speed. Your hiring pipeline should reflect that. The engineer who's spent ten years in your domain and can review fifty agent PRs a day with high accuracy is probably more valuable than the engineer who can write beautiful code from scratch but has never seen your problem space.

Implication 6: the institutional gap is a decade wide

Carlota Perez, the economic historian who mapped how economies absorb radical technological change, adds a dimension most technology commentary skips: the institutional one. Every major technology transition required not just new tools but new regulations, labor norms, and governance structures. Railway technology was mature by the 1840s. The deployment golden age didn't arrive until the 1850s-60s, after new corporate law, limited liability rules, and labor regulation caught up.

The factory blueprint is technically ready now. The institutional infrastructure around it is not. Who is liable when agent-written code causes a production incident? How do employment contracts account for a role that's 80% judgment and 20% execution? What happens to open-source licensing when a factory can consume and replicate any public codebase in hours? How do compliance frameworks designed for human-authored code apply to agent-authored code?

If you operate in a regulated industry — healthcare, finance, defense — these aren't abstract questions. They're blockers. The companies that figure out the institutional layer first — that build the compliance frameworks, the liability models, the governance structures — will have a head start measured in years, not sprints. This is one of those areas where boring, unsexy work creates durable advantage. If you can solve the compliance problem for agent-written code in your vertical, you have a moat that no amount of shipping velocity can replicate.

The practical implication: if you're in a regulated industry, your AI strategy should probably spend as much time on governance as on tooling. The company that builds the first credible audit trail for agent-written code in healthcare — one that satisfies regulators, not just engineers — has built something that can't be replicated by scaffolding a factory from a spec file. The same applies to financial services firms that solve the liability question for agent-generated trading logic, or defense contractors that establish the provenance chain for agent-written mission-critical software. These aren't glamorous problems. They're the kind of problems that create ten-year moats precisely because nobody wants to work on them.

The uncomfortable timeline

Here's what I'm most uncertain about: how fast all of this happens.

The strategy frameworks I studied typically model transitions over years or decades. Wardley maps how technology components evolve from novel to commodity across a generation. McGrath models how competitive advantages erode over quarters. Grove's work on strategic inflection points gives companies months to respond.

The evidence suggests feature commoditization is happening in days. A startup ships Monday; clones exist by lunch. If that pace holds — and the data from Stripe, Spotify, and OpenAI says it will — then the window for acting on these implications is narrower than any of the frameworks predict.

The companies that look back on this period and say "we moved too slowly" will be the ones that adopted AI enthusiastically, aimed it at feature delivery, celebrated the velocity improvements, and woke up one morning to discover that every competitor had the same velocity and the real game had moved somewhere else entirely.

What do you have that can't be rebuilt in a quarter? And if the honest answer is "just code," how fast can you start building something that isn't?

Chapter 7

The Work

This book has been tracking a single force as it moves through layers. Features commoditize. Teams reorganize around factory infrastructure. The individual role compresses to judgment. The previous chapter laid out six concrete implications — changes to budgets, metrics, hiring, and strategy that follow from the industrialization thesis. Those implications all assume that software companies are still, fundamentally, software companies. They sell the tool.

This chapter asks what happens when that assumption breaks.

Stripe merges a thousand agent-written PRs a week. But Stripe sells payments, not pull requests. The factory's output isn't the product. The product is the work those PRs enable: moving money, reconciling accounts, catching fraud. The pull request is overhead.

Julien Bek at Sequoia published a piece arguing that the next trillion-dollar company will be a software company masquerading as a services firm. Not selling the tool. Selling the work. I've been writing about what happens when code gets cheap. Bek is writing about what happens next: the work itself gets cheap enough to sell directly.

The factory makes the wrong thing

But software was always a proxy. Nobody buys Salesforce because they love databases. They buy it because managing customer relationships by hand is slow and error-prone. The software automates the work. What happens when AI can do the work without the software?

That's the next domino. The factory stops producing software-as-product and starts doing the work that software was built to enable. The production line turns outward.

The pattern is visible if you squint at the last twenty years of SaaS. Every successful software company was really an arbitrage on human labor — taking something that required people (bookkeeping, customer management, project tracking, design iteration) and replacing it with a tool that made fewer people necessary. The tool was never the point. The reduced headcount was the point. AI removes a step from the arbitrage: instead of selling a tool that makes the work cheaper, sell the completed work directly.

Copilots and autopilots

Bek draws a clean line. A copilot sells the tool. An autopilot sells the work.

Harvey sells AI to law firms. The lawyers use it, the lawyers take responsibility, the lawyers bill the client. That's a copilot. Crosby drafts NDAs for the company that needs them, not for outside counsel. WithCoverage sells insurance to the CFO, not to the broker. The professional is no longer the customer. The professional is the cost being removed.

For every dollar companies spend on software tools, they spend six on services — on the people doing the work the tools were supposed to make easier. Copilots capture the tool budget. Autopilots capture the work budget.

The operational difference between copilots and autopilots is sharper than it sounds. A copilot augments a human workflow: the lawyer still reads the brief, the engineer still reviews the PR, the accountant still signs off on the reconciliation. The tool makes them faster, but the human remains in the loop and owns the outcome. An autopilot removes the loop. The customer specifies what they want — "close the books," "draft the NDA," "find me a policy that covers this risk profile" — and the system delivers a finished result. No professional in the middle. No tool the customer needs to learn.

This distinction maps directly onto the business model. Copilots price like software: per seat, per user, per month. Autopilots price like services: per outcome, per transaction, per deliverable. The economics are different by an order of magnitude. If you're selling a $50/month copilot seat to a law firm, you're competing for the tool budget. If you're selling completed NDA drafts at $200 each, you're competing for the work budget — which is six times larger.

I've spent time onboarding people to AI tools across two companies. That's copilot work. I teach someone to use Claude Code, they get faster, they own the output. The autopilot is what happens when the customer never touches the tool. They're buying the outcome.

The wedge is outsourcing. If a task is already outsourced, the buyer has already accepted three things: the work can be done externally, there's a budget line for it, and they're purchasing an outcome. Replacing an outsourced vendor with an AI-native service is a vendor swap. Replacing headcount is a reorg. Vendor swaps close in weeks. Reorgs take quarters and require executive sponsorship. That's where autopilots start — not by disrupting employment, but by eating outsourcing contracts. The politics are simpler. Nobody inside the company loses their job. A line item on a vendor spreadsheet just gets cheaper.

The addressable markets are in the hundreds of billions — insurance brokerage, accounting, healthcare billing, IT managed services — and those are just the categories where the outsourcing wedge already exists.

Intelligence and judgment

Bek splits work into intelligence and judgment. Intelligence is rule-following at scale: translating a spec into code, coding clinical notes into ICD-10 categories, filling out insurance applications. The rules are complex, but they are rules. Judgment is knowing which spec to write, which diagnosis matters, whether the policy actually fits the client's risk profile. Experience and taste built over years.

This maps onto the defensibility matrix from Chapter 1 with uncomfortable precision. Intelligence work lives in the upper-left quadrant: digital, copyable, commodity. It's the same quadrant where features live. Judgment work lives on the right side: built on accumulated experience, embedded in relationships, resistant to replication. The stuff you can't paste into a prompt.

The parallel is worth pausing on. Chapter 1 argued that features are indefensible because they're digital and copyable — anyone with factory tooling can rebuild them. Intelligence work is indefensible for exactly the same reason. The rules, however complex, can be encoded. An agent that processes insurance applications is doing intelligence work: following elaborate rules at speed and scale. That work is valuable, but it's commodity valuable. Any competitor with the same model access and domain rules can replicate it. The defensible part — the judgment about whether the application makes sense for this client, this risk profile, this market condition — sits in the right-hand column of the matrix. It's accumulated, embedded, and hard to copy.

The higher the intelligence ratio in any field, the sooner autopilots win. Software engineering got there first. Over half of all professional AI tool usage is in software development. Every other category is still in single digits. That's why the factory exists in our industry before it exists anywhere else.

But here's the complication I keep running into in practice. The factory is excellent at intelligence work. An agent can write correct code, pass tests, ship a PR. The gap between "agent writes correct code" and "agent knows what the customer actually needs" is the judgment gap.

The blank input problem

I've seen this gap in every onboarding session I've run. The primary blocker is never model quality or cost. It's what I've started calling the blank input problem: people stare at an empty chat and freeze. They don't know what to ask.

A PM with access to the best tools in the world sits there with a blinking cursor because specifying intent — knowing what you actually want, clearly enough for a machine to do it — is judgment work. I watched a Center of Excellence with 65 people at a major enterprise where everyone had tool access and nobody knew what to hand off. The tools can do the work. The bottleneck is knowing what work to hand them.

This is more revealing than it appears. We tend to assume that the hard part of work is execution — the doing. The blank input problem exposes that a significant portion of what professionals do isn't executing tasks but figuring out what the tasks should be. A lawyer doesn't just draft contracts; she decides which clauses matter for this deal, this counterparty, this jurisdiction. An engineer doesn't just write code; he determines which technical approach fits the constraints he's learned to see through years of watching systems fail. Strip away the execution and what's left is a person staring at a blank prompt, realizing that the execution was the easy part all along.

The blank input problem also reveals something about how professional expertise actually works. Most experts can't articulate what they know — their judgment is embedded in thousands of micro-decisions they've internalized to the point of automaticity. Ask a senior engineer why she chose one database over another and she'll give you a reason. Ask her to enumerate all the factors she weighed, in order, with confidence levels — the kind of specification an agent would need — and she stalls. The knowledge is real. It's just not in a format that transfers to a prompt.

Internally, that's an onboarding problem — you can train people to decompose their work into handoff-ready chunks. Externally, it's an existential one. The autopilot company has to solve the blank input problem for every customer in their domain. They have to encode enough judgment about what "good" looks like that the customer can say "close the books" and walk away. That's a harder problem than building the agent.

And it's the problem where the defensibility actually lives. The autopilot that can accept "close the books" as a complete input has encoded a massive amount of domain judgment — what accounts to reconcile, which discrepancies matter, what the regulators expect, how this company's chart of accounts differs from the standard template. That encoded judgment is the moat. It took years of accumulated data and operational experience to build. A competitor can clone the agent infrastructure in a weekend. They can't clone the judgment layer.

Today's judgment becomes tomorrow's intelligence. As autopilots accumulate proprietary data about what good work looks like in their domain, they build exactly the data flywheel I described in the first chapter. Crosby processes thousands of NDAs. Each one teaches the system what "good" looks like for that contract type, that industry, that jurisdiction. The autopilot doesn't become defensible through its code. Code is commodity. It becomes defensible through accumulated judgment-data, the same way Google's search quality compounds through usage data. The right-hand column of the matrix, built one transaction at a time.

What the factory changes

This series tracked what AI does to products, teams, individuals, and production. The business model is the last domino.

The factory doesn't just change how software gets built. It changes what software companies are. The trillion-dollar outcome Sequoia is betting on is a software company that looks like a services firm: AI-native infrastructure doing the work that armies of professionals used to do, priced on outcomes, not seats. This book explains why the timing works. The factory makes code free. Free code makes features free. Free features make the tool layer commoditize. And when the tool layer commoditizes, the value migrates to the work the tools were supposed to enable.

The copilot companies face a classic innovator's dilemma during this migration. Harvey sells to lawyers. Rogo sells to investment banks. Their customers are the professionals. To become an autopilot means selling the work directly to the company, which means the professional — your current customer — is now your competitor. You're cutting them out. The fastest-growing AI companies in 2025 were copilots. Many are now trying to become autopilots. They have the product knowledge and the customer relationships. But the transition means cannibalizing the people who pay you.

The professionals using copilots are the steersmen of their industries — still steering, still reviewing, still signing off. The autopilot is what comes after the steersman. And the steersman, understandably, doesn't want to hear that. If you're Harvey and your best customers are the lawyers who adopted your tool earliest, telling them "actually, we're going to sell legal work directly to your clients" is a conversation that ends the relationship. The pure-play autopilots skip this conversation entirely because they never had it to begin with.

That's the opening for pure-play autopilots — companies that start by selling the work from day one, without a copilot customer base to protect. Crosby didn't start by selling to law firms and then pivot. They went straight to the company. WithCoverage went straight to the CFO. They don't face the dilemma because they never had the professional as a customer in the first place.

The uncertainty is the judgment question. Bek's entire framework rests on the intelligence/judgment split holding long enough for autopilots to accumulate the data that lets them handle judgment too. I've been honest throughout this book that the layers keep collapsing. Problem selection, constraint definition, drift correction. Each was supposed to be durably human. Each is partially automating. If judgment automates faster than expected, the autopilot doesn't just replace the outsourced worker. It replaces the firm that replaced them. And then it replaces whatever replaced that.

The factory made the guns faster. Now it's aiming them for you.

Afterword

Afterword: The Meta-Play

This book is an example of its own argument.

The chapters are content — features, in the language of Chapter 1. Anybody with access to the same models, the same industry reports, and a free weekend could produce a similar-looking document. The articles that became these chapters started as LinkedIn posts. LinkedIn posts are about as defensible as a to-do app.

What's harder to replicate is the thing underneath the articles. The analytical framework that connects Wardley mapping to Carlota Perez to Hamilton Helmer to what I watched happen in a conference room when sixty-five people stared at blank AI prompts. The operational experience — onboarding sessions, team reorganizations, factory infrastructure built and rebuilt — that shaped which claims survived contact with reality and which ones I quietly dropped. The network of people who read early drafts, pushed back, pointed me toward evidence I'd missed, and pressure-tested the argument in their own organizations.

The articles are the output. The knowledge base is the moat.

That parallel isn't clever. It's just true, and I think it's worth naming. The defensibility matrix from Chapter 1 applies to intellectual work the same way it applies to software products. Content sits in the upper-left quadrant: digital, copyable, commodity. The judgment that produced the content — which evidence to trust, which frameworks to apply, which conclusions to resist — sits on the right. Accumulated through experience. Embedded in relationships. Resistant to replication.

These ideas developed through a specific process: writing publicly, getting feedback, going deeper, writing again. The first article was wrong about several things. Readers told me. I revised. The second article incorporated what the first one missed. By the fourth, the argument had been sanded down by enough outside contact that the parts still standing were probably load-bearing. That iterative cycle — publish, get challenged, revise, publish again — is itself a judgment process. It's the same process I described throughout this book as the durable human contribution: not the production, but the steering.

I want to close by pointing outward rather than wrapping things up neatly. This book describes a transition, not a destination. The layers keep collapsing. Problem selection was supposed to be durably human; it's partially automating. Constraint definition was supposed to require experience; agents are getting better at it. The blank input problem — the observation that people freeze when asked to specify intent — may turn out to be a temporary artifact of unfamiliar interfaces rather than a permanent feature of human cognition.

I don't know where it stops. Nobody does, and I'm skeptical of anyone who claims to. What I've tried to do here is describe the current moment clearly — the mechanism, the evidence, the implications — so that the description is useful even after the moment passes. The honest position is to keep watching, keep adjusting, and resist the temptation to declare either victory or defeat for human judgment before the evidence is in.

The factory is running. What you aim it at is still up to you. Probably.

By Sam Zoloth

Product leader exploring how AI changes software development, team structure, and competitive strategy.

The Software Industrialization Thesis

Preface

Who this is for

What this is not

How to read this

The End of Features

Why this changed

What survives

The Aggregation Theory lens

Why this is uncomfortable

Boundaries

How to use this

The Hidden Complexity

Three ways this breaks

Context redundancy

Context decay

Capability silos

The wrong rollout

What better teams share

A four-layer model for team AI

Layer 1: Live infrastructure

Layer 2: Project context

Layer 3: Team capabilities

Layer 4: Personal preferences

How to tell whether your setup scales

Boundaries

Where to start

The Steersman

Closing the loop

What changed on the ground

What the steersman keeps

Problem selection

Constraint definition

Drift correction

What to practice now

Boundaries

The real adjustment

The Software Industrial Complex

What makes this different from copilots

The factory pattern

Triggers

Sandboxes

Context injection

Verification

Single-agent vs. multi-agent

Why this matters strategically

The blueprint is getting cheaper

The false summit

Boundaries

A practical diagnostic

Eight frameworks, one conclusion

The convergence

What Wardley sees

What Helmer reveals

Where Christensen breaks

The turning point

The value chain inverts

The acceleration and the transient edge

What's missing from the thesis

So what

The Defensibility Inversion

The wrong question

Implication 1: your value chain is upside down

Implication 2: the counter-positioning window is real and closing

Implication 3: you need a disengagement plan for features

Implication 4: the measurement stack is wrong

Implication 5: the hiring profile changes

Implication 6: the institutional gap is a decade wide

The uncomfortable timeline

The Work

The factory makes the wrong thing

Copilots and autopilots

Intelligence and judgment

The blank input problem

What the factory changes

Afterword: The Meta-Play