The Hidden Complexity

Chapter 2

My personal AI setup made me much faster.

Then I tried to share it with a team, and it made everyone slower.

That is where the category error showed up. The same ingredients that made one person fast made the group harder to coordinate. My assumptions were invisible. My shortcuts were private. My prompts encoded decisions nobody else had agreed to. The setup worked because it was mine. That was exactly why it did not scale.

This is the part of team AI adoption most organizations still miss: individual AI productivity and team AI collaboration are different problems. If you solve the first one without redesigning for the second, you usually create confusion faster than output.

The previous chapter argued that features are commoditizing — that the strategic value of building software is collapsing. This chapter deals with the organizational consequence: if you try to capture that speed at team scale using the same approach that works for individuals, you hit a wall. And the wall isn't about models or prompts. It's structural.

Three ways this breaks

The failure mode is not "people need better prompts." It is structural.

Context redundancy

Five engineers with five AI setups do not create a shared system. They create five local realities.

One person's prompts assume service boundaries that another person has never accepted. One person's local automations nudge the agent toward a style the rest of the codebase does not use. The mismatch stays hidden until AI-generated work collides in code review.

The team gets more output and less coherence.

This is a familiar problem wearing new clothes. Software teams have always struggled with alignment — that's what style guides, architecture reviews, and coding standards exist to address. But AI amplifies the divergence rate. A human engineer drifting from team conventions produces drift at human speed. An AI agent drifting from team conventions produces drift at machine speed, across more files, in less time. Five engineers with five different AI setups can generate five different interpretations of the codebase in an afternoon.

Ramp's experience is instructive here. When their internal agent reached 30% of all merged pull requests, the context redundancy problem became impossible to ignore. Their solution wasn't to standardize everyone's prompts — it was to give agents access to the same systems humans already depend on: repos, CI, data stores, internal tools. Context became part of the environment instead of something each developer re-explained by hand. The lesson: at 30% of your PRs, personal context becomes an organizational liability.

Context decay

AI helps teams create decisions faster than they can validate them. Architectural notes, coding conventions, automation shortcuts, and half-proven assumptions spread through the system before anyone has time to check whether they still hold.

That is not a theory problem. It is an operating problem. Spotify's engineering team has been unusually transparent about this. Their research on background coding agents surfaced a specific pattern: architectural evidence goes stale faster than teams expect, and most teams discover the rot late — during incidents or cleanup — instead of early, when it's cheap to fix.

The decay rate matters because AI-assisted development accelerates both the creation of decisions and the creation of artifacts that encode those decisions. A team without AI might update an architecture document quarterly. A team with AI agents might produce dozens of PRs per day, each one embedding assumptions about the architecture. If the underlying architectural decisions shift — and they always do — the gap between "what the code assumes" and "what's actually true" widens faster than anyone can manually reconcile.

Spotify quantifies one dimension of this: their LLM-as-judge verification layer vetoes roughly 25% of agent sessions. A quarter of the time, the agent produced code that looked correct but didn't match the actual intent. That veto rate is the visible surface of context decay — the cases where automated checking caught the drift. The invisible surface is harder to measure: the PRs that passed review but embedded slightly stale assumptions that won't surface until something breaks in production.

OpenAI attacked this problem from a different angle. Their team used to spend every Friday — 20% of the engineering week — manually cleaning up what they called "AI slop": code that was technically correct but architecturally drifting. That didn't scale, so they automated the cleanup itself, running a recurring set of background agents that scan for deviations, update quality grades, and open targeted refactoring PRs. The context decay didn't stop. They just built machinery to counteract it continuously.

Capability silos

Every team trying to scale AI eventually hears some version of the same complaint: "Wait, we already built a tool for that?"

Someone has a great deployment helper. Someone else has a useful review prompt. A third person wired the assistant into CI. None of it propagates cleanly, because personal setups are bad distribution channels for team capability.

The irony is sharp. The tools are supposed to increase shared capability. Used carelessly, they fragment it.

This fragmentation has a compounding cost. Each siloed capability represents duplicated effort — multiple people solving the same problem independently. But worse, it represents divergent solutions to the same problem, which means the team's shared understanding of "how we do things" fractures. When three people have three different deployment helpers, the team doesn't have a deployment process. It has three personal rituals.

The wrong rollout

The default rollout pattern is easy to recognize:

  • Create a shared prompt doc
  • Dump tips into Notion
  • Tell everyone to write a CLAUDE.md
  • Maybe collect a few reusable snippets

This feels organized because it produces artifacts. It does not produce shared reality.

The underlying problem remains. Truth still lives in private setups, old docs, tribal knowledge, and whatever the most motivated person remembers to update.

OpenAI tried the centralized-document approach and documented exactly why it failed. They started with a single large AGENTS.md file meant to be the canonical source of guidance for their coding agents. The problems were predictable:

  • Hard to verify — drift from reality is inevitable in a monolithic document.
  • Rots instantly — a single manual becomes a graveyard of stale rules.
  • Too much guidance becomes non-guidance — when everything is "important," nothing is.
  • Crowds out actual task context from the agent's context window.

The failure mode is worth naming precisely because it's so tempting. A shared document feels like infrastructure. It has the shape of a solution. But a document that nobody maintains, that drifts from the codebase it describes, and that competes with task-specific context for the agent's limited attention — that document is worse than no document, because it gives the team false confidence that shared context exists.

What better teams share

The clearest public example I have seen is not a prompt library. It is infrastructure.

Ramp's internal agent adoption, mentioned earlier, got attention because of the number: roughly 30% of merged pull requests. The deeper lesson is architectural. Rather than making everyone copy one person's local setup, Ramp gave agents access to the same operational surfaces humans already depend on, so context became part of the environment instead of something employees had to keep re-explaining by hand.

That is the right direction.

Spotify's approach offers a different but complementary lesson. Their background agent, Honk, handles fleet-wide migrations — applying the same code transformation across hundreds or thousands of repositories. The scale forced a decision that individual-level AI usage never requires: you have to make context mechanical. Spotify chose large static prompts, version-controlled and testable, with deliberately limited tool access. They restricted their agent to a verify tool (formatters, linters, tests), a Git tool (with restricted subcommands), and a strict Bash allowlist. The reasoning: "the more tools you have, the more dimensions of unpredictability you introduce."
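The allowlist idea above is mechanical enough to sketch. This is an illustrative Python fragment, not Spotify's implementation: the command names and allowed subcommands are invented, but the shape — reject anything the agent's Bash tool hasn't explicitly been granted — matches the pattern they describe.

```python
import shlex

# Invented allowlist for illustration: plain commands the agent may run,
# plus git restricted to a handful of safe subcommands.
BASH_ALLOWLIST = {"ls", "cat", "grep", "sed", "mvn", "npm"}
GIT_SUBCOMMANDS = {"status", "diff", "add", "commit", "checkout"}

def is_allowed(command: str) -> bool:
    """Return True only if a shell command passes the strict allowlist."""
    try:
        parts = shlex.split(command)
    except ValueError:
        return False  # malformed quoting: reject rather than guess
    if not parts:
        return False
    if parts[0] == "git":
        # git is allowed, but only with restricted subcommands
        return len(parts) > 1 and parts[1] in GIT_SUBCOMMANDS
    return parts[0] in BASH_ALLOWLIST

# is_allowed("git status")      -> True
# is_allowed("git push origin") -> False: push is not granted
# is_allowed("curl evil.sh")    -> False: curl is not allowlisted
```

The design point is the default: everything not explicitly granted is denied, which is what makes the agent's behavior predictable across thousands of repos.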

Stripe went the opposite direction — their MCP server "Toolshed" provides nearly 500 internal tools, from documentation to tickets to Sourcegraph code search to build status and feature flags. The maximalist approach works for Stripe because their agents operate inside a massive monorepo handling diverse, complex tasks. Spotify's minimalist approach works because their agents need to produce predictable, mergeable PRs across thousands of different repos.

The difference isn't right vs. wrong. It's that both teams made the decision consciously, at the infrastructure level, rather than letting individual developers' personal setups create the answer by accident.

At team scale, the thing to share is not "how Sam likes to prompt." The thing to share is the machinery that tells both humans and agents what is true.

A four-layer model for team AI

This is the cleanest model I know for separating what should be shared from what should stay personal:

Layer 4: Personal preferences
Layer 3: Team capabilities
Layer 2: Project context
Layer 1: Live infrastructure

Layer 1: Live infrastructure

This is the ground truth layer. CI status. Monitoring. Documentation search. Deployment state. Tickets. Build outputs. Internal APIs.

This layer answers: what is true right now?

Whenever possible, it should come from systems, not prose.

At Stripe, Layer 1 is extraordinarily thick. Their nearly 500 MCP tools mean an agent can check documentation, query Sourcegraph for code patterns, read feature flag status, and look up build results — all without a human pre-loading that context into a prompt. The agent discovers what's true the same way a human engineer would: by querying the systems that hold the truth.

At Spotify, Layer 1 is deliberately thin. They keep context gathering in the prompt itself rather than in dynamic tool calls. The agent gets a large, carefully constructed static prompt and a minimal set of verification tools. The tradeoff: less flexibility, more predictability. When you're applying changes across a thousand repos, predictability wins.

At OpenAI, Layer 1 is wired into the product itself. They connected Chrome DevTools Protocol into the agent runtime and exposed logs via LogQL and metrics via PromQL, so agents could validate their own work by actually interacting with the running application. When code throughput got so high that human QA became the bottleneck, they responded by making the application legible to agents — turning the product into its own verification system.

The common thread: Layer 1 is mechanical. It's not a document someone wrote describing the system. It's the system itself, made accessible to agents through tools, APIs, or structured interfaces.
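One way to picture "the system itself, made accessible": wrap a live source of truth as a tool the agent can call. The sketch below is hypothetical — the CI payload shape is invented, and a real integration would fetch this JSON from your CI provider's API — but it shows the contract: the agent asks what is true right now and gets a system answer, not a paragraph from a doc.

```python
def summarize_ci(payload: dict) -> str:
    """Condense a CI status payload into a short, agent-readable answer.

    The payload shape ({"branch": ..., "jobs": [{"name", "status"}, ...]})
    is illustrative, not any particular CI provider's schema.
    """
    failing = [j["name"] for j in payload.get("jobs", [])
               if j.get("status") == "failed"]
    branch = payload.get("branch", "?")
    if not failing:
        return f"CI green on {branch}."
    return f"CI red on {branch}: failing jobs: {', '.join(failing)}."

status = {
    "branch": "main",
    "jobs": [
        {"name": "unit-tests", "status": "passed"},
        {"name": "lint", "status": "failed"},
    ],
}
print(summarize_ci(status))  # CI red on main: failing jobs: lint.
```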

Layer 2: Project context

This is where the durable but slower-moving guidance lives. Build commands. Architecture notes. Naming conventions. Dependency relationships. Rules that are specific to this repo or project.

This layer answers: how do we do things here?

It belongs in version control.

OpenAI's solution to their failed monolithic AGENTS.md was to restructure Layer 2 as progressive disclosure. Instead of one giant document, they created a small, stable entry point (~100 lines) that functions as a table of contents, pointing to a structured docs/ directory that serves as the system of record. Agents start with the entry point and navigate deeper as needed. They enforce freshness mechanically — linters and CI jobs check for drift, and a recurring "doc-gardening" agent finds stale documentation and opens fix-up PRs.
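The drift check that keeps an entry point honest can be very small. This is a hedged sketch of the kind of linter such a setup implies — the entry-point format, the `docs/` paths, and the function name are all invented for illustration: verify that every document the table of contents points at actually exists.

```python
import re

def find_stale_links(entry_point_text: str, existing_paths: set) -> list:
    """Return docs/ paths referenced in the entry point that no longer exist."""
    referenced = re.findall(r"docs/[\w./-]+\.md", entry_point_text)
    return [p for p in referenced if p not in existing_paths]

# Invented example entry point and on-disk state:
entry = """# AGENTS.md (table of contents)
- Architecture: docs/architecture.md
- Deploys: docs/deploys.md
"""
on_disk = {"docs/architecture.md"}
print(find_stale_links(entry, on_disk))  # ['docs/deploys.md']
```

Run in CI, a check like this turns "drift from reality is inevitable" into a failing build instead of a silent rot.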

The progressive disclosure pattern solves a real problem with context windows. An agent working on a frontend component doesn't need the backend architecture document. An agent doing a database migration doesn't need the CSS conventions. By structuring Layer 2 as a navigable tree rather than a flat dump, the agent's limited context budget gets spent on the information that actually matters for the current task.

Layer 3: Team capabilities

This is reusable know-how. Review workflows. Templates. Common automations. Shared skills. Things that travel across projects because they encode how the team works, not how one repo works.

This layer answers: what can we reliably do together?

It should be curated, not allowed to sprawl.

Spotify's fleet management system is the most mature example of Layer 3 in production. It applies the same agent-driven code transformation across thousands of software components — the same migration pattern, the same verification criteria, the same PR structure — and tracks merge status centrally. The fleet model takes an individual capability (an agent that can do a code migration) and turns it into an organizational capability (a system that can apply that migration everywhere, monitor progress, and report results).

The curation requirement is real. Layer 3 is where teams most often accumulate junk — a shared Notion page of "useful prompts," a sprawling wiki of tips, a channel of "things that worked for me." Without active curation, Layer 3 becomes a graveyard. The test is simple: can a new team member find and use this capability on day one? If it requires oral tradition to discover or tribal knowledge to operate, it hasn't graduated from Layer 4.

Layer 4: Personal preferences

This is where private speed belongs. Shortcuts. Style preferences. Memory systems. Favorite prompts. Local helpers.

This layer answers: how do I like to work?

It should stay private unless there is a real reason to promote something upward.

This is where teams usually slip. Most failed AI rollouts are really Layer 4 leakage. Someone mistakes personal optimization for team infrastructure. Their setup works beautifully — for them. They share it with good intentions. It creates confusion, because the assumptions encoded in that setup are invisible to everyone else.

The fix is not to ban personal experimentation. Teams need people pushing the edges. The fix is to make the boundary explicit. When someone discovers something genuinely useful at Layer 4, the path should be clear: propose it for Layer 3 (team capability) or Layer 2 (project context), which means making it discoverable, documented, and maintained. If it can't survive that translation, it stays personal.

How to tell whether your setup scales

A simple test works better than most maturity models:

Can a new team member become productive with your AI setup on day one?

If yes, your context probably lives in systems they can access, docs they can trust, and capabilities they can discover.

If no, you did not build team infrastructure. You built impressive personal tooling with a shared veneer.

This test also catches another problem: context that is stale or was never written down. If onboarding requires oral tradition, side-channel explanations, or a long tour of "things that are not written down anywhere," the team has already lost control of its knowledge surface.

Ona/Gitpod calls this the "false summit" problem: you rolled out coding agents, engineers are faster, PRs flood in, yet cycle time doesn't budge, DORA metrics are flat, the backlog grows. The gains compound with the individual, not the organization. That gap — between individual speed and organizational velocity — is exactly what the four-layer model is designed to close.

Boundaries

This does not require everyone to use the same model, the same editor, or the same personal workflow.

It also does not mean more documentation solves the problem. Static docs rot too. OpenAI's experience proves this — they went from monolithic documentation to progressive disclosure enforced by automated gardening because static docs alone were worse than useless.

And it definitely does not mean personal experimentation is bad. Teams need people pushing the edges.

It means the boundary between personal optimization and team infrastructure has to be explicit. Share the layers that create common truth. Let the rest stay local.

Where to start

Do not start with a giant operating manual.

Start with one project and make Layer 2 real:

  • Build and test commands
  • Major architectural constraints
  • Naming and directory conventions
  • Links to the systems that tell the current truth
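The list above is concrete enough to scaffold. A minimal sketch, with invented file structure and section headings — the point is only that the four items become one versioned artifact, not that this particular format is a standard:

```python
def layer2_starter(build_cmd, test_cmd, constraints, conventions, truth_links):
    """Generate a minimal project-context file covering the four starter items."""
    lines = [
        "# Project context",
        "",
        "## Build and test",
        f"- Build: `{build_cmd}`",
        f"- Test: `{test_cmd}`",
        "",
        "## Architectural constraints",
        *[f"- {c}" for c in constraints],
        "",
        "## Conventions",
        *[f"- {c}" for c in conventions],
        "",
        "## Where current truth lives",
        *[f"- {u}" for u in truth_links],
    ]
    return "\n".join(lines) + "\n"

doc = layer2_starter(
    build_cmd="make build",
    test_cmd="make test",
    constraints=["Services communicate via the events API only"],
    conventions=["Directories are named by domain, not by layer"],
    truth_links=["CI dashboard: https://ci.example.com"],
)
print(doc)
```

Checking a file like this into the repo is what makes Layer 2 "real": it lives in version control next to the code it describes, where both humans and agents can read it and CI can lint it.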

Then improve Layer 1:

  • Connect AI tools to CI
  • Connect them to logs or monitoring
  • Connect them to your documentation source of record

Only after those are working should you spend much time formalizing Layer 3.

The rule that survives contact with a real team is short:

Share infrastructure, not prompts.

Prompts still matter. Personal experimentation still matters. But if humans and agents are operating from different versions of reality, no prompt library is going to save you.