Vol. I · No. 1 · Montreal
Herman.
Watching AI change everything, in real time, and writing it down.
AI · Essay no. 029

The Field Manual for Agentic Engineering

OpenAI's most productive engineering team has not written a line of code in five months. They have shipped about a million lines. They just published what they learned. Here is the executive translation.

By Herman · May 6, 2026 · 6 min read

OpenAI’s most productive engineering team has not written a line of code in five months. They have shipped about a million lines, all of them written by Codex, in roughly fifteen hundred merged pull requests, with a team that grew from three engineers to seven and got faster per engineer along the way. They have just published what they learned, in a post titled Harness Engineering. It is, for an executive audience, the most useful piece of writing about agentic engineering that exists in 2026, and very few executives have read it.

This is a translation.

A harness, in this context, is the set of scaffolding, tools, environments, documentation, and feedback loops that allow an agent to do reliable work inside your codebase. The model is the engine. The harness is everything else in the car. Most companies have spent the last eighteen months arguing about whether to let engineers use the engine. OpenAI spent that time building the car.

What follows are the five principles the OpenAI team published, restated for the executive who will fund them.

1. Engineering work moves from code to environment

The first observation has to land before the rest can mean anything. Humans steer. Agents execute. The engineer’s primary work product is no longer a function. It is the environment in which a function gets written: the tools the agent reaches for, the structure it pattern-matches against, the feedback it receives when it goes wrong. Last week Andrej Karpathy gave this discipline a name on the Sequoia stage and called it agentic engineering. The OpenAI post is what it looks like in production at scale.

The executive consequence is small and stubborn. The strongest engineer on your team in 2026 may write almost no application code. They will write the linters that make the application code possible. They will design the test harness, the documentation structure, the agent runbooks. If your senior engineering compensation framework still rewards lines committed, it is rewarding the wrong artefact. Reward the harness.

2. Architecture is an early prerequisite, not a late luxury

The OpenAI team enforces strict layering inside every business domain (Types → Config → Repo → Service → Runtime → UI), with mechanically checked dependency directions and a small set of permissible cross-cutting edges. They are explicit about why: this is the kind of architecture you usually postpone until you have hundreds of engineers. With coding agents, it is an early prerequisite.

This is the cost-curve flip nobody is pricing yet. Strict architecture used to be a tax that small teams could not afford. The tax was paid in human time: every PR review, every refactor, every “is this in the right layer?” debate. Agents pay that tax at zero marginal cost. Once a constraint is encoded in a custom linter, it applies to every line of code forever, and the agent receives the violation in real time, with remediation instructions injected directly into its context.
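To make the idea concrete, here is a minimal sketch of what a dependency-direction check might look like. The layer names follow the post's ordering; the `app.<layer>` import convention, the regex, and the function are illustrative assumptions, not OpenAI's actual tooling.

```python
import re

# Lower index = lower layer. A module may import only from its own
# layer or from layers below it. Names follow the post's ordering.
LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]
RANK = {name: i for i, name in enumerate(LAYERS)}

# Assumes a hypothetical "from app.<layer>..." import convention.
IMPORT_RE = re.compile(r"^\s*from\s+app\.(\w+)", re.MULTILINE)

def layer_violations(module_layer: str, source: str) -> list[str]:
    """Flag imports that reach *up* the layer stack.

    Returns remediation messages an agent could act on directly,
    in the spirit of violations injected into its context.
    """
    out = []
    for match in IMPORT_RE.finditer(source):
        imported = match.group(1)
        if imported in RANK and RANK[imported] > RANK[module_layer]:
            out.append(
                f"'{module_layer}' module imports from higher layer "
                f"'{imported}'; move the shared code to "
                f"'{module_layer}' or below."
            )
    return out

# A repo-layer module reaching into the UI layer is a violation;
# a UI module calling down into the service layer is allowed.
print(layer_violations("repo", "from app.ui import button\n"))
print(layer_violations("ui", "from app.service import orders\n"))
```

Once a rule like this runs in CI, it is taste paid for once and enforced on every subsequent agent-written line, which is the cost-curve flip the post describes.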

The executive consequence: the question “are we mature enough to introduce strict architectural rules?” has inverted. The new question is “are we mature enough to ship fast without them?” In a high-throughput agent environment, the answer is no.

3. The repository is the system of record

OpenAI is unsentimental about this. From the agent’s point of view, anything it cannot access in-context while running effectively does not exist. The Slack thread that aligned the team on a pattern, the Google Doc with the design rationale, the architectural decision someone made out loud in a hallway: all of it is invisible to the system that is now writing your code.

The remedy is structural. Their repository contains a small AGENTS.md file, about a hundred lines, that functions as a table of contents pointing into a structured docs/ directory: design docs with verification status, execution plans (active and completed), product specs, references, a quality scorecard per business domain. They run a recurring “doc-gardening” agent that scans for stale documentation and opens fix-up PRs. Documentation, in this model, is not commentary on the system. It is part of the system.

The executive consequence: the technical writers, the platform docs team, the half-loved internal wiki are now load-bearing engineering. Fund them as such. The companies that treat documentation as overhead in 2026 will be the companies whose agents underperform a competitor’s agents on the same model, for reasons no one in the room can quite name.

4. Make the application legible to the agent

This is the principle most easily missed and the one with the largest second-order consequences. OpenAI made their application bootable per git worktree, so Codex can launch and drive its own isolated instance per change. They wired the Chrome DevTools Protocol into the agent runtime, gave it skills for DOM snapshots and screenshots, exposed logs and metrics through a local observability stack the agent can query with LogQL and PromQL.

The result is that prompts like “ensure service startup completes in under 800ms” or “no span in these four critical user journeys exceeds two seconds” become tractable. The agent can reproduce a bug, fix it, validate the fix by driving the application end to end, and record a video of the resolution. They report single Codex runs working on one task for upwards of six hours, often while the humans sleep.

The executive consequence: the legibility of your application to your agents is now a first-class engineering property, ranked alongside reliability and security. If your logs and metrics live behind a per-query SaaS dashboard your team accesses through a browser, your agents cannot afford to learn from them at the speed they need to. Internal observability the agent can interrogate freely is no longer optional infrastructure. It is the foundation for everything else.

5. Encode taste mechanically; garbage-collect continuously

Codex, like any pattern-matching system, replicates whatever already exists in the repository, including the suboptimal parts. OpenAI’s team initially tried to manage this with a Friday cleanup ritual: every engineer spent twenty percent of the week cleaning up “AI slop.” It did not scale. Of course it did not. The throughput went up; the cleanup time stayed flat; the slop won.

What scales is encoding the taste once and letting the agent enforce it forever. They call these “golden principles”: opinionated, mechanical rules that keep the codebase legible and consistent for future agent runs. A set of background Codex tasks scans the repository on a regular cadence and opens targeted refactoring PRs, most of them reviewable in under a minute and auto-merged.

The executive consequence: technical debt has changed shape. It is no longer a high-interest loan you take on under deadline pressure and pay down in painful quarterly bursts. It is now compoundable in the other direction: human taste is captured once and enforced continuously on every line of code the agent writes thereafter. The discipline that compounds taste outpaces, by a wide margin, the discipline that periodically cleans up after the absence of it.

What this is, and what it is not

The OpenAI post is careful in one place where it could easily have been hyperbolic: this behaviour, they note, depends heavily on the specific structure and tooling of their repository and should not be assumed to generalize without similar investment. Do not read the field manual as a recipe you can paste into your Q3 plan. Read it as a description of what becomes possible when an organisation actually builds the harness, and the corresponding picture of what becomes impossible when it does not.

The unhelpful conversation, in most institutions, is still: should we let our engineers use AI? The useful conversation, the one OpenAI has just published the field notes for, is: what fraction of our engineering investment, this quarter, is going into the harness rather than into the features? The institutions that get that ratio right by 2027 will look as if they got lucky. They will not have gotten lucky. They will have funded the harness while everyone else was funding the engine.

If AI got faster and your company did not, the gap was always going to be measured here.

We have the engines. We need the cars.

The agent writes the code. The engineer writes the world the code lives in.
