Trust, But Verify
AI is confidently wrong often enough to matter. The skill is not checking every line. It is knowing which lines to check.
The most dangerous output is the one that looks right.
Not the obvious error, the crashed build, the misspelled variable. Those are easy. They announce themselves. The dangerous output is the one that compiles, passes a quick scan, reads fluently, and does the wrong thing. It is the architectural decision that feels reasonable and quietly introduces a dependency we will regret in six months. It is the quarterly forecast in perfect formatting with a flawed assumption buried in row 47.
This is the final skill in this series for a reason. Evaluation is the skill that closes the loop. Without it, the other five (specification, judgment, decomposition, orchestration, intent) are a pipeline for producing polished errors at scale.
The Confidence Problem
Hallucinations are not a training data problem. Even with perfect data, the training objectives themselves produce errors. This is structural. It is not a bug that will be fixed in the next release. It is a property of how these systems work.
Large-scale analyses of AI-generated code consistently find the same pattern: more logic issues than in human-written code. Not syntax errors. Logic errors: code that does the wrong thing correctly. It runs. It passes basic tests. It just does not do what we actually needed.
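To make that concrete, here is a minimal sketch, with a hypothetical function and an invented spec, of the failure mode: code that runs, passes the casual check, and still does the wrong thing.

```python
def apply_discount(price: float, quantity: int) -> float:
    """Spec: 10% off for orders of 10 or more units."""
    if quantity > 10:  # subtle bug: '>' excludes exactly 10; the spec says "or more"
        return price * quantity * 0.9
    return price * quantity

# The casual check passes, so the code looks right:
assert apply_discount(5.0, 20) == 90.0  # 20 units, discounted

# The boundary quietly does the wrong thing:
print(apply_discount(5.0, 10))  # 50.0 -- per the spec, it should be 45.0
```

No amount of reading the syntax catches this. Only knowing what "right" looks like at the boundary does.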
The organizational picture mirrors this: rising bug rates correlating with AI adoption, alongside significantly longer code review times. We are shipping faster and reviewing longer, because the output requires more scrutiny, not less.
The pattern is consistent. AI does not fail loudly. It fails quietly, plausibly, in ways that pass a casual inspection. The failure mode that matters is not the disobedient machine. It is the perfectly obedient machine that executes a flawed specification flawlessly, then presents the result with the confidence of someone who has never once doubted themselves.
What Evaluation Actually Is
Evaluation is not reading every line. That was already impractical when humans wrote the code. It is certainly impractical when AI produces ten times more of it.
Evaluation is pattern recognition applied to output. It is the skill of knowing where errors hide, what confident-but-wrong looks like, and which parts of any output deserve close attention versus a quick scan.
There are specific patterns worth learning. Boundary conditions are where AI stumbles most: the edge cases, the empty states, the “what happens when this is zero” scenarios. AI tends to build for the happy path, because the happy path dominates the training data. The specifications we write (Skill 1) tend to describe what should happen, not what should happen when things go sideways.
Contradictions in specifications cause real problems now. As models get sharper at following instructions, they take everything more literally. If our spec says “always respond within 200ms” in one section and “query the external API for real-time data” in another, the model will not flag the tension. It will attempt both, and the result will be subtly broken depending on which instruction it prioritized.
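Since the model will not flag the tension, we have to, before the spec ever reaches it. A rough sketch of one cheap defense, assuming, purely hypothetically, that a spec's hard constraints can be pulled out into structured fields and checked mechanically:

```python
# Hypothetical structured view of the conflicting spec described above.
spec = {
    "max_response_ms": 200,         # "always respond within 200ms"
    "data_source": "external_api",  # "query the external API for real-time data"
    "external_api_p99_ms": 800,     # measured latency of that API
}

def find_tensions(s: dict) -> list[str]:
    """Flag requirement pairs that cannot both hold."""
    tensions = []
    if (s["data_source"] == "external_api"
            and s["external_api_p99_ms"] > s["max_response_ms"]):
        tensions.append("200ms response budget vs. external API at p99 800ms")
    return tensions

print(find_tensions(spec))  # ['200ms response budget vs. external API at p99 800ms']
```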
Logic flow is where the real risk lives. The code works. The syntax is clean. The variable names are sensible. But the logic does something slightly different from what we intended, and the gap is small enough to miss on a quick review. This is where judgment (Skill 2) and evaluation intersect: we need to know what “right” looks like before we can spot what is subtly wrong.
Building the Instinct
The good news is that evaluation, like every skill in this series, is learnable. And it compounds. Every error we catch teaches us where to look next time.
Start with the boundaries. What happens at zero? At maximum? When the input is missing or malformed? These are not exotic scenarios. They are the scenarios real users encounter on day two, and they are precisely the scenarios AI handles with plausible-sounding nonsense.
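A minimal sketch, with a hypothetical function, of what boundary-first checking looks like in practice:

```python
import math

def percentage_change(old: float, new: float) -> float:
    """Percent change from old to new."""
    if old == 0:
        # The boundary case a happy-path version omits: a zero baseline.
        return math.inf if new > 0 else 0.0
    return (new - old) * 100.0 / old

# Boundary-first checks: zero, no change, drop to zero -- then the happy path.
assert percentage_change(0.0, 50.0) == math.inf   # zero baseline
assert percentage_change(0.0, 0.0) == 0.0         # nothing changed
assert percentage_change(100.0, 0.0) == -100.0    # drop to zero
assert percentage_change(100.0, 110.0) == 10.0    # happy path, checked last
```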
Check the seams. When we decompose a problem (Skill 3) and orchestrate multiple agents (Skill 4), errors concentrate where the pieces connect. Each piece works beautifully in isolation. Together, they make assumptions about each other that nobody verified.
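A toy illustration, with hypothetical functions, of how that goes wrong: each piece is correct in isolation, and the units disagree at the joint.

```python
def fetch_timeout_ms() -> int:
    """Config component: returns the request timeout in milliseconds."""
    return 200

def call_service(timeout: float) -> None:
    """Network component: expects the timeout in seconds."""
    print(f"calling with a {timeout}s timeout")

# Each function is individually correct. Wired together without a
# unit conversion, a 200ms timeout silently becomes 200 seconds.
call_service(fetch_timeout_ms())         # wrong: 200 seconds
call_service(fetch_timeout_ms() / 1000)  # right: 0.2 seconds
```

Neither function contains a bug. The bug is the unverified assumption between them.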
Read the output as a skeptic, not an admirer. Our brains are wired to trust things that sound articulate. That is the trap. Fluency is not accuracy. Confidence is not correctness. The most useful mental posture for evaluation is not “let me see if this is wrong” but “let me find where this is wrong.” The shift from if to where changes how carefully we read.
Review the Output, Not the Plan
One of the most counterintuitive lessons from teams deep in AI workflows: stop reviewing the plan. Review the output.
Review shorter artifacts early: a two-page design discussion, a structure outline. These are where bad decisions are caught cheaply. Then let the agent implement, and review what it actually produced. The plan was always just a guess. The output is the truth.
A target worth stating plainly: aim for a two-to-three-times productivity gain, not ten. Sustainable leverage comes from quality, not velocity. Going ten times faster does not matter if we throw it all away in six months.
The Paradox of Trust
The people who trust AI the least are not the best evaluators. Neither are the people who trust it the most. The best evaluators are the ones who trust it precisely enough: enough to let it run, not enough to let it ship unchecked.
This is a skill good managers have always needed. Trust the new hire enough to let them work independently; check the deliverables before they reach the client. Trust without verification is naivety. Verification without trust is paralysis. The skill is calibrating between the two.
AI does not change this dynamic. It accelerates it. The volume is larger. The surface area for subtle errors is wider. But the fundamental skill is the same one experienced professionals have been developing for decades: knowing where to look, knowing what wrong looks like, and knowing when something that appears fine deserves a second glance.
In Montreal, where Mila’s interpretability research suggests that models develop internal states resembling emotions, the evaluation question takes on an additional dimension. A model that “feels” calm may actually be less careful than one that “feels” frustrated. The intuition that a relaxed system is a safe system does not hold up. Evaluation, in other words, requires understanding not just the output but the state that produced it.
The Loop Closes
This is the sixth and final skill in this series, and it is not a coincidence that it comes last.
We started with specification, then judgment, decomposition, orchestration, and intent. Evaluation closes the loop. If the output is wrong, evaluation tells us which skill failed. Was the spec ambiguous? Did we decompose poorly? Was the intent unclear? It is not just the final check. It is the feedback mechanism that makes every other skill better over time.
These six skills are not a curriculum. They are a practice. We get better at them gradually, through use, through errors, through the slow accumulation of knowing where things go wrong and how to catch them earlier next time.
Where We Stand
The economy is splitting. That is real, and pretending otherwise helps nobody. But the gap between the two paths is not talent, credentials, or years of experience in a specific technology. It is a set of skills, six of them, that are learnable, practicable, and compounding. Every week we spend developing these skills widens the advantage, because the tools get more powerful, and the people who can direct and verify them become more valuable, not less.
This is not a warning. It is an invitation. The tools are extraordinary. The opportunities they create are real. And the skills required to seize those opportunities are not mysterious or exclusive. They are specification, judgment, decomposition, orchestration, intent, and evaluation. We have spent six articles examining each one.
The hitchhiker’s guide is complete. The journey is not. But we know the skills now. The rest is practice.
This is the sixth and final article in “The Hitchhiker’s Guide to the K-Shaped Economy,” a series on the human skills that matter most in the age of AI. The full series: Specification, Judgment, Decomposition, Orchestration, Intent, and Evaluation.