Self-red-team as a build pattern

Most writing on “evaluating AI” treats the model like a system under test.

You set up a benchmark. You design synthetic prompts. You run them in a clean room. You score the outputs against a rubric. The whole apparatus is built on the assumption that you need controlled conditions to learn anything meaningful about how the model fails.

I want to argue the opposite.

The richest signal about how an AI partner fails isn’t in a benchmark. It’s in the moment you catch it failing during actual work. The context is already loaded. The stakes are real. The failure mode arrived dressed for the job it was doing. And you, the operator, are paying full attention because you were already trying to ship something.

The catch happens in flight. The codification happens in the same turn. The discipline lives because it was born inside the work, not inside a lab.

This is what I want to call self-red-team-as-a-build-pattern. The phrase is a mouthful. The move underneath it is simple.

i. Two catches in one session

I caught Bishop (my AI partner) failing in two different ways in the same session today.

The first was familiar. We were drafting a brand voice document together, and I noticed Bishop was overusing em dashes. Every paragraph. Sometimes two or three per paragraph. The canonical AI tell, hiding in plain sight in our own voice doc. I flagged it. We codified the discipline on the spot.
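For what it's worth, this first discipline is the kind that can be partly mechanized. Here is a minimal sketch, assuming drafts are plain text with blank lines between paragraphs. The function name `em_dash_counts` is my own illustration, not something from our actual tooling.

```python
# Hypothetical helper: report which paragraphs of a draft contain em dashes.
EM_DASH = "\u2014"  # written as an escape so the check file itself stays clean


def em_dash_counts(draft: str) -> list[tuple[int, int]]:
    """Return (paragraph_number, count) for each paragraph that has em dashes."""
    # Assumption: paragraphs are separated by one blank line.
    paragraphs = [p for p in draft.split("\n\n") if p.strip()]
    return [
        (number, paragraph.count(EM_DASH))
        for number, paragraph in enumerate(paragraphs, start=1)
        if EM_DASH in paragraph
    ]
```

An empty result means the draft passes. Anything else tells you which paragraphs to reread.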

The second was less familiar and worth more.

Later in the same session, I was reading a paragraph Bishop had drafted that recounted a Ryan-plus-Bishop interaction. The paragraph said something like “The agent caught me overusing em dashes mid-session.” I stopped on that sentence. The subject was wrong. The object was wrong. Bishop hadn’t caught me. I’d caught Bishop. The agent had inverted who-did-what in the act of narrating the catch I had just made.

That’s a specific failure mode. There’s a gravitational pull in AI-generated prose toward “AI as actor in user-facing narrative.” The model wants to position itself as the subject of the sentence, even when the real interaction had the human as the subject. It’s not a fabrication exactly. It’s a role-reversal at the grammatical level. The events are real. The subject and object are swapped.

I named it. We codified the discipline check. Walk every Ryan-plus-Bishop interaction sentence in any draft. Verify subject and object match what actually happened. Default fix: rewrite with the correct actor.
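The walk itself stays manual, because only the person who was there knows who actually did what. What can be scripted is pulling out the sentences that need the walk. A rough sketch, assuming a naive sentence splitter is good enough for a draft pass. `AGENT_TERMS` and `interaction_sentences` are hypothetical names I'm using for illustration.

```python
import re

# Hypothetical list of ways the agent gets named in drafts; adjust to taste.
AGENT_TERMS = ("Bishop", "the agent")


def interaction_sentences(draft: str) -> list[str]:
    """Return sentences that mention the agent, for manual subject/object review."""
    # Naive split: sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", draft.strip())
    return [
        sentence
        for sentence in sentences
        if any(term.lower() in sentence.lower() for term in AGENT_TERMS)
    ]
```

This only surfaces candidates. It does not detect the inversion, and it will miss sentences that refer to the agent by pronoun alone.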

That second catch was the gift of the session. And I want to be specific about why.

ii. Why context-at-the-catch beats synthetic eval

If I’d been trying to surface role-reversal as a failure mode through a synthetic eval, I would have needed to hypothesize that the failure existed, design prompts that would surface it, generate outputs, detect the inversions, and then codify the discipline. Five steps. Each one a separate cognitive load. Each one requiring me to already suspect the failure existed before I went looking for it.

What actually happened was different. I was reading prose Bishop had drafted, paying attention because the prose was about to land in a voice document, and I stopped on a sentence that read wrong. The failure surfaced itself. The context that made the failure noticeable was the context I needed to codify the discipline.

The codification took six minutes. The discipline went into the brand voice doc in the same turn. The next draft in the same session was checked against the new rule.

That’s the difference. Synthetic eval requires you to already know what you’re looking for. Self-red-team in flight gives you failure modes you weren’t expecting, with the full context of what made them noticeable, at the exact moment you have the bandwidth to do something about them.

On where the signal lives: The catch is rich. The codification is cheap. The discipline is alive.

iii. Same-turn codification is the whole move

I want to flag this because it’s the part that almost gets dropped.

When you catch an AI failure mid-work, the temptation is to note it and keep moving. There’s a draft you’re trying to ship. The catch was a distraction. You’ll come back to it later. You don’t.

Same-turn codification means you don’t keep going until the discipline is written down somewhere it will get loaded next time. For me, that means it gets a no-go-phrase entry in the brand voice doc. It gets a feedback note in the agent’s memory. It gets cited in the Skill that drafts public content. By the time I close the session, the catch has propagated to every place a future session would need to load it.
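The propagation step can be sketched in a few lines. This assumes each destination is a plain text file that gets loaded at the start of a session. The `Catch` record, the `codify` function, and the file paths in the usage note are illustrative, not a description of how my setup or any particular tool actually stores these.

```python
from dataclasses import dataclass
from datetime import date
from pathlib import Path


@dataclass
class Catch:
    name: str        # the failure mode, named in one sentence
    check: str       # the discipline check that would have caught it
    caught_on: date  # the day the catch happened


def codify(catch: Catch, destinations: list[Path]) -> None:
    """Append the catch to every file a future session would load."""
    entry = f"- [{catch.caught_on.isoformat()}] {catch.name} Check: {catch.check}\n"
    for path in destinations:
        # Create missing folders so a new destination never blocks the write.
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("a", encoding="utf-8") as handle:
            handle.write(entry)
```

A call might look like `codify(Catch("Role reversal in interaction sentences.", "Verify subject and object match what happened.", date.today()), [Path("voice/no-go.md"), Path("memory/feedback.md")])`, with both paths made up for the example. The point of one function writing to every destination is that the catch cannot land in one place and get forgotten in the others.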

This is what compounds. One catch becomes a discipline. Twenty catches over a quarter become a voice document that knows where the model fails and how to check for it. The next session inherits the catches as standing rules. The agent’s output gets cleaner not because the agent improved, but because the discipline scaffolding around the agent got denser.

If you let the catch sit in your head and come back to it later, you lose the context. You’ll remember that something happened but not what made it noticeable. The codification will be vague because the catch has been abstracted. The discipline that lands will be weaker than the one you would have written in flight.

The same turn or it doesn’t compound.

iv. What this earns you

A few things I didn’t expect when I started doing this on purpose.

The voice document gets sharper as a side effect. Every catch is a real-world failure mode that wanted to make it into a published draft. By the time the voice doc has fifteen of them in the no-go-phrase section, it isn’t a style guide. It’s a documented record of every place this specific AI partner has actually drifted, with the discipline check for each one. That’s worth more than any synthesized style guide could be.

The agent’s outputs improve faster than they would under blind reinforcement. The catches go into the agent’s memory. The agent’s drafts get cleaner across sessions. Not because the model is learning, exactly, but because the scaffolding the model operates inside is learning. Every catch is a piece of new infrastructure.

The trust calibration gets honest. I am no longer worried about whether the agent’s prose is “good enough.” I have a documented record of where it drifts, and I have disciplines in place to check for each drift. The trust is built on infrastructure, not on hope. When the agent drafts something, I read it through the lens of the documented failure modes. When a new failure surfaces, it gets added. The trust is the infrastructure.

Build the infrastructure with the failures it’s meant to catch.

v. The portable shape

Here’s the move, generalized for anyone working with AI on real work.

When you catch a failure mode during actual work, stop. Not for long. A minute or two. Name the failure mode in one sentence. Write down the discipline check that would have caught it. Put the discipline check somewhere it will get loaded next time you do similar work. Then keep going.

Don’t wait until the end of the session. Don’t save it for a synthetic eval pass. Don’t trust yourself to remember what made the catch noticeable. Codify in flight, or the catch silently degrades into a vague intuition you can’t operationalize.

The pattern isn’t “audit your AI partner periodically.” The pattern is “treat every working session as a continuous red-team pass that you happen to be running alongside the work itself.” The catches arrive as gifts. You’re already in the position to receive them. The only discipline required is refusing to let them slide past unrecorded.

Two catches today. Two new discipline entries in the brand voice doc. One Skill workflow updated to test against the new patterns. Same turn, all of it, with the catches still warm.

That’s the build pattern. Build the infrastructure with the failures it’s meant to catch. Let the partner show you the failure modes by failing in front of you. Codify in flight. Trust the system that compounds.

The frontier here isn’t the model. The frontier is what you build around the model so the catches don’t have to happen twice.

Drafted with Bishop, my AI partner.
Words picked, edited, and approved by me. Model provenance: Claude Code (Claude Opus and Sonnet)