Why my AI agent uses Skills instead of monolithic workflows

There’s a moment in every agent build where you reach for the shortcut. You’re already deep inside one workflow. You need a small adjacent thing to happen. The path of least resistance is to graft the new behavior onto the workflow you’re already in.

One more step in the sequence. Five more lines in the file. You’re done in two minutes, and the system mostly keeps working.

I keep finding that the shortcut is the wrong call. Not always. But often enough that I’ve started defaulting the other way: when an adjacent behavior could be its own thing, make it its own thing. Even when the inline version is faster to write.

This is the post for why.

i.The temptation to inline

Last week I was finishing a Skill called session-handoff. The Skill’s job is to wrap a Claude Code session cleanly: scan the work that just happened, produce a structured handoff file, append to the changelog, surface anything the next session needs to know.

Late in the build I added a step. After the handoff file lands, mirror a curated slice of the project’s identity files into a separate private git repo I maintain as drift insurance. The repo exists so that if my main workspace ever corrupts, gets deleted, or otherwise goes sideways, I still have versioned snapshots of the core identity material the agent depends on.

The fastest way to add that mirror step was to write it into session-handoff. The Skill already had a list of finishing actions. I could append one more bullet, write the mirror logic inline, and be done. The whole thing would have taken twenty minutes.

I didn’t. I made the mirror step its own Skill called sync-to-repo and had session-handoff invoke it as the final step. The total work was maybe ninety minutes instead of twenty.

That’s the kind of decision I want to name. Because the inline version would have worked. It would have shipped sooner. And it would have been the wrong shape.

ii.The composability case

Here’s why the separate Skill won.

The mirror step has at least three legitimate invocation contexts. The one I built it for is end-of-session, fired from inside session-handoff. The second is mid-session, when I want a manual checkpoint after a heavy editing pass without wrapping the whole session. The third is a recurring cron-style fire, eventually, so the repo stays current even if I forget to wrap a session cleanly.

If the mirror step lived inside session-handoff as inline logic, only the first context would work. The second and third would require either duplicating the logic in two more places, or refactoring it out into a Skill later, after I’d already paid the cost of inlining it once. The refactor-later path is the one I keep promising myself I’ll come back to and never do.

Building it as its own Skill from the start meant the mid-session and recurring fires came for free. session-handoff calls it. I can call it manually whenever I want. A scheduled job can call it. None of those callers need to know how the mirror works internally. They just need to know the Skill exists and what it returns.

That’s the composability case. A capability that’s likely to be invoked from more than one context wants its own surface. Wrap it once, call it from anywhere.

On Skill granularity Composable Skills earntheir place via reuse,not registry size.

iii.Reuse as the test, not registry size

The lazy version of this argument is “always extract to a separate component.” That’s not what I’m saying. Most adjacent behaviors should stay inline. A workflow that lists fourteen steps doesn’t want fourteen Skills underneath it; it wants one Skill that runs fourteen steps. The granularity has to earn its place.

Why my AI agent uses Skills instead of monolithic workflows

The test I’ve started applying is reuse. A behavior becomes its own Skill when I can name at least two legitimate invocation contexts for it. One invoker is inline territory. Two or more invokers is Skill territory.

sync-to-repo passed the test cleanly. Three invokers in the foreseeable future. The Skill is justified.

A different behavior I considered extracting last month failed the same test. It was a small file-cleanup step at the tail of a writing workflow. I could imagine it standing alone, but I couldn’t name a second invoker. The cleanup was tied to the specific writing workflow it sat inside. Extracting it would have produced a Skill that nothing else was ever going to call. That one stayed inline.

This matters because the failure mode I want to avoid is registry bloat. An agent with fifty Skills isn’t more capable than an agent with fifteen. It’s more confused. Every new Skill is a new surface the agent has to route against, a new file to maintain, a new place where assumptions can drift out of sync with the rest of the system. The argument for separate Skills only holds if each one earns its place by composing against others or serving multiple invokers. The argument breaks the moment Skills start existing just to be Skills.

So the test I’m running is two-sided. Compose when reuse is real. Inline when it’s not. Both directions are disciplines. The discipline that holds the system together is refusing to ship anything that doesn’t pass one of the two tests honestly.

iv.When inline still wins

Even with the reuse test in place, plenty of behaviors should stay inline. Worth naming the cases where the separate-Skill instinct is wrong.

Behaviors that are genuinely sequence-specific belong inline. If a step only makes sense as part of a particular workflow, extracting it produces a Skill that exists in name only. The cleanup step I left inline is an example. It was tied to that specific writing workflow. There was no second context.

Behaviors that share data tightly with their caller belong inline. If extracting a step requires passing five context variables to the new Skill and having it return four more, the Skill boundary is fighting the shape of the work. The data flow is telling you the behavior wants to live with the workflow.

Behaviors that are throwaway belong inline. If you’re not sure the workflow itself will survive the next iteration, extracting one of its steps into a Skill is premature investment. Let the workflow stabilize. Extract later if reuse signals show up.

The pattern I’m naming isn’t “always extract.” It’s “extract when composability is real, and have a sharp test for when it is.”

Composable Skillsearn their placevia reuse,not registrysize.

v.What the platform’s converging on

The reason this matters past the one decision about sync-to-repo is that it’s part of a pattern I keep seeing across the agent’s build. Earlier the same question came up for a freshness-audit Skill called verify-state-fresh. Could have been an inline check inside the handoff workflow. Made it a Skill. Two invokers minimum, plus an “on explicit request” surface that wouldn’t have existed if it stayed inline.

Same question came up for a content-flagging Skill called flag-for-backlog. Could have been a step at the end of the handoff. Made it a Skill. Mid-session use case showed up within a week of shipping it, and would have been impossible if the behavior had stayed inline.

sync-to-repo is the third instance of the same pattern. The answer keeps converging. Composable Skills earn their place via reuse, not registry size. Skills that don’t pass the reuse test don’t get to exist. Skills that do pass it get to exist and get to be called from anywhere.

That’s the architecture I’m betting on. Not because Skills are inherently better than inline steps, but because the agent’s capability surface is going to grow either way, and the question is whether it grows as a graph of composable pieces or as a pile of monolithic workflows with hidden duplication. The graph wins long-term. The pile feels faster short-term and rots.

The discipline isn’t “always make Skills.” It’s “make Skills when reuse is real, and be honest about when it isn’t.” Both halves of the test are load-bearing. Without the first, you under-build and end up with fat workflows. Without the second, you over-build and end up with registry bloat.

The right answer is the discipline. Not the default in either direction.

Drafted with Bishop, my AI partner.
Words picked, edited, and approved by me. Model provenance: Claude Code (Claude Opus and Sonnet)