June 2026 | Issue 002 | Volume I
Essay AI Memory Architecture 9 min read

Cheap by default, deep when needed

An AI memory system should auto-load the smallest possible index, then fetch deeper layers only when the task asks. The session-start token load dropped by an order of magnitude when I built the tiers.

By Ryan Gonzales
Co-author Bishop
Filed under AI / Memory Architecture / Personal-AI
Date May 21, 2026

For a long time my mental model of an AI’s memory was a single shelf. Everything Bishop might need to know lived on the shelf, and the agent walked the whole shelf every time it answered me. The shelf got longer. The walks got slower. The answers got dumber.

The fix wasn’t a bigger shelf. The fix was tiers.

Tiered context access is the discipline that says an AI memory system should auto-load the smallest possible index at the top, then fetch deeper layers only when the current task asks for them. Cheap by default. Deep when needed. The agent never reads the whole vault. It reads the index, decides what the task touches, and pulls only the leaves that matter.

I want to walk through the architecture, because the shift in token cost was the kind of number that resets your sense of what’s possible.

i.The flat-read default and why it costs

Most personal-AI builders start where I started. They give the agent a folder full of notes and a top-level instructions file that says read everything before you answer. It feels generous. It feels like you’re handing the agent its full knowledge base.

What it actually does is suffocate every session inside a context budget the agent now spends ingesting documents the current turn doesn’t need.

The math is brutal once you measure it. A wiki of forty docs at an average of eight hundred tokens each is thirty-two thousand tokens of load before the user types a single word. The agent has a finite window. Every token of vault load is a token the actual reasoning can’t use. By the time the conversation gets to anything interesting, half the window is gone to documents that had nothing to do with the question.

The failure mode is worse than slowness. The agent loses the plot. It starts referencing the wrong doc. It pattern-matches against a paragraph from two clusters away because that paragraph is in the window and the relevant paragraph is buried. The flat-read default corrupts retrieval at the same time it inflates cost.

I’d been running a version of the flat-read default for weeks without admitting it. Bishop was reading more than it needed every session, and the response quality was creeping downward, and I was treating it as a model issue rather than an architecture issue.

It was an architecture issue.

ii.The tiered shape

The fix came from two places at once, which is the part I want builders to take seriously.

Nate Herk has a pattern he calls the AIOS hot-cache. The hot cache is a small, frequently updated file that sits at the top of the agent’s context and captures recency. What did the user touch this week. What’s open. What’s flagged. The hot cache is cheap to load and high-signal for routing the rest of the turn.

Andrej Karpathy has a different formulation, plainer. Don’t read the wiki unless you need it. The instruction is one line. The implication is structural. You tell the agent the wiki exists, you tell it how to find what’s relevant, and you tell it explicitly not to read it pre-emptively.

Both ideas point at the same shape. Hierarchical retrieval. Index first. Wiki on demand. Atomic memory deeper still. The agent walks the tiers downward only when the task asks for the depth.

I codified the shape inside Bishop’s CLAUDE.md, the master instructions file the agent reads at session start. The tier ladder reads like this. Auto-loaded every turn already: CLAUDE.md itself, the memory index, the decisions log, the hot-cache file, the persistent-memory observations. That’s the top of the shelf. On-demand only: full wiki synthesis docs, individual atomic memory files, full task list, full session handoff. The agent fetches those when the task touches them, not before.

Three tiers. Each one more expensive than the one above it. Each one explicitly out of bounds until the current turn earns its load.

Cheap by default, deep when needed

iii.Index, then wiki, then atomic

The clean version of the discipline is the order of operations.

The index is the cheapest layer. It’s a short list of pointers. Each pointer is one line, naming a topic and the file it lives in. The agent reads the index and forms a hypothesis. The user is asking about voice calibration. The relevant wiki cluster is brand. The relevant synthesis doc is voice-architecture. Load that doc.

The wiki is the middle layer. Each wiki doc is a synthesis. It’s the doc you wrote to capture the shape of a topic, not the raw notes that fed into it. The agent loads the synthesis when the index points at it. The synthesis carries enough context to answer most questions without going deeper.

The atomic memory is the deepest layer. These are the individual notes. The dated entries. The single-observation files. The agent only opens an atomic file when the synthesis doc points at it as load-bearing for the current question, or when the user explicitly asks for the raw context the synthesis was built from.

The agent reads downward. It never reads sideways. It never grabs the whole tier. It walks the chain index then wiki then atomic only as far as the question pulls it.

What this gives you is a memory system that scales. The vault can hold thousands of atomic notes and hundreds of synthesis docs and the per-turn ingestion cost stays constant. The agent’s working memory carries the index plus whatever the current task pulled in. Nothing else.

The thesis Cheap by default.Deep when needed.

iv.The token math

I want to put numbers against this because the savings were not marginal.

Before tiered access, Bishop was loading the most recent handoff file at every session start. The handoff was thorough. It carried the prior session’s decisions, the open threads, the next-steps, the context-window state at sign-off. A good handoff ran twelve to eighteen thousand tokens. Every new session started by paying that tax.

The handoff was useful when the new session was a direct continuation. It was wildly expensive when the new session was about something else.

I split the handoff into two layers. The hot-cache file carries the lightweight recency signal. What’s open. What’s flagged. What needs eyes. Two hundred tokens at the top, every session. The full handoff stays on disk and gets loaded on demand, only when the hot-cache flags an unresolved thread, or when I explicitly ask Bishop to resume specific prior work.

The session-start token load dropped by something between seventy and ninety percent, depending on the size of the prior handoff. That’s the number that ended the argument.

The same restructuring landed across the rest of the spine. Full wiki docs went on-demand. Full task list went on-demand. The agent kept loading the index plus the hot cache plus a few standing docs that earned their permanent slot. Everything else moved behind the tier wall.

Cheap by default. Deep when needed. The thesis is two phrases and the savings are everything.

The agent'scontext windowis not free.Spend it on what'sload-bearing now.

v.What this teaches builders

If you’re building an AI memory layer on top of any knowledge substrate, three pieces of this generalize.

First, separate the index from the body. The index is short and stable and gets loaded every turn. The body is long and grows and gets loaded only when the index points at it. Most builders skip this step because writing a good index feels like overhead. The index is the whole architecture. Without it, the agent has no way to walk the tiers.

Second, write your top-level instructions file as a routing document, not a knowledge document. The instructions should tell the agent how to find what it needs, not give the agent everything it might possibly need. CLAUDE.md, or its equivalent in whatever harness you’re running, has one job. Read narrow. Fetch deeper layers only when the task asks for them. That line, codified explicitly, is what makes the rest of the discipline load-bearing.

Third, build the hot cache. Whatever it’s called in your system, you want a small recency layer that sits at the top and gives the agent enough context to route the session without ingesting the full prior state. Two hundred tokens of recency at the top of every session is worth more than twelve thousand tokens of last-session detail the current turn doesn’t need.

The lesson under all three pieces is the same. The agent’s context window is not free. Every token you spend on documents that don’t serve the current turn is a token the actual reasoning can’t use. Treat the context budget the way you’d treat a budget for anything else. Spend it on what’s load-bearing now. Defer the rest until the task earns the load.

Bishop’s session-start cost dropped by an order of magnitude when I built the tiers. The wiki kept growing. The atomic memory kept accumulating. The per-turn load stayed flat. That’s the property you’re trying to build. A memory system whose cost stays constant while its depth keeps compounding.

Cheap by default. Deep when needed. The discipline is one line in the instructions file and a clean ladder behind it. Everything downstream falls into place.

Drafted with Bishop, my AI partner.
Words picked, edited, and approved by me.