I Optimized My Context Window and Saved 77% on Orientation Tokens

My session_memory.md file used to be eleven kilobytes.

Eleven kilobytes is not a large file by any normal standard. It’s smaller than a single browser cookie pile. It’s a few seconds of a podcast. As a Word document it would be one page with generous margins. But in every Claude session I started, those eleven kilobytes were the first thing the model read. That is roughly three thousand tokens of overhead before the conversation even began. Three thousand tokens at Sonnet pricing is about a penny. A penny per session at two hundred sessions a month is a couple of dollars I was paying to make Claude re-read the same file it had read yesterday and the day before and the week before that.
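The arithmetic above can be checked in a few lines. This is a rough sketch, not a measurement: the four-bytes-per-token ratio is a common rule of thumb for English text, and the $3 per million input tokens figure is an assumed price chosen to match the "about a penny" estimate. Neither number comes from a real tokenizer or an invoice.

```python
# Back-of-the-envelope cost of re-reading a memory file every session.
# ASSUMPTIONS (illustrative only): ~4 bytes per token for English prose,
# and an input price of $3.00 per million tokens.

BYTES_PER_TOKEN = 4            # rule-of-thumb ratio, assumed
PRICE_PER_MILLION_USD = 3.00   # assumed input price


def estimate_tokens(size_bytes: int) -> int:
    """Rough token count for a text file of the given size."""
    return size_bytes // BYTES_PER_TOKEN


def monthly_cost_usd(tokens_per_session: int, sessions_per_month: int) -> float:
    """Cost of loading the same preamble at the start of every session."""
    total_tokens = tokens_per_session * sessions_per_month
    return total_tokens * PRICE_PER_MILLION_USD / 1_000_000


if __name__ == "__main__":
    tokens = estimate_tokens(11 * 1024)
    print(f"estimated tokens per session: {tokens}")
    print(f"per-session cost: ${monthly_cost_usd(tokens, 1):.4f}")
    print(f"monthly cost at 200 sessions: ${monthly_cost_usd(tokens, 200):.2f}")
```

Under these assumptions an eleven-kilobyte file lands a little under three thousand tokens, and the monthly bill stays under two dollars.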

Money was the least of it. The actual cost was time. Every session waited for the model to chew through three thousand tokens of preamble before getting to the part where I asked it to do something. The time-to-first-token was longer than it needed to be by a margin I could feel. Responses were a little duller, too — or at least I told myself they were. By the time Claude got to my actual question, the context window was already padded with eleven kilobytes of stuff it didn’t strictly need to do the thing I was asking.

So I reorganized.

The new structure has three layers. The first is a CLAUDE.md at the top, roughly 350 tokens, loaded automatically every session. It contains who I am, what I do, the hard rules (no employer references, only push to linuxlsr repos, never apply Terraform without confirmation, that kind of thing), and my tool preferences (python3 not python, what shell I use, what cloud I’m on). Stuff that’s true every session and that the model should know without thinking. The second is a MEMORY.md, roughly 200 tokens, that is just an index: one line per memory file, each line saying what the file is about. That’s it. The index doesn’t contain the memory itself. It contains the pointer to the memory and the description of what’s in it.
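One way to picture the index layer is a small script that turns a folder of memory files into one-line entries. This is a hypothetical sketch: the `memory` folder name, the `- filename: description` line format, and the convention that each file opens with a one-line description are all assumptions made for illustration, not anything Claude requires.

```python
from pathlib import Path


def build_index(memory_dir: Path) -> str:
    """Return one index line per .md file: '- <filename>: <description>'.

    ASSUMPTION: the first non-empty line of each memory file is a short
    human-written description of what the file contains.
    """
    lines = []
    for path in sorted(memory_dir.glob("*.md")):
        description = "(no description)"
        for raw in path.read_text(encoding="utf-8").splitlines():
            # Drop surrounding whitespace and any leading markdown '#' marks.
            text = raw.strip().lstrip("#").strip()
            if text:
                description = text
                break
        lines.append(f"- {path.name}: {description}")
    return "".join(f"{line}\n" for line in lines)


if __name__ == "__main__":
    memory_dir = Path("memory")  # hypothetical folder name
    if memory_dir.is_dir():
        print(build_index(memory_dir), end="")
```

The output is meant to be pasted into, or written over, MEMORY.md. The point of generating it is that the index stays a list of pointers and never grows a copy of the content.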

The third layer is the memory folder, which gets read on demand. Project-specific context lives there. Reference data lives there. Schemas, infrastructure scope, anything that’s only relevant when working on a specific thing. Today I added reference_sites_and_infra.md to it, with the AWS account scope and the site bucket mapping. The cost-delta scheduled task references that file by name. When the task runs, it reads CLAUDE.md and MEMORY.md, sees that reference_sites_and_infra.md exists, decides it’s relevant to the current task, and fetches it. Total tokens loaded for that session: somewhere around seven hundred. Down from three thousand. Same coverage. Better performance.
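The on-demand step can be sketched as below. In a real session the model itself reads the index and decides what to fetch; the keyword-overlap check here is only a stand-in for that judgment. The index line format, the `memory` subfolder, and every function name are hypothetical.

```python
from pathlib import Path


def parse_index(index_text: str) -> dict[str, str]:
    """Map filename -> description from lines like '- file.md: description'."""
    entries = {}
    for line in index_text.splitlines():
        line = line.strip()
        if not line.startswith("- ") or ":" not in line:
            continue
        name, description = line[2:].split(":", 1)
        entries[name.strip()] = description.strip()
    return entries


def pick_relevant(entries: dict[str, str], task: str) -> list[str]:
    """Stand-in for the model's judgment: fetch a file when its description
    shares at least one word of four or more letters with the task."""
    task_words = {w for w in task.lower().split() if len(w) > 3}
    chosen = []
    for name, description in entries.items():
        if task_words & set(description.lower().split()):
            chosen.append(name)
    return chosen


def load_context(root: Path, task: str) -> str:
    """Always load CLAUDE.md and MEMORY.md; read memory files only on demand."""
    always = [root / "CLAUDE.md", root / "MEMORY.md"]
    parts = [p.read_text(encoding="utf-8") for p in always]
    entries = parse_index(parts[1])
    for name in pick_relevant(entries, task):
        parts.append((root / "memory" / name).read_text(encoding="utf-8"))
    return "\n\n".join(parts)
```

Files whose descriptions do not match the task are never opened, which is where the token savings come from.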

Seven hundred tokens instead of three thousand is roughly a 77% reduction in per-session overhead. The cost savings are trivial: call it a dollar and a half a month at my volume. The performance change is not trivial. Responses come back faster. The model isn’t dragging eleven kilobytes of preamble around before doing the actual reasoning. Time-to-first-token measurably dropped. Sessions feel snappier in the way a laptop feels snappier after you close forty Chrome tabs.
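The reduction is simple enough to verify by hand, but here it is as a short worked calculation. The token counts and the two hundred sessions a month are the rough figures from this post, not measured values.

```python
def percent_reduction(before: int, after: int) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return (before - after) / before * 100


before_tokens = 3000        # old per-session overhead (approximate)
after_tokens = 700          # new per-session overhead (approximate)
sessions_per_month = 200    # volume quoted earlier in the post

reduction = percent_reduction(before_tokens, after_tokens)
tokens_saved = (before_tokens - after_tokens) * sessions_per_month

print(f"reduction: {reduction:.1f}%")                # reduction: 76.7%
print(f"tokens saved per month: {tokens_saved:,}")   # tokens saved per month: 460,000
```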

The reframe: I stopped treating context like a packing list and started treating it like a database.

That is the more honest framing for what I did here.

Old mental model: give Claude everything it might possibly need at the start of the session, so it doesn’t have to ask for anything. Pack heavy. Be prepared.

New mental model: give Claude an index it can query. Pack light. Trust the model to fetch what it needs.

This is the eager-load versus lazy-load pattern from web architecture, applied to AI context windows. We have known forever that “load everything you might need on startup” is a bad system design. Boot times suffer. Memory usage balloons. Adding a small thing to the startup path is suddenly expensive because everyone pays the cost on every boot. We’ve known forever that “fetch on demand, cache when hot” is a better architecture. Lazy load is the default in every modern web framework, every modern ORM, every modern CDN. We didn’t apply the pattern to AI context windows for the first year or two because context windows were small and the overhead didn’t matter. Now context windows are huge and the overhead is starting to matter. The pattern transfers. The lesson is the same lesson we’ve been re-learning since the first time someone wrote a poorly cached SQL query.
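For anyone who has not met the pattern outside of AI tooling, here is a minimal sketch of the two approaches side by side, using Python’s standard `functools.lru_cache`. It illustrates the general idea only; it is not how Claude loads files, and the function names are made up for this example.

```python
from functools import lru_cache
from pathlib import Path


def eager_load(memory_dir: Path) -> dict[str, str]:
    """Packing-list approach: read every file up front, needed or not."""
    return {
        p.name: p.read_text(encoding="utf-8")
        for p in memory_dir.glob("*.md")
    }


@lru_cache(maxsize=32)
def lazy_load(path: str) -> str:
    """Database approach: read a file the first time it is asked for,
    then serve repeat requests from the in-process cache."""
    return Path(path).read_text(encoding="utf-8")
```

With `eager_load`, every new file in the folder adds to the cost of every startup. With `lazy_load`, a file costs nothing until something asks for it, and a second request for the same path is answered from the cache without touching the disk.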

If your CLAUDE.md is over about 500 tokens, you can probably split it. If your session_memory.md is over a kilobyte, you almost certainly can. The shape of the split is a ruthlessly small top-level file with just the identity and hard rules, a tiny index pointing at deeper files, and deeper files that get read only when relevant. Add a one-line description to each entry in the index, so the model can decide whether to fetch based on the description alone. You’d be surprised how often the right answer is “don’t fetch, you don’t need it.”
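A quick audit along these lines can be scripted; it is a slightly friendlier `wc -c`. The sketch below is hedged in two ways: it estimates tokens with an assumed four-bytes-per-token rule of thumb rather than a real tokenizer, and it applies one 500-token budget to every file for simplicity.

```python
from pathlib import Path

BYTES_PER_TOKEN = 4  # assumed rule of thumb, not a real tokenizer


def audit(paths, token_budget=500):
    """Return (path, size_bytes, est_tokens, over_budget) for each existing file."""
    report = []
    for path in map(Path, paths):
        if not path.is_file():
            continue  # silently skip files that are not there
        size = path.stat().st_size
        tokens = size // BYTES_PER_TOKEN
        report.append((str(path), size, tokens, tokens > token_budget))
    return report


if __name__ == "__main__":
    candidates = ["CLAUDE.md", "MEMORY.md", "session_memory.md"]
    for name, size, tokens, over in audit(candidates):
        flag = "consider splitting" if over else "ok"
        print(f"{name}: {size} bytes, ~{tokens} tokens [{flag}]")
```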

Things that belong in the always-loaded CLAUDE.md: your identity, your hard rules, tool preferences that affect every session. Hard rules in particular — anything that says “never do X” or “always use Y” — should be in the always-loaded layer, because by the time you fetch it on demand, the model may already have made the mistake the rule exists to prevent.

Things that belong in on-demand memory: project-specific schemas, infrastructure scope, account identifiers, naming conventions for a specific codebase, anything that’s only relevant when working on a specific thing. The cost-delta task’s reference file is the canonical example: 100% relevant when running the cost-delta task, 0% relevant for anything else.

Things that probably don’t belong in memory at all: the current shape of your code (it’ll be wrong in three weeks), the open issues in your repo (GitHub knows, Claude doesn’t need to), anything that mutates faster than you can curate. Memory should be for things that are stable enough to be worth writing down once.
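The three buckets above boil down to a small decision rule. Here is one hypothetical way to write it down. The field names and the order of the checks are my own simplification of the prose, not a formal scheme.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    ALWAYS_LOADED = "CLAUDE.md"
    ON_DEMAND = "memory folder"
    NOT_STORED = "leave it out"


@dataclass
class Item:
    name: str
    hard_rule: bool = False       # "never do X" / "always use Y"
    every_session: bool = False   # identity, tool preferences
    changes_often: bool = False   # code shape, open issues


def triage(item: Item) -> Layer:
    """Decide which layer a piece of context belongs in."""
    # Hard rules go first: fetched on demand, they may arrive too late.
    if item.hard_rule or item.every_session:
        return Layer.ALWAYS_LOADED
    # Anything that mutates faster than it can be curated stays out.
    if item.changes_often:
        return Layer.NOT_STORED
    # Stable but only sometimes relevant: fetch when needed.
    return Layer.ON_DEMAND
```

For example, `triage(Item("site bucket mapping"))` returns `Layer.ON_DEMAND`, while `triage(Item("open issues", changes_often=True))` returns `Layer.NOT_STORED`.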

The work is in the curation, not the writing. Anybody can dump eleven kilobytes into a memory file. The discipline is figuring out which of those eleven kilobytes were doing useful work and which were just sitting there increasing the boot cost of every session. Mine, when I audited, broke down roughly as: 30% useful identity and hard rules, 20% useful project context, 50% out-of-date or never-actually-referenced material that I had added once for some forgotten reason and never cleaned up. The 50% was the easy win. Moving the 20% to on-demand was the medium win. The 30% stayed put, and that’s the new CLAUDE.md.

I’ll keep refining it. Three weeks from now I’ll probably trim the CLAUDE.md again. Three months from now I’ll probably split the project memory into per-project files. Six months from now the whole approach will look quaint, because Anthropic will have shipped some better-than-I-can-imagine memory system that does the curation for me. Until then I am doing the curation by hand, on a Tuesday afternoon, with a wc -c and a willingness to delete.

This is what optimization looks like in the AI age. Not faster CPUs. Not cleverer algorithms. Smaller homework assignments. Your model is smart. Your model is fast. Your model is, possibly, spending the first three thousand tokens of every session reading a file you wrote three months ago and forgot to trim. Trim it. Make the homework smaller. Watch the response times drop and the bills shrink and the answers, sometimes, get sharper — because the model wasn’t drowning in irrelevant preamble before it got to your actual question.

The bottom line: The first seven hundred tokens your AI reads in any session are now the most expensive seven hundred tokens it’ll read all day. Spend them well. The rest is on-demand.
