The Mozart
Principle
Context, Cache & Prompt Management at Scale
A name for the discipline behind running effectively unbounded conversations inside a fixed 200K token window — context assembly, memory, tool schemas, and prompt caching, orchestrated together so the window recurs and develops instead of resetting.
A 200K Window, Played Indefinitely
Anthropic's models give you a 200,000 token context window. Most systems treat that as a ceiling — fill it, hit it, truncate, repeat.
DevGPT treats it as an instrument: a fixed canvas where what's stable recurs from cache, while only the smallest possible slice is freshly composed each turn. Scroll to see how the window is built.
System Prompt
At the top of every request sits the system prompt — instructions, persona, formatting rules. It almost never changes, so it's the first thing placed above the cache boundary.
Every cache hit starts here.
Org Memory
Shared institutional knowledge — team conventions, architecture decisions, codebase summaries — identical across every user in an org.
Cache it once, and every subsequent request from anyone in that org reads it for free.
Team Memory
One layer down: team-specific context — which services this team owns, current priorities, recent incidents.
Still shared across users, still cache-eligible, refreshed on a slower cadence than session state.
Session & User Memory
Per-user state — what you're working on, recent decisions, your preferences. This refreshes more often than org or team memory.
But within a session, it's stable enough to sit above the cache boundary too.
Tool Schemas
Tool and function definitions are large, verbose, and almost never change mid-session.
They're some of the highest-value real estate in the cache — pay for them once, reuse them for the entire session.
Rolling Conversation Window
Here's where most systems get it wrong: they resend the entire conversation, uncached, every single turn.
DevGPT manages a rolling window with cache breakpoints placed deliberately — so only the newest few turns fall outside the cached prefix. The rest of the conversation stays cached as it grows.
The Live Turn
The smallest possible slice — your newest message and the model's response — is the only part that always hits live inference.
Everything above it: read from cache, nearly free, nearly instant.
The Window Never Resets — It Develops
A musical theme doesn't restart every time it recurs — it's restated, varied, built on. That's what a well-managed 200K window does.
The stable parts — system prompt, memory, tools, the bulk of the conversation — recur from cache. Only the newest material is freshly composed. Run well, the window never has to reset.
This is what gets DevGPT to ~90% prompt cache hit rates, ~20M peak cache reads/min, and ~6M TPM of live inference against a 30M budget — the same window, sustained indefinitely.
Managing the Conversation Itself
The rolling 152K conversation window isn't just a buffer that grows until it's full — it's actively managed, turn by turn, with a handful of techniques that keep it healthy indefinitely.
Here's a representative slice of a real DevGPT session — nine turns into a refactor — and the techniques applied to it as the conversation grows.
Cache Breakpoints
Anthropic's prompt cache works on prefixes: everything up to a cache_control breakpoint is either a hit or a miss, as a block.
DevGPT places a breakpoint at the start of each new user turn. Every previous turn — system prompt, memory, tools, and the entire prior conversation — reads from cache. Only the newest turn is ever priced as a miss.
The Sliding Window
As the conversation grows past the first few turns, the oldest ones slide out of the "hot" window — but they aren't deleted.
They move into a compacted memory tier, ready to be summarized rather than resent verbatim.
Snip, Don't Resend
Tool calls — reading a 240-line file, running a test suite with 400 lines of output — dump enormous amounts of text into the conversation.
Once that output has served its purpose, DevGPT snips it down to a head and tail, joined by a literal [...snipped...] marker — enough for the model to recall what happened without resending the whole thing on every turn.
Folding Related Turns
A back-and-forth that resolves cleanly — "add tests," "wrote the tests," "ran them, they pass" — doesn't need to stay as three separate turns forever.
DevGPT folds resolved clusters into a single collapsed entry with a one-line summary. The detail is still retrievable if needed, but it stops costing tokens on every turn.
Microcompaction
Snipping and folding aren't one-time events — they run continuously as a lightweight background pass, trimming a little more off older, already-condensed turns each time the conversation advances.
It's the difference between a single jarring cleanup and a conversation that quietly stays tidy the whole time.
A Haiku Generator on the Side
Not every tool call needs to block the main conversation. DevGPT runs a small async generator on Haiku alongside the primary model — it watches for outstanding tool calls, picks them up, and waits on the results.
When a result lands, the Haiku worker folds it into a short note and writes it back into the main window as a tool result — so the primary model can fire off several tool calls in one turn and keep going, instead of waiting on each one in sequence.
Autocompaction at 167K
Microcompaction buys time, but eventually a session crosses ~167,000 of the 200,000-token budget — close enough to the ceiling that something has to change.
At that point DevGPT runs a full compaction pass: everything before a literal [COMPACTION_BOUNDARY] marker is replaced with a single dense summary. The conversation that follows starts fresh against that summary, well clear of the ceiling, and keeps going.
Past Tense vs. Future Tense
How the compacted summary is phrased matters almost as much as what's in it. "Tests were fixed and pass" reads as a closed fact. "Tests will be fixed" reads as an open task.
Summaries — and instructions generally — are written in the past tense, describing what was done, not the future tense describing what should happen next. A future-tense summary reads like a to-do list the model feels obligated to revisit; a past-tense one reads like history it can build on.
Beta Headers Worth Knowing
Anthropic ships some of this behind beta headers. fine-grained-tool-streaming-2025-05-14 streams tool-call arguments incrementally instead of waiting for the full JSON block — useful for the Haiku side-channel above, since it can start acting on a call before the primary model finishes emitting it.
There's also a context-1m-* beta for a 1M-token context window. DevGPT doesn't use it — the techniques on this page are what make a steady 200K window sufficient — but it's there if a workload ever needs the extra headroom.
The Word That Causes Infinite Loops
All of this works as long as the agent doesn't go looking for what's no longer there. The phrase that breaks it: "re-read the conversation," "review everything above," "go through all the messages."
Folded and snipped content, and anything before [COMPACTION_BOUNDARY], is gone from the live context — replaced by a summary. An instruction to "re-read" sends the agent hunting for the original text. It isn't there, so it calls its tools again to find it, gets the same compacted result, and asks again. Without a guard, that's an infinite loop from a single word.
Indefinite Context, By Design
Cache breakpoints, sliding windows, snipped tool output, folded turns, microcompaction, background workers, and autocompaction at the [COMPACTION_BOUNDARY] — seven small disciplines, applied continuously, are what let a single session run for hours without ever resetting or hitting a wall.
Two rules layered on top of all of it: write summaries in the past tense, as completed fact rather than pending work — and never ask the agent to "re-read" what's already been compacted. Ask it to continue from the summary instead.