The Mozart Principle

A 200K Window, Played Indefinitely

Anthropic's models give you a 200,000 token context window. Most systems treat that as a ceiling — fill it, hit it, truncate, repeat.

DevGPT treats it as an instrument: a fixed canvas where what's stable recurs from cache, while only the smallest possible slice is freshly composed each turn. Scroll to see how the window is built.

Layer 1 · 2K tokens

System Prompt

At the top of every request sits the system prompt — instructions, persona, formatting rules. It almost never changes, so it's the first thing placed above the cache boundary.

Every cache hit starts here.

Layer 2 · +4K tokens

Org Memory

Shared institutional knowledge — team conventions, architecture decisions, codebase summaries — identical across every user in an org.

Cache it once, and every subsequent request from anyone in that org reads it for free.

Layer 3 · +4K tokens

Team Memory

One layer down: team-specific context — which services this team owns, current priorities, recent incidents.

Still shared across users, still cache-eligible, refreshed on a slower cadence than session state.

Layer 4 · +6K tokens

Session & User Memory

Per-user state — what you're working on, recent decisions, your preferences. This refreshes more often than org or team memory.

But within a session, it's stable enough to sit above the cache boundary too.

Layer 5 · +12K tokens

Tool Schemas

Tool and function definitions are large, verbose, and almost never change mid-session.

They're some of the highest-value real estate in the cache — pay for them once, reuse them for the entire session.

Layer 6 · +152K tokens

Rolling Conversation Window

Here's where most systems get it wrong: they resend the entire conversation, uncached, every single turn.

DevGPT manages a rolling window with cache breakpoints placed deliberately — so only the newest few turns fall outside the cached prefix. The rest of the conversation stays cached as it grows.

Layer 7 · +20K tokens

The Live Turn

The smallest possible slice — your newest message and the model's response — is the only part that always hits live inference.

Everything above it: read from cache, nearly free, nearly instant.

The Result

The Window Never Resets — It Develops

A musical theme doesn't restart every time it recurs — it's restated, varied, built on. That's what a well-managed 200K window does.

The stable parts — system prompt, memory, tools, the bulk of the conversation — recur from cache. Only the newest material is freshly composed. Run well, the window never has to reset.

This is what gets DevGPT to ~90% prompt cache hit rates, ~20M peak cache reads/min, and ~6M TPM of live inference against a 30M budget — the same window, sustained indefinitely.

Context Window Cache hit · 0%

System Prompt 2K cached

Org Memory 4K cached

Team Memory 4K cached

Session & User Memory 6K cached

Tool Schemas 12K cached

Rolling Conversation 152K cached

Live Turn 20K live

200,000 tokens · composed once, replayed indefinitely

Beyond the Window

Managing the Conversation Itself

The rolling 152K conversation window isn't just a buffer that grows until it's full — it's actively managed, turn by turn, with a handful of techniques that keep it healthy indefinitely.

Here's a representative slice of a real DevGPT session — nine turns into a refactor — and the techniques applied to it as the conversation grows.

Technique 1 · Cache Breakpoints

Cache Breakpoints

Anthropic's prompt cache works on prefixes: everything up to a cache_control breakpoint is either a hit or a miss, as a block.

DevGPT places a breakpoint at the start of each new user turn. Every previous turn — system prompt, memory, tools, and the entire prior conversation — reads from cache. Only the newest turn is ever priced as a miss.

Technique 2 · The Sliding Window

The Sliding Window

As the conversation grows past the first few turns, the oldest ones slide out of the "hot" window — but they aren't deleted.

They move into a compacted memory tier, ready to be summarized rather than resent verbatim.

Technique 3 · Snipping Tool Results

Snip, Don't Resend

Tool calls — reading a 240-line file, running a test suite with 400 lines of output — dump enormous amounts of text into the conversation.

Once that output has served its purpose, DevGPT snips it down to a head and tail, joined by a literal [...snipped...] marker — enough for the model to recall what happened without resending the whole thing on every turn.

Technique 4 · Folded Messages

Folding Related Turns

A back-and-forth that resolves cleanly — "add tests," "wrote the tests," "ran them, they pass" — doesn't need to stay as three separate turns forever.

DevGPT folds resolved clusters into a single collapsed entry with a one-line summary. The detail is still retrievable if needed, but it stops costing tokens on every turn.

Technique 5 · Microcompaction

Microcompaction

Snipping and folding aren't one-time events — they run continuously as a lightweight background pass, trimming a little more off older, already-condensed turns each time the conversation advances.

It's the difference between a single jarring cleanup and a conversation that quietly stays tidy the whole time.

Technique 6 · Background Workers

A Haiku Generator on the Side

Not every tool call needs to block the main conversation. DevGPT runs a small async generator on Haiku alongside the primary model — it watches for outstanding tool calls, picks them up, and waits on the results.

When a result lands, the Haiku worker folds it into a short note and writes it back into the main window as a tool result — so the primary model can fire off several tool calls in one turn and keep going, instead of waiting on each one in sequence.

Technique 7 · Autocompaction

Autocompaction at 167K

Microcompaction buys time, but eventually a session crosses ~167,000 of the 200,000-token budget — close enough to the ceiling that something has to change.

At that point DevGPT runs a full compaction pass: everything before a literal [COMPACTION_BOUNDARY] marker is replaced with a single dense summary. The conversation that follows starts fresh against that summary, well clear of the ceiling, and keeps going.

A Subtle Lever

Past Tense vs. Future Tense

How the compacted summary is phrased matters almost as much as what's in it. "Tests were fixed and pass" reads as a closed fact. "Tests will be fixed" reads as an open task.

Summaries — and instructions generally — are written in the past tense, describing what was done, not the future tense describing what should happen next. A future-tense summary reads like a to-do list the model feels obligated to revisit; a past-tense one reads like history it can build on.

Available, Not Always Used

Beta Headers Worth Knowing

Anthropic ships some of this behind beta headers. fine-grained-tool-streaming-2025-05-14 streams tool-call arguments incrementally instead of waiting for the full JSON block — useful for the Haiku side-channel above, since it can start acting on a call before the primary model finishes emitting it.

There's also a context-1m-* beta for a 1M-token context window. DevGPT doesn't use it — the techniques on this page are what make a steady 200K window sufficient — but it's there if a workload ever needs the extra headroom.

The Failure Mode

The Word That Causes Infinite Loops

All of this works as long as the agent doesn't go looking for what's no longer there. The phrase that breaks it: "re-read the conversation," "review everything above," "go through all the messages."

Folded and snipped content, and anything before [COMPACTION_BOUNDARY], is gone from the live context — replaced by a summary. An instruction to "re-read" sends the agent hunting for the original text. It isn't there, so it calls its tools again to find it, gets the same compacted result, and asks again. Without a guard, that's an infinite loop from a single word.

Put Together

Indefinite Context, By Design

Cache breakpoints, sliding windows, snipped tool output, folded turns, microcompaction, background workers, and autocompaction at the [COMPACTION_BOUNDARY] — seven small disciplines, applied continuously, are what let a single session run for hours without ever resetting or hitting a wall.

Two rules layered on top of all of it: write summaries in the past tense, as completed fact rather than pending work — and never ask the agent to "re-read" what's already been compacted. Ask it to continue from the summary instead.

Conversation Timeline 142,000 / 200K · 92%

Refactor the auth module

Read auth.go · 240 lines

Refactored token validation across 3 files

Now add unit tests for the refresh path

Wrote auth_test.go — 6 new cases

go test ./auth/... · 400+ lines

One test is failing on nil context — fix it

Added nil-context guard, all tests pass

Re-read the full conversation and confirm everything's correct

90%

Prompt cache hit rate

20M

Peak cache reads / min

Live inference TPM (of 30M budget)

167K

Autocompaction threshold (of 200K)

← Back to Projects View DevGPT Architecture →