Five stages of gradual context compaction

Compacting the whole context at 50% is wasteful. Waiting until 99% triggers Context Rot. The middle path is a five-stage pipeline that stops at the first stage that resolves the pressure.

Paulo Henrique · 3 min read
A long-running agent eventually runs out of context. The naive responses are both wrong.

Compacting everything at 50% throws away information the agent will need. Waiting until the physical limit means the model is already degraded — Context Rot starts well before "full". The Chroma study (18 production models, 2025) and JetBrains' NeurIPS '25 paper both document the same curve: attention quality starts dropping around 60-70% of the window, not at the limit.

The middle path is gradual compaction. The pipeline has five stages, each progressively more aggressive. Each stage runs only if the previous one did not free enough budget. The cheap stages run often; the expensive stages almost never.

The pipeline

Warning (70%) ──▶ Mask (80%) ──▶ Prune (85%) ──▶ Aggressive (90%) ──▶ Compact (99%)

The numbers are tunable, but the monotonicity matters: each stage is cheaper to reverse than the next.
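
The control loop described above can be sketched roughly as follows. This is a minimal illustration, not any particular runtime's implementation: `Stage`, `run_pipeline`, and the word-count token counter in the usage lines are hypothetical names chosen for the example, and the stand-in stages simply drop the two oldest messages.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Messages = List[str]


@dataclass
class Stage:
    name: str
    threshold: float                       # fraction of the window that triggers the stage
    apply: Callable[[Messages], Messages]  # returns a (possibly smaller) message list


def run_pipeline(
    messages: Messages,
    stages: List[Stage],
    count_tokens: Callable[[Messages], int],
    window: int,
) -> Tuple[Messages, List[str]]:
    """Apply stages in order; stop as soon as usage is below the next threshold."""
    ran: List[str] = []
    for stage in stages:
        usage = count_tokens(messages) / window
        if usage < stage.threshold:
            break                          # pressure resolved, later stages never run
        messages = stage.apply(messages)
        ran.append(stage.name)
    return messages, ran


# Toy usage: one "token" per word, 100-token window (assumptions for the demo).
count_words = lambda msgs: sum(len(m.split()) for m in msgs)
demo_stages = [
    Stage("warning", 0.70, lambda m: m),    # non-destructive
    Stage("mask", 0.80, lambda m: m[2:]),   # stand-in: drop the two oldest messages
    Stage("prune", 0.85, lambda m: m[2:]),
]
history = ["w " * 10] * 9                   # 90 tokens, i.e. 90% of the window
result, ran = run_pipeline(history, demo_stages, count_words, window=100)
print(ran)  # ['warning', 'mask']
```

Usage is re-measured before every stage, which is what lets the loop exit early: in the demo, masking brings usage down to 70%, so pruning never runs.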

Stage 1 — Warning (70%)

What happens: nothing destructive. The runtime logs that the context is approaching limits and starts tracking which messages are candidates for the next stages.

Why it exists: observability. You cannot debug "why did the agent forget X" without a trace of what was pruned and when.

Cost: zero.

Stage 2 — Mask (80%)

What happens: large tool outputs (file reads, search results, command outputs) get masked in older messages — replaced with a placeholder like [file contents from foo.ts, 4kb, masked at turn 42]. The actual content goes to a side store, recoverable on request.

Why it works: tool outputs are the largest chunks in a typical agent context, and the content of an old file read is rarely needed once the agent has acted on it. The fact that the read happened is what matters.

Reversibility: high. If a later turn needs the masked content, the runtime unmasks it.
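
One way masking with a side store might look, as a sketch under assumptions: the `Message` shape, the `source` field, the 1024-character cutoff, and the five-turn recency guard are all hypothetical, and the side store is a plain in-memory dict rather than real storage.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Message:
    turn: int
    role: str                          # "user", "assistant", or "tool"
    content: str
    source: str = ""                   # hypothetical: where a tool output came from
    masked_key: Optional[str] = None   # set while the content lives in the side store


@dataclass
class Masker:
    min_chars: int = 1024              # assumed cutoff for a "large" output
    keep_recent: int = 5               # assumed: never mask the last N turns
    store: Dict[str, str] = field(default_factory=dict)  # side store: key -> content
    _next_id: int = 0

    def mask(self, messages: List[Message], current_turn: int) -> None:
        """Replace large, old tool outputs with a placeholder, in place."""
        for msg in messages:
            is_old = current_turn - msg.turn >= self.keep_recent
            is_large = len(msg.content) >= self.min_chars
            if msg.role != "tool" or msg.masked_key or not (is_old and is_large):
                continue
            key = f"mask-{self._next_id}"
            self._next_id += 1
            self.store[key] = msg.content
            size_kb = max(1, len(msg.content) // 1024)
            msg.content = (f"[tool output from {msg.source}, {size_kb}kb, "
                           f"masked at turn {current_turn}]")
            msg.masked_key = key

    def unmask(self, msg: Message) -> None:
        """Restore the original content if the message is currently masked."""
        if msg.masked_key is not None:
            msg.content = self.store.pop(msg.masked_key)
            msg.masked_key = None
```

The placeholder keeps the fact that the read happened (source, size, turn) in context, while the payload itself moves out and can be restored with `unmask`.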

Stage 3 — Prune (85%)

What happens: messages flagged as superseded are removed entirely. A superseded message is one the agent has explicitly replaced with a newer version — a file that was edited and read again, a task that completed, a tool call that was retried.

Why it works: production traces show that 20-30% of turns in long sessions are superseded by later turns. Removing them is structurally safe.

Reversibility: medium. The removed content is logged but not in-context.
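
A sketch of pruning, assuming supersession is tracked with a per-message key. The `key` field (for example `"read:foo.ts"`) is a hypothetical convention for this example; a real runtime would need its own way of knowing that one message replaces another.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Msg(NamedTuple):
    turn: int
    content: str
    key: Optional[str] = None   # e.g. "read:foo.ts"; None means it is never superseded


def prune_superseded(messages: List[Msg]) -> Tuple[List[Msg], List[Msg]]:
    """Keep only the latest message per key. Returns (kept, removed)."""
    latest: Dict[str, int] = {}
    for i, m in enumerate(messages):
        if m.key is not None:
            latest[m.key] = i            # a later index overwrites an earlier one
    kept: List[Msg] = []
    removed: List[Msg] = []
    for i, m in enumerate(messages):
        if m.key is None or latest[m.key] == i:
            kept.append(m)
        else:
            removed.append(m)            # written to the log, not kept in context
    return kept, removed


history = [
    Msg(1, "contents of foo.ts (v1)", key="read:foo.ts"),
    Msg(2, "please rename the helper"),
    Msg(3, "contents of foo.ts (v2)", key="read:foo.ts"),
]
kept, removed = prune_superseded(history)
print([m.turn for m in kept], [m.turn for m in removed])  # [2, 3] [1]
```

Returning the removed messages separately is what keeps this stage at medium reversibility: they can be written to a log even though they leave the context.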

Stage 4 — Aggressive (90%)

What happens: older masked content is dropped entirely (not just masked). Tool call results from turns older than a recency threshold are summarized into one-line entries.

Why it works: at 90% pressure, the trade is "lose specific old details vs lose response quality on the next turn". Specific old details lose.

Reversibility: low. Summaries are not the same as content.
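
The aggressive stage could be sketched like this. The `Entry` shape, the 20-turn recency default, and especially `one_line` are assumptions: `one_line` is a crude stand-in that keeps the first non-empty line, where a real runtime would more likely call a summarizer.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Entry:
    turn: int
    role: str             # "user", "assistant", or "tool"
    content: str
    masked: bool = False  # True if an earlier stage replaced it with a placeholder


def one_line(text: str, limit: int = 80) -> str:
    """Stand-in summarizer: first non-empty line, truncated to `limit` chars."""
    first = next((ln.strip() for ln in text.splitlines() if ln.strip()), "")
    return first if len(first) <= limit else first[: limit - 3] + "..."


def aggressive(entries: List[Entry], current_turn: int, recency: int = 20) -> List[Entry]:
    out: List[Entry] = []
    for e in entries:
        if current_turn - e.turn < recency:
            out.append(e)                 # recent turns are left untouched
        elif e.masked:
            continue                      # old masked placeholder: dropped entirely
        elif e.role == "tool":
            summary = f"[turn {e.turn} tool result: {one_line(e.content)}]"
            out.append(replace(e, content=summary))
        else:
            out.append(e)                 # old user/assistant text is kept as-is
    return out
```

Note that nothing here can be undone from the context alone, which matches the low reversibility of this stage.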

Stage 5 — Compact (99%)

What happens: the runtime forks an auxiliary model to summarize the entire prior conversation into a compact reflection block. The block goes at the top of the new context window. Recent turns (~last 10) survive verbatim. Everything else is replaced.

Why it exists at all: you cannot stay in a session forever without this stage. But you should design so it almost never runs — every run is a quality drop.

Reversibility: none. This is destructive by construction.
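
A minimal sketch of the final stage, assuming the auxiliary model is hidden behind a hypothetical `summarize` callable. The `[reflection]` prefix and the function name are illustrative; the default of 10 recent turns follows the figure given above.

```python
from typing import Callable, List


def compact(
    messages: List[str],
    summarize: Callable[[List[str]], str],   # hypothetical wrapper around the auxiliary model
    keep_recent: int = 10,
) -> List[str]:
    """Replace everything except the last `keep_recent` messages with one summary block."""
    if len(messages) <= keep_recent:
        return list(messages)                # nothing old enough to summarize
    split = len(messages) - keep_recent
    older, recent = messages[:split], messages[split:]
    reflection = "[reflection] " + summarize(older)
    return [reflection] + recent             # summary at the top, recent turns verbatim


# Toy usage with a fake summarizer that only reports how much it replaced.
history = [f"message {i}" for i in range(25)]
compacted = compact(history, lambda older: f"{len(older)} earlier messages")
print(compacted[0])    # [reflection] 15 earlier messages
print(len(compacted))  # 11
```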

The empirical evidence

Factory.ai shipped a 36,000-message session study in 2026 showing that gradual compaction (their version uses 4 stages, similar shape) kept session quality at parity with fresh starts up to message 30k. The Albert Sikkema variant (2026 architectural note) reports similar numbers in a different domain.

The Chroma Context Rot study is the strongest evidence for why the gradient is necessary at all:

"Performance degrades non-uniformly as input length grows. Models show distinct drops well before the physical context limit." — Chroma, 18 models, 2025

Both findings, the quality cliff in the middle of the window and the cost of full compaction, point at the same architecture. Compact early, compact little, compact often.

What this changes in practice

Three operational implications, in order of importance:

  1. You need a token counter that runs on every turn. Not "estimate from words". Run the tokenizer for the model you are using. Off-by-30% estimates make the pipeline misfire.
  2. The runtime, not the agent, owns compaction. The agent should not call a compact tool. It should not be aware that compaction happened (except as a transparent reflection block). Agents asked to manage their own context spend tokens managing context instead of doing the task.
  3. Log every stage transition. When the agent later fails to recall something, you need to find when it was pruned and at which stage. The log is your debugger.
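
Points 1 and 3 can be combined in one small component, sketched here under assumptions. `TransitionLog` and `stage_for` are hypothetical names, and `encode` is whatever tokenizer function matches your model; the test-style usage passes `str.split` only to keep the example dependency-free, which is exactly the kind of word-based estimate a real runtime should avoid.

```python
import json
from typing import Callable, List, Optional, Sequence, Tuple

THRESHOLDS: Sequence[Tuple[str, float]] = (
    ("warning", 0.70), ("mask", 0.80), ("prune", 0.85),
    ("aggressive", 0.90), ("compact", 0.99),
)


def stage_for(usage: float) -> Optional[str]:
    """Highest stage whose threshold the usage has reached, or None."""
    reached = [name for name, threshold in THRESHOLDS if usage >= threshold]
    return reached[-1] if reached else None


class TransitionLog:
    def __init__(self, encode: Callable[[str], list], window: int):
        self.encode = encode              # the model's own tokenizer, injected by the caller
        self.window = window
        self.last_stage: Optional[str] = None
        self.records: List[str] = []      # JSON lines; a real runtime would persist these

    def observe(self, turn: int, messages: List[str]) -> Optional[str]:
        """Count tokens for this turn and record a line whenever the stage changes."""
        tokens = sum(len(self.encode(m)) for m in messages)
        stage = stage_for(tokens / self.window)
        if stage != self.last_stage:
            self.records.append(json.dumps(
                {"turn": turn, "tokens": tokens, "from": self.last_stage, "to": stage}))
            self.last_stage = stage
        return stage
```

Calling `observe` once per turn gives both the exact count and a record of when each threshold was crossed, which is the trace you need when the agent later fails to recall something.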

The unsexy bottom line

Gradual compaction is not a feature you ship in a demo. The user does not see it work. They notice when it does not work — when the agent forgets something it should not have forgotten, or when the session feels "stale" after 200 messages.

Build the pipeline early, monitor stage transitions, and tune thresholds against real session traces. 70/80/85/90/99 is a reasonable starting point for most workloads, but the right thresholds for your workload come from the data.

Compaction is one half of context engineering. The other half is what you load into the context: retrieval, recall blocks, tool schemas. We will go deeper on that connection next.