If you have read the prior posts on this blog (the five components of memory, and gradual compaction), you have seen the two halves of an agent's working brain. Memory is the durable store of what persists across sessions. Context is the ephemeral window the model actually sees on this turn.
The question that decides whether your system works is: how do those two talk to each other?
The answer, after cross-validating 12 reference implementations, is four rules. They are unsexy. They are also the difference between an agent that improves over time and one that gets worse the more it runs.
The fundamental shape
┌─────────────────────────────────────────────────────┐
│                     AGENT LOOP                      │
│                                                     │
│  ┌──────────┐    recall blocks    ┌──────────────┐  │
│  │          │ ──────────────────► │              │  │
│  │  MEMORY  │                     │   CONTEXT    │  │
│  │ (durable)│                     │ (ephemeral)  │  │
│  │          │ ◄────── NEVER ───── │              │  │
│  └──────────┘                     └──────────────┘  │
│       ▲                                  │          │
│       │ store_memory                     │          │
│       │ (explicit via tool               ▼          │
│       │  or sub-agent fold)         LLM Prompt      │
│       │                                  │          │
│       └──────────── agent ───────────────┘          │
│                  decides to                         │
│                 store_memory                        │
└─────────────────────────────────────────────────────┘
Three things to notice. Memory flows into Context as recall blocks. Context never flows back into Memory directly. The only path back is through the agent itself — by calling store_memory.
D1 — Unidirectional separation
Rule: Memory → Context. Context never writes back to Memory.
Why: if Context writes to Memory, you get feedback loops. The compaction artifacts from Stage 5 get persisted, then recalled, then re-compacted, and the system's idea of what is "true" drifts away from the real conversation history within ten sessions.
Evidence: Mem0 LoCoMo 2026 — a two-layer memory with explicit D1 achieves 91.6% accuracy vs 72.9% for full-context baselines. +18.7 percentage points, 3.7× fewer tokens, 91% less latency. The opposite failure is documented in Memory Curse (arXiv:2605.08060) — automatic context-to-memory propagation degraded cooperation in 18 of 28 model-game combinations.
Practical implication: when the runtime compacts the context (see gradual compaction), the summary stays in context. It does not get written to memory. If the agent wants the summary persisted, the agent calls store_memory explicitly.
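One way to make D1 hard to violate is structural: give the context object no handle on the memory store at all. The sketch below is a minimal illustration, not the spec's implementation. The class names, the `compact` method, and the placeholder summary string are all assumptions made up for this example; only `store_memory` comes from the rule itself.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RecallBlock:
    """A read-only snippet that memory hands to the context."""
    text: str


@dataclass
class Context:
    """Ephemeral window. It holds no reference to the memory store."""
    turns: list[str] = field(default_factory=list)
    recall_blocks: list[RecallBlock] = field(default_factory=list)

    def compact(self, keep_last: int = 2) -> None:
        """Replace older turns with a summary that stays in context only."""
        if len(self.turns) <= keep_last:
            return
        old, recent = self.turns[:-keep_last], self.turns[-keep_last:]
        # Placeholder summary; a real runtime would call a summarizer here.
        summary = f"[summary of {len(old)} earlier turns]"
        self.turns = [summary, *recent]


class Memory:
    """Durable store. The only write path is an explicit store_memory call."""

    def __init__(self) -> None:
        self._items: list[str] = []

    def store_memory(self, content: str) -> None:
        self._items.append(content)

    def recall(self, query: str) -> list[RecallBlock]:
        # Naive keyword match, standing in for real retrieval.
        words = query.lower().split()
        return [RecallBlock(t) for t in self._items
                if any(w in t.lower() for w in words)]


memory = Memory()
memory.store_memory("Maria's stack is Python")      # explicit write

ctx = Context(turns=["t1", "t2", "t3", "t4"])
ctx.recall_blocks = memory.recall("python stack")   # Memory -> Context
ctx.compact()                                       # summary stays in ctx
```

Because `Context.compact` cannot reach `Memory`, the compaction summary has nowhere to go except the context itself. Persisting it would take a separate, visible `store_memory` call.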
D2 — Explicit writes
Rule: the agent (the main LLM) decides what to persist, by calling store_memory. There is no background heuristic that auto-promotes context turns into memory.
Why: the model knows things the runtime does not. It knows that the user prefers their coffee black, but only after the user said so in this turn. It knows that the migration plan is settled, but only after three rounds of back-and-forth. A heuristic that auto-extracts "facts" cannot tell the difference between "Maria mentioned Python" (incidental) and "Maria's stack is Python" (durable).
Evidence: Anthropic Harvey case study (Dreaming, May 2026) — task completion +6×, .docx output quality +8.4%, .pptx quality +10.1% after adopting versioned explicit writes via session_actor. The opposite path is documented in TriMem (arXiv:2605.19952) — automatic fact-extraction suffers from "brevity bias", discarding fine-grained details that matter later.
Practical implication: treat store_memory as a first-class tool. Document it in the system prompt. Run evals against the model's calling pattern. If the model is not calling it, the system prompt is the bug.
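As a rough sketch of what "first-class tool" and "run evals against the calling pattern" can look like: the tool definition below uses a generic JSON-Schema shape rather than any specific provider's format, and `handle_tool_call`, `store_call_rate`, and the shape of a tool-call dict are hypothetical names invented for this example.

```python
import json

# Hypothetical tool definition in a generic JSON-Schema style.
# Adapt the envelope to whatever format your model provider expects.
STORE_MEMORY_TOOL = {
    "name": "store_memory",
    "description": (
        "Persist a durable fact the user has confirmed. "
        "Do not store incidental mentions."
    ),
    "parameters": {
        "type": "object",
        "properties": {"content": {"type": "string"}},
        "required": ["content"],
    },
}

memory_log: list[str] = []  # stand-in for the durable store


def handle_tool_call(call: dict) -> str:
    """Dispatch one tool call emitted by the model.

    Only an explicit store_memory call writes to memory; nothing else does.
    """
    if call["name"] != "store_memory":
        return "unknown tool"
    args = json.loads(call["arguments"])
    memory_log.append(args["content"])
    return "stored"


def store_call_rate(transcripts: list[list[dict]]) -> float:
    """Eval helper: fraction of transcripts with at least one store_memory call."""
    if not transcripts:
        return 0.0
    hits = sum(any(c["name"] == "store_memory" for c in t) for t in transcripts)
    return hits / len(transcripts)


# One transcript where the model decided to persist something, one where it did not.
calls = [{"name": "store_memory",
          "arguments": json.dumps({"content": "Maria's stack is Python"})}]
for call in calls:
    handle_tool_call(call)
rate = store_call_rate([calls, []])
```

A call rate near zero on sessions where the user stated durable preferences is the signal to go back and revise the system prompt.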
D3 — Multi-mechanism recall
Rule: memory injects recall blocks into context through four complementary mechanisms, not one.
The four mechanisms (ADR-15 in the spec):
- Automatic recall every N turns. A periodic injection that refreshes the agent's view of relevant memories. Catches drift in long sessions.
- On-demand recall via tool. The agent calls recall_memory(query) explicitly when it knows it needs something specific.
- Loop reload. After a long tool chain (e.g., a 20-call sub-task), the runtime reloads relevant recall blocks before the next agent turn. The agent's working memory of "what we're doing" survives the tool-call detour.
- JIT recall. When the agent touches a specific file or entity, the runtime fetches memories tagged to that entity. The agent gets the right context the moment it asks for the right thing, without a round-trip through recall_memory.
Why: any single mechanism leaves a gap. Single-shot bootstrap (only at session start) drifts after 20 turns. Tool-only recall puts the burden on the model to know what it does not know. Automatic-only floods the context with irrelevant memories.
Evidence: the original @usetheo/memory design only foresaw single-shot recall. Cross-validation against Codex CLI and Gemini CLI, both of which implement all four mechanisms, showed measurable drift in our single-shot prototype that disappeared with the multi-mechanism version. ADR-15 was strengthened by what the implementations showed, not the other way around.
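The four mechanisms can be pictured as four entry points on the runtime that all feed the same context. This is a toy sketch under stated assumptions: the `Runtime` class, its hook names, the value of N, and the 20-call threshold are illustrative choices, not taken from ADR-15, and the substring match stands in for real retrieval and ranking.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Toy store of (entity_tag, text) pairs; a real store would rank by relevance."""
    items: list[tuple[str, str]] = field(default_factory=list)

    def recall(self, query: str) -> list[str]:
        q = query.lower()
        return [text for _, text in self.items if q in text.lower()]

    def recall_by_entity(self, entity: str) -> list[str]:
        return [text for tag, text in self.items if tag == entity]


@dataclass
class Runtime:
    memory: Memory
    every_n_turns: int = 5           # assumed value for N
    long_chain_threshold: int = 20   # assumed cutoff for a "long" tool chain
    turn: int = 0
    context: list[tuple[str, str]] = field(default_factory=list)  # (mechanism, block)

    def _inject(self, mechanism: str, blocks: list[str]) -> None:
        self.context.extend((mechanism, b) for b in blocks)

    def on_turn(self, topic: str) -> None:
        """1. Automatic recall every N turns."""
        self.turn += 1
        if self.turn % self.every_n_turns == 0:
            self._inject("auto", self.memory.recall(topic))

    def recall_memory(self, query: str) -> list[str]:
        """2. On-demand recall, exposed to the agent as a tool."""
        blocks = self.memory.recall(query)
        self._inject("tool", blocks)
        return blocks

    def after_tool_chain(self, n_calls: int, topic: str) -> None:
        """3. Loop reload after a long tool chain."""
        if n_calls >= self.long_chain_threshold:
            self._inject("reload", self.memory.recall(topic))

    def on_entity_touched(self, entity: str) -> None:
        """4. JIT recall keyed on the file or entity the agent touched."""
        self._inject("jit", self.memory.recall_by_entity(entity))


memory = Memory([("auth.py", "auth.py uses JWT"),
                 ("plan", "migration plan is settled")])
rt = Runtime(memory, every_n_turns=2)
rt.on_turn("migration")               # turn 1: nothing injected
rt.on_turn("migration")               # turn 2: automatic recall fires
rt.recall_memory("jwt")               # the agent asks explicitly
rt.after_tool_chain(20, "migration")  # long detour, so reload
rt.on_entity_touched("auth.py")       # JIT, no recall_memory round-trip
```

Tagging each injected block with the mechanism that produced it is a deliberate choice here: it keeps recall traceable when several mechanisms fire in the same session.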
D4 — Memory as a provider, not a participant
Rule: Memory is a service the agent loop consumes. Memory is not an actor in the loop.
Why: when Memory becomes an actor (it has its own LLM calls in the main agent path, it decides things, it triggers events), debugging becomes intractable. Was the agent confused because the context was wrong, because the memory recall was wrong, or because the memory's own LLM made a bad call? You cannot tell.
Practical implication: Memory's LLM calls (the Auxiliary LLM from the five components post) run outside the agent loop. Write extraction happens after the turn. Dream consolidation happens during idle. Ranking happens during recall. The main agent never waits on a memory LLM call.
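A minimal sketch of that split, assuming a simple queue between the agent loop and the memory layer. `MemoryProvider`, `enqueue_turn`, and `run_idle_work` are hypothetical names, and the `extract` callable is a stand-in for whatever the Auxiliary LLM does. In this toy version recall is a plain lookup with no LLM call on the path.

```python
from collections import deque
from typing import Callable


class MemoryProvider:
    """Memory as a service: cheap reads in the loop, auxiliary work outside it."""

    def __init__(self, extract: Callable[[str], list[str]]) -> None:
        self._extract = extract               # stand-in for the Auxiliary LLM call
        self._pending: deque[str] = deque()   # turns awaiting post-turn extraction
        self.items: list[str] = []

    def recall(self, query: str) -> list[str]:
        """Called inside the agent loop. No LLM call in this sketch."""
        q = query.lower()
        return [m for m in self.items if q in m.lower()]

    def enqueue_turn(self, transcript: str) -> None:
        """Called at the end of a turn. Constant time; the agent does not wait."""
        self._pending.append(transcript)

    def run_idle_work(self) -> int:
        """Called by the runtime after the turn or during idle. Returns items written."""
        written = 0
        while self._pending:
            for fact in self._extract(self._pending.popleft()):
                self.items.append(fact)
                written += 1
        return written


extract_calls: list[str] = []


def fake_extract(transcript: str) -> list[str]:
    """Deterministic stand-in for an LLM-backed extractor."""
    extract_calls.append(transcript)
    return [transcript.upper()]


provider = MemoryProvider(fake_extract)
provider.enqueue_turn("user prefers black coffee")
before = provider.recall("coffee")    # extraction has not run yet
written = provider.run_idle_work()    # runs outside the agent turn
after = provider.recall("coffee")
```

The point of the queue is debuggability: if a recall is wrong, you can inspect what was pending and what the extractor wrote, without any of it having run in the middle of an agent turn.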
What this gets you
The four rules together produce a system where:
- The agent's behavior is traceable. If it forgot something, the trace shows whether it was never written, never recalled, or written-then-pruned.
- The memory layer is swappable. The contract is recall(query) → blocks and store(content). You can swap Postgres for any backing store without touching the agent.
- The system improves over time. Each session adds explicit memories that survive into the next. Without D2, you get drift; with D2, you get accumulation.
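The swappability claim is easiest to see when the contract is written down as a type. The sketch below assumes blocks are plain strings and uses Python's `typing.Protocol`. `MemoryStore`, `ListStore`, `SqliteStore`, and `build_prompt` are names invented for this example, with SQLite standing in for Postgres so the snippet stays self-contained.

```python
import sqlite3
from typing import Protocol


class MemoryStore(Protocol):
    """The whole contract the agent depends on."""

    def recall(self, query: str) -> list[str]: ...
    def store(self, content: str) -> None: ...


class ListStore:
    """Simplest possible backend: a Python list."""

    def __init__(self) -> None:
        self._rows: list[str] = []

    def recall(self, query: str) -> list[str]:
        return [r for r in self._rows if query.lower() in r.lower()]

    def store(self, content: str) -> None:
        self._rows.append(content)


class SqliteStore:
    """SQL-backed backend; SQLite here, but the agent cannot tell."""

    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS memories (content TEXT NOT NULL)")

    def recall(self, query: str) -> list[str]:
        # LIKE is case-insensitive for ASCII in SQLite by default.
        rows = self._db.execute(
            "SELECT content FROM memories WHERE content LIKE ? ORDER BY rowid",
            (f"%{query}%",),
        )
        return [content for (content,) in rows]

    def store(self, content: str) -> None:
        self._db.execute("INSERT INTO memories (content) VALUES (?)", (content,))
        self._db.commit()


def build_prompt(memory: MemoryStore, query: str) -> str:
    """Agent-side code sees only the contract, never the backend."""
    return "\n".join(["# Recall blocks", *memory.recall(query)])


prompts = []
for backend in (ListStore(), SqliteStore()):
    backend.store("Maria's stack is Python")
    prompts.append(build_prompt(backend, "python"))
```

Both backends yield the same prompt text, which is the property the rule is after: `build_prompt` never changes when the store does.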
Where the rules came from
This is not an invented framework. It is what falls out of comparing twelve implementations side by side: Mem0, Letta, Zep, Honcho, LangMem, MemGPT, Cognee, A-MEM, and four internal systems (Anthropic, OpenAI, JetBrains, Codex). Every implementation that holds up in production respects D1 and D2. The strongest also respect D3 and D4.
There is a fifth rule we are not yet sure about — call it D5, on cross-session memory sharing between users. We do not have enough cross-validation to write it down. When we do, it will land here.
The full ADR with failure modes lives in the spec repo. Push back, build on this, or tell us where we are wrong in Discord.
