A long-running agent's working context grows turn by turn. Hit the model's context limit and you are forced into a bad choice: drop history (lose continuity) or summarise (lose detail). Six compression techniques mitigate the trade-off. Here is when to use each.
When compression matters
Three failure modes that compression solves:
- Context overflow — the conversation exceeds the model limit.
- Cost growth — every turn pays input tokens for the whole history.
- Latency growth — long inputs slow first-token time.
For short interactions, compression is unnecessary. Past 20 turns or 50k tokens, it is mandatory.
The six techniques
1. Summary chains
After every K turns, replace the K-turn block with a 200-token summary. Repeat at the meta-level for very long sessions.
- Strengths: simple; preserves narrative continuity.
- Weaknesses: loses fine detail; one bad summary cascades.
- Pick when: chat-heavy agents.
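A minimal sketch of a summary chain, where `summarise` is a stub standing in for an LLM call (an assumption; in production it would carry the summarisation prompt shown later):

```python
K = 10  # fold the history every K turns

def summarise(text: str, max_tokens: int = 200) -> str:
    # Placeholder: in production this is an LLM call with a summarisation prompt.
    return f"[summary of {len(text.split())} words]"

def compress(history: list[str]) -> list[str]:
    """Replace the oldest K raw turns with one summary turn until the
    history is back under budget; recent turns stay verbatim."""
    while len(history) > 2 * K:
        block, history = history[:K], history[K:]
        history.insert(0, summarise("\n".join(block)))
    return history

compressed = compress([f"turn {i}" for i in range(25)])
```

Because summaries re-enter the history like ordinary turns, the same loop handles the meta-level summarisation for very long sessions.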
2. Hierarchical recall
Keep the last N turns verbatim, the previous 10N as summaries, and the 100N before that as topic tags. At each turn, retrieve at whichever level of detail the current query needs.
- Strengths: detail near, abstraction far.
- Weaknesses: more storage, more retrieval logic.
- Pick when: very long sessions, mixed-detail needs.
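The tiering itself is straightforward; a sketch with N = 4 and stubbed `summarise`/`tag` helpers (both assumptions for the real LLM and tagging calls):

```python
N = 4  # turns kept verbatim

def summarise(turn: str) -> str:
    return f"summary({turn})"  # placeholder for an LLM call

def tag(turn: str) -> str:
    return f"tag({turn})"      # placeholder for topic extraction

def tier(history: list[str]) -> dict[str, list[str]]:
    """Split history into verbatim / summary / tag tiers, newest last."""
    return {
        "verbatim": history[-N:],
        "summaries": [summarise(t) for t in history[-10 * N:-N]],
        "tags": [tag(t) for t in history[:-10 * N]],
    }

tiers = tier([f"turn {i}" for i in range(100)])
```

The retrieval logic on top of this (deciding which tier to pull from per turn) is where the real complexity lives.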
3. Semantic collapse
Cluster similar turns; keep one representative per cluster. Restore on demand if the agent asks for the cluster's content.
- Strengths: efficient for repetitive conversations.
- Weaknesses: complex; requires good embedding similarity.
- Pick when: agents that revisit the same topics repeatedly.
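A toy version of the clustering step, using bag-of-words cosine similarity in place of real embeddings (an assumption; production systems would use an embedding model):

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    return Counter(text.lower().split())  # toy stand-in for a real embedding

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def collapse(turns: list[str], threshold: float = 0.8) -> list[str]:
    """Greedy clustering: keep the first turn of each cluster, drop near-duplicates."""
    reps: list[tuple[str, Counter]] = []
    for t in turns:
        e = embed(t)
        if not any(cosine(e, rep_e) >= threshold for _, rep_e in reps):
            reps.append((t, e))
    return [t for t, _ in reps]

survivors = collapse(["deploy the api", "deploy the api",
                      "check billing", "deploy the api now"])
```

Restore-on-demand would additionally record which dropped turns belong to each representative, so the full cluster can be reinserted if the agent asks.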
4. Token pruning
Drop tokens unlikely to influence the next response: filler words, redundant phrases, low-information acknowledgements.
- Strengths: cheap, immediate.
- Weaknesses: modest savings (10–20%).
- Pick when: stretching a fixed budget.
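A minimal pruner; the filler list is illustrative (an assumption), not exhaustive:

```python
import re

# Low-information words and phrases to strip; extend per domain.
FILLER = re.compile(
    r"\b(um|uh|you know|basically|actually|really|kind of|sort of|just)\b",
    re.IGNORECASE,
)

def prune(text: str) -> str:
    """Remove filler words, then collapse the leftover whitespace."""
    return re.sub(r"\s{2,}", " ", FILLER.sub("", text)).strip()

print(prune("So basically I just really want the report"))
```

Note the caveat from the mistakes section below: a pruner like this must never run over structured content such as JSON tool arguments.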
5. Sliding window
Keep only the last K turns. Older history is gone.
- Strengths: zero compression cost; predictable.
- Weaknesses: loses long-range context; not suitable for memory-heavy use.
- Pick when: task-focused short interactions.
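In Python the whole technique is a bounded deque; eviction of the oldest turn is automatic:

```python
from collections import deque

K = 20
window: deque[str] = deque(maxlen=K)  # oldest turn is dropped on overflow

for i in range(100):
    window.append(f"turn {i}")

# Only the last K turns survive; everything older is gone.
```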
6. Embedding compression
Replace verbose tool results with their embeddings + a compact summary. Restore on demand.
- Strengths: large savings on tool-heavy turns.
- Weaknesses: model cannot reason directly over embeddings.
- Pick when: RAG-heavy agents with large retrieval results.
Combining techniques
Most production setups stack three:
- Sliding window for the most recent turns.
- Summary chain for the middle.
- Hierarchical recall for the long tail.
Token pruning runs at every layer; embedding compression runs on tool-result blocks specifically.
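The three-layer stack above can be sketched as one context-builder; the tier sizes and the `summarise`/`tag` stubs are assumptions standing in for the real summary-chain and recall machinery:

```python
RECENT, MID = 10, 50  # verbatim window, then summary territory

def summarise(turns: list[str]) -> str:
    return f"[summary of {len(turns)} turns]"   # placeholder LLM call

def tag(turns: list[str]) -> str:
    return f"[tags for {len(turns)} turns]"     # placeholder tagger

def build_context(history: list[str]) -> list[str]:
    recent = history[-RECENT:]        # sliding window, verbatim
    middle = history[-MID:-RECENT]    # summary chain
    tail = history[:-MID]             # long tail as topic tags
    context = []
    if tail:
        context.append(tag(tail))
    if middle:
        context.append(summarise(middle))
    return context + recent

ctx = build_context([f"turn {i}" for i in range(100)])
```

Token pruning and embedding compression would then run as passes over the strings this function emits.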
The structure of a good summary
A summarisation prompt that survives chained compression:
Summarise this conversation segment in 150 tokens. Preserve:
- Decisions made
- Open questions
- User preferences and constraints
- Names and identifiers mentioned
Drop:
- Pleasantries
- Restated information
- Background reasoning that did not lead to a decision
Without explicit "preserve" guidance, summaries lose the high-value bits first.
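Keeping the preserve/drop lists in one place means every layer of a chained compression uses the same guidance; a sketch of the prompt builder:

```python
PRESERVE = [
    "Decisions made",
    "Open questions",
    "User preferences and constraints",
    "Names and identifiers mentioned",
]
DROP = [
    "Pleasantries",
    "Restated information",
    "Background reasoning that did not lead to a decision",
]

def summary_prompt(segment: str, budget: int = 150) -> str:
    """Assemble the summarisation prompt around a conversation segment."""
    preserve = "\n".join(f"- {p}" for p in PRESERVE)
    drop = "\n".join(f"- {d}" for d in DROP)
    return (
        f"Summarise this conversation segment in {budget} tokens. Preserve:\n"
        f"{preserve}\nDrop:\n{drop}\n\nSegment:\n{segment}"
    )
```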
Measuring compression quality
Two metrics matter:
- Information retention — measured by an eval set: questions about the original session, asked of the compressed context.
- Token savings — input tokens before vs after.
Target: 80% information retention at 30% original size. Below 70% retention, compression is too aggressive.
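A retention eval reduces to scoring question-answer pairs against the compressed context; here `answer` is a trivial substring stub (an assumption) standing in for a model call:

```python
def answer(context: str, question: str) -> str:
    # Placeholder for an LLM call grounded only in `context`.
    key = question.split()[-1].rstrip("?")
    return "yes" if key in context else "unknown"

def retention(compressed: str, qa_pairs: list[tuple[str, str]]) -> float:
    """Fraction of eval questions answered correctly from the compressed context."""
    correct = sum(answer(compressed, q) == a for q, a in qa_pairs)
    return correct / len(qa_pairs)

qa = [
    ("Did the user mention Postgres?", "yes"),
    ("Did the user mention Redis?", "yes"),
]
score = retention("User wants Postgres, budget $50", qa)
```

Token savings is the easy half: count input tokens before and after and take the ratio.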
Cost model
For a 100-turn agent session at 500 tokens per turn (50k uncompressed):
| Technique | Compressed size | Compression cost |
|---|---|---|
| Sliding window (last 20) | 10k | 0 |
| Summary chain (every 20) | 12k | 1 Haiku call per chunk |
| Hierarchical recall | 6k | 2-3 Haiku calls per chunk |
| Embedding compression | 8k | embedding cost only |
Summary chains win the cost-quality balance in most cases.
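Sanity-checking the table's headline numbers for the sliding-window row:

```python
TURNS, PER_TURN, WINDOW = 100, 500, 20

uncompressed = TURNS * PER_TURN           # 50_000 tokens
window_size = WINDOW * PER_TURN           # 10_000 tokens
savings = 1 - window_size / uncompressed  # an 80% reduction, at zero cost
```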
Where embedding compression fits
When tool calls return huge results (a 10k-row SQL dump), embedding the result and storing both the embedding and a 100-token summary lets the model:
- See the summary in context.
- Retrieve the full result if it needs to reason over rows.
Saves 90%+ tokens on tool-heavy turns. Pairs well with retrieval-augmented memory.
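The store-and-restore half of this can be sketched without the embedding model (stubbed out below as an assumption): the full payload leaves the context, and a short stub tells the model how to get it back:

```python
store: dict[str, dict] = {}  # out-of-context storage for full tool results

def compress_tool_result(result_id: str, payload: str, summary: str) -> str:
    """Stash the full payload; return the compact in-context stub."""
    store[result_id] = {
        "payload": payload,
        "embedding": None,  # real system: embed(payload) for similarity lookup
    }
    return f"[tool result {result_id}: {summary}; restore('{result_id}') for full data]"

def restore(result_id: str) -> str:
    return store[result_id]["payload"]

stub = compress_tool_result("sql-42", "row1\nrow2\n(10k rows omitted)",
                            "10k rows, 3 columns")
```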
Common mistakes
- One-shot summarisation at the limit — quality cliff; compress incrementally.
- No "preserve" guidance — summary loses the important bits.
- Token pruning that breaks JSON — pruners must respect structure.
- Sliding window for memory-heavy agents — they forget the user.
Where this is heading
Two trends by 2027: native compression primitives in the Claude Agent SDK (declare a compression policy, the SDK applies it), and longer context windows that move some of these problems into "less critical" territory. Until then, the techniques above are the working toolkit.