A long-running agent's working context grows turn by turn. Hit the model's context limit and you are forced into a bad choice: drop history (lose continuity) or summarise (lose detail). Six compression techniques mitigate the trade-off. Here is when to use each.
When compression matters
Three failure modes that compression solves:
- Context overflow — the conversation exceeds the model limit.
- Cost growth — every turn pays input tokens for the whole history.
- Latency growth — long inputs slow first-token time.
For short interactions, compression is unnecessary. Past 20 turns or 50k tokens, it is mandatory.
The six techniques
1. Summary chains
After every K turns, replace the K-turn block with a 200-token summary. Repeat at the meta-level for very long sessions.
- Strengths: simple; preserves narrative continuity.
- Weaknesses: loses fine detail; one bad summary cascades.
- Pick when: chat-heavy agents.
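A minimal sketch of a summary chain, where `summarise` is a stub standing in for an LLM call (an assumption; in production it would carry the summarisation prompt shown later):

```python
K = 10  # fold the history every K turns

def summarise(text: str, max_tokens: int = 200) -> str:
    # Placeholder: in production this is an LLM call with a summarisation prompt.
    return f"[summary of {len(text.split())} words]"

def compress(history: list[str]) -> list[str]:
    """Replace the oldest K raw turns with one summary turn until the
    history is back under budget; recent turns stay verbatim."""
    while len(history) > 2 * K:
        block, history = history[:K], history[K:]
        history.insert(0, summarise("\n".join(block)))
    return history

compressed = compress([f"turn {i}" for i in range(25)])
```

Because summaries re-enter the history like ordinary turns, the same loop handles the meta-level summarisation for very long sessions.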
2. Hierarchical recall
Keep the last N turns verbatim, the previous 10N as summaries, and the 100N before that as topic tags. At each turn, retrieve at whichever level of detail the current query needs.
- Strengths: detail near, abstraction far.
- Weaknesses: more storage, more retrieval logic.
- Pick when: very long sessions, mixed-detail needs.
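The tiering itself is straightforward; a sketch with N = 4 and stubbed `summarise`/`tag` helpers (both assumptions for the real LLM and tagging calls):

```python
N = 4  # turns kept verbatim

def summarise(turn: str) -> str:
    return f"summary({turn})"  # placeholder for an LLM call

def tag(turn: str) -> str:
    return f"tag({turn})"      # placeholder for topic extraction

def tier(history: list[str]) -> dict[str, list[str]]:
    """Split history into verbatim / summary / tag tiers, newest last."""
    return {
        "verbatim": history[-N:],
        "summaries": [summarise(t) for t in history[-10 * N:-N]],
        "tags": [tag(t) for t in history[:-10 * N]],
    }

tiers = tier([f"turn {i}" for i in range(100)])
```

The retrieval logic on top of this (deciding which tier to pull from per turn) is where the real complexity lives.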
3. Semantic collapse
Cluster similar turns; keep one representative per cluster. Restore on demand if the agent asks for the cluster's content.
- Strengths: efficient for repetitive conversations.
- Weaknesses: complex; requires good embedding similarity.
- Pick when: agents that revisit the same topics repeatedly.
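A toy version of the clustering step, using bag-of-words cosine similarity in place of real embeddings (an assumption; production systems would use an embedding model):

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    return Counter(text.lower().split())  # toy stand-in for a real embedding

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def collapse(turns: list[str], threshold: float = 0.8) -> list[str]:
    """Greedy clustering: keep the first turn of each cluster, drop near-duplicates."""
    reps: list[tuple[str, Counter]] = []
    for t in turns:
        e = embed(t)
        if not any(cosine(e, rep_e) >= threshold for _, rep_e in reps):
            reps.append((t, e))
    return [t for t, _ in reps]

survivors = collapse(["deploy the api", "deploy the api",
                      "check billing", "deploy the api now"])
```

Restore-on-demand would additionally record which dropped turns belong to each representative, so the full cluster can be reinserted if the agent asks.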
4. Token pruning
Drop tokens unlikely to influence the next response: filler words, redundant phrases, low-information acknowledgements.
- Strengths: cheap, immediate.
- Weaknesses: modest savings (10–20%).
- Pick when: stretching a fixed budget.
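A minimal pruner; the filler list is illustrative (an assumption), not exhaustive:

```python
import re

# Low-information words and phrases to strip; extend per domain.
FILLER = re.compile(
    r"\b(um|uh|you know|basically|actually|really|kind of|sort of|just)\b",
    re.IGNORECASE,
)

def prune(text: str) -> str:
    """Remove filler words, then collapse the leftover whitespace."""
    return re.sub(r"\s{2,}", " ", FILLER.sub("", text)).strip()

print(prune("So basically I just really want the report"))
```

Note the caveat from the mistakes section below: a pruner like this must never run over structured content such as JSON tool arguments.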
5. Sliding window
Keep only the last K turns. Older history is gone.
- Strengths: zero compression cost; predictable.
- Weaknesses: loses long-range context; not suitable for memory-heavy use.
- Pick when: task-focused short interactions.
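In Python the whole technique is a bounded deque; eviction of the oldest turn is automatic:

```python
from collections import deque

K = 20
window: deque[str] = deque(maxlen=K)  # oldest turn is dropped on overflow

for i in range(100):
    window.append(f"turn {i}")

# Only the last K turns survive; everything older is gone.
```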
6. Embedding compression
Replace verbose tool results with their embeddings + a compact summary. Restore on demand.
- Strengths: large savings on tool-heavy turns.
- Weaknesses: model cannot reason directly over embeddings.
- Pick when: RAG-heavy agents with large retrieval results.
Combining techniques
Most production setups stack three:
- Sliding window for the most recent turns.
- Summary chain for the middle.
- Hierarchical recall for the long tail.
Token pruning runs at every layer; embedding compression runs on tool-result blocks specifically.
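The three-layer stack above can be sketched as one context-builder; the tier sizes and the `summarise`/`tag` stubs are assumptions standing in for the real summary-chain and recall machinery:

```python
RECENT, MID = 10, 50  # verbatim window, then summary territory

def summarise(turns: list[str]) -> str:
    return f"[summary of {len(turns)} turns]"   # placeholder LLM call

def tag(turns: list[str]) -> str:
    return f"[tags for {len(turns)} turns]"     # placeholder tagger

def build_context(history: list[str]) -> list[str]:
    recent = history[-RECENT:]        # sliding window, verbatim
    middle = history[-MID:-RECENT]    # summary chain
    tail = history[:-MID]             # long tail as topic tags
    context = []
    if tail:
        context.append(tag(tail))
    if middle:
        context.append(summarise(middle))
    return context + recent

ctx = build_context([f"turn {i}" for i in range(100)])
```

Token pruning and embedding compression would then run as passes over the strings this function emits.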
The structure of a good summary
A summarisation prompt that survives chained compression:
Summarise this conversation segment in 150 tokens. Preserve:
- Decisions made
- Open questions
- User preferences and constraints
- Names and identifiers mentioned
Drop:
- Pleasantries
- Restated information
- Background reasoning that did not lead to a decision
Without explicit "preserve" guidance, summaries lose the high-value bits first.
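Keeping the preserve/drop lists in one place means every layer of a chained compression uses the same guidance; a sketch of the prompt builder:

```python
PRESERVE = [
    "Decisions made",
    "Open questions",
    "User preferences and constraints",
    "Names and identifiers mentioned",
]
DROP = [
    "Pleasantries",
    "Restated information",
    "Background reasoning that did not lead to a decision",
]

def summary_prompt(segment: str, budget: int = 150) -> str:
    """Assemble the summarisation prompt around a conversation segment."""
    preserve = "\n".join(f"- {p}" for p in PRESERVE)
    drop = "\n".join(f"- {d}" for d in DROP)
    return (
        f"Summarise this conversation segment in {budget} tokens. Preserve:\n"
        f"{preserve}\nDrop:\n{drop}\n\nSegment:\n{segment}"
    )
```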
Measuring compression quality
Two metrics matter:
- Information retention — measured by an eval set: questions about the original session, asked of the compressed context.
- Token savings — input tokens before vs after.
Target: 80% information retention at 30% original size. Below 70% retention, compression is too aggressive.
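A retention eval reduces to scoring question-answer pairs against the compressed context; here `answer` is a trivial substring stub (an assumption) standing in for a model call:

```python
def answer(context: str, question: str) -> str:
    # Placeholder for an LLM call grounded only in `context`.
    key = question.split()[-1].rstrip("?")
    return "yes" if key in context else "unknown"

def retention(compressed: str, qa_pairs: list[tuple[str, str]]) -> float:
    """Fraction of eval questions answered correctly from the compressed context."""
    correct = sum(answer(compressed, q) == a for q, a in qa_pairs)
    return correct / len(qa_pairs)

qa = [
    ("Did the user mention Postgres?", "yes"),
    ("Did the user mention Redis?", "yes"),
]
score = retention("User wants Postgres, budget $50", qa)
```

Token savings is the easy half: count input tokens before and after and take the ratio.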
Cost model
For a 100-turn agent session at 500 tokens per turn (50k uncompressed):
| Technique | Compressed size | Compression cost |
|---|---|---|
| Sliding window (last 20) | 10k | 0 |
| Summary chain (every 20) | 12k | 1 Haiku call per chunk |
| Hierarchical recall | 6k | 2-3 Haiku calls per chunk |
| Embedding compression | 8k | embedding cost only |
Summary chains win the cost-quality balance in most cases.
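Sanity-checking the table's headline numbers for the sliding-window row:

```python
TURNS, PER_TURN, WINDOW = 100, 500, 20

uncompressed = TURNS * PER_TURN           # 50_000 tokens
window_size = WINDOW * PER_TURN           # 10_000 tokens
savings = 1 - window_size / uncompressed  # an 80% reduction, at zero cost
```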
Where embedding compression fits
When tool calls return huge results (a 10k-row SQL dump), embedding the result and storing both the embedding and a 100-token summary lets the model:
- See the summary in context.
- Retrieve the full result if it needs to reason over rows.
Saves 90%+ tokens on tool-heavy turns. Pairs well with retrieval-augmented memory.
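The store-and-restore half of this can be sketched without the embedding model (stubbed out below as an assumption): the full payload leaves the context, and a short stub tells the model how to get it back:

```python
store: dict[str, dict] = {}  # out-of-context storage for full tool results

def compress_tool_result(result_id: str, payload: str, summary: str) -> str:
    """Stash the full payload; return the compact in-context stub."""
    store[result_id] = {
        "payload": payload,
        "embedding": None,  # real system: embed(payload) for similarity lookup
    }
    return f"[tool result {result_id}: {summary}; restore('{result_id}') for full data]"

def restore(result_id: str) -> str:
    return store[result_id]["payload"]

stub = compress_tool_result("sql-42", "row1\nrow2\n(10k rows omitted)",
                            "10k rows, 3 columns")
```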
Common mistakes
- One-shot summarisation at the limit — quality cliff; compress incrementally.
- No "preserve" guidance — summary loses the important bits.
- Token pruning that breaks JSON — pruners must respect structure.
- Sliding window for memory-heavy agents — they forget the user.
Where this is heading
Two trends by 2027: native compression primitives in the Claude Agent SDK (declare a compression policy, the SDK applies it), and longer context windows that move some of these problems into "less critical" territory. Until then, the techniques above are the working toolkit.