
Agent memory compression techniques: keeping context small without losing the thread

A long-running agent's context grows until something breaks. Six compression techniques — summary chains, hierarchical recall, semantic collapse, token pruning, sliding windows, and embedding compression — with the trade-offs.

A long-running agent's working context grows turn by turn. Hit the model's context limit and you are forced into a bad choice: drop history (lose continuity) or summarise (lose detail). Six compression techniques mitigate the trade-off. Here is when to use each.

When compression matters

Three failure modes that compression solves:

  • Context overflow — the conversation exceeds the model limit.
  • Cost growth — every turn pays input tokens for the whole history.
  • Latency growth — long inputs slow first-token time.

For short interactions, compression is unnecessary. Past 20 turns or 50k tokens, it is mandatory.

The six techniques

1. Summary chains

After every K turns, replace the K-turn block with a 200-token summary. Repeat at the meta-level for very long sessions.

  • Strengths: simple; preserves narrative continuity.
  • Weaknesses: loses fine detail; one bad summary cascades.
  • Pick when: chat-heavy agents.
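
A minimal sketch of the loop, assuming turns are {"role", "content"} dicts and summarise() is any call to a small model (one possible implementation appears in the summary-prompt section below); K = 10 is illustrative:

# Hypothetical sketch: fold every K turns into one ~200-token summary.
K = 10

def compress_history(history, summarise):
    compressed, block = [], []
    for turn in history:
        block.append(turn)
        if len(block) == K:
            summary = summarise(block)
            compressed.append({"role": "assistant",
                               "content": f"[summary of {K} turns] {summary}"})
            block = []
    return compressed + block  # a trailing partial block stays verbatim

For the meta-level, feed the summary messages back through the same function once they pile up.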

2. Hierarchical recall

Keep the last N turns verbatim, the last 10N as summaries, and the last 100N as topic tags. Pull from whichever tier best matches what the current turn needs.

  • Strengths: detail near, abstraction far.
  • Weaknesses: more storage, more retrieval logic.
  • Pick when: very long sessions, mixed-detail needs.
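
A sketch of the three tiers with N = 5; summarise() and tag() are assumed helpers, and a keyword match stands in for a real relevance model:

N = 5  # verbatim tier; summaries cover 10N turns, topic tags cover 100N

def build_tiers(history, summarise, tag):
    verbatim = history[-N:]
    summaries = [summarise([t]) for t in history[-10 * N:-N]]
    tags = [tag(t) for t in history[-100 * N:-10 * N]]
    return verbatim, summaries, tags

def recall(query, tiers):
    verbatim, summaries, tags = tiers
    # Toy relevance test: surface a far-history topic only when the
    # current query mentions its tag.
    hits = [t for t in tags if t.lower() in query.lower()]
    return {"verbatim": verbatim, "summaries": summaries, "topics": hits}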

3. Semantic collapse

Cluster similar turns; keep one representative per cluster. Restore on demand if the agent asks for the cluster's content.

  • Strengths: efficient for repetitive conversations.
  • Weaknesses: complex; requires good embedding similarity.
  • Pick when: agents that revisit the same topics repeatedly.
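
A greedy sketch using cosine similarity over per-turn embeddings; embed() is a stand-in for any embedding call, and the 0.9 threshold is illustrative:

import numpy as np

def collapse(turns, embed, threshold=0.9):
    kept, dropped, vecs = [], {}, []
    for turn in turns:
        v = np.asarray(embed(turn["content"]), dtype=float)
        v /= np.linalg.norm(v)
        for i, kv in enumerate(vecs):
            if float(v @ kv) > threshold:   # same cluster: drop, keep restorable
                dropped.setdefault(i, []).append(turn)
                break
        else:
            vecs.append(v)
            kept.append(turn)               # cluster representative
    return kept, dropped  # dropped[i] restores cluster i on demand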

4. Token pruning

Drop tokens unlikely to influence the next response: filler words, redundant phrases, low-information acknowledgements.

  • Strengths: cheap, immediate.
  • Weaknesses: modest savings (10–20%).
  • Pick when: stretching a fixed budget.
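
A rule-based sketch; the filler and acknowledgement lists are illustrative, and the guard against structured content matters (see the mistakes list below):

import re

FILLERS = re.compile(r"\b(basically|actually|you know|I mean|sort of|kind of)\b",
                     re.IGNORECASE)
ACKS = {"ok", "okay", "thanks", "thank you", "got it", "sounds good"}

def prune(text):
    stripped = text.strip()
    if stripped.startswith(("{", "[")):        # likely JSON: leave intact
        return text
    if stripped.lower().rstrip(".!") in ACKS:  # low-information acknowledgement
        return ""
    return re.sub(r"\s{2,}", " ", FILLERS.sub("", text)).strip()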

5. Sliding window

Keep only the last K turns. Older history is gone.

  • Strengths: zero compression cost; predictable.
  • Weaknesses: loses long-range context; not suitable for memory-heavy use.
  • Pick when: task-focused short interactions.
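
The entire technique fits in a few lines; pinning the system prompt is a common refinement, and K = 20 matches the cost table below:

K = 20

def window(history):
    # Pin the system prompt, keep only the last K turns of dialogue.
    return history[:1] + history[1:][-K:]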

6. Embedding compression

Replace verbose tool results with their embeddings + a compact summary. Restore on demand.

  • Strengths: large savings on tool-heavy turns.
  • Weaknesses: model cannot reason directly over embeddings.
  • Pick when: RAG-heavy agents with large retrieval results.
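
A sketch of the replace step, assuming a store with put()/get() (hypothetical) plus embed() and summarise_text() stand-ins; the restore path is shown in the SQL-dump example near the end:

def compress_tool_result(result, store, embed, summarise_text):
    """Swap a verbose tool result for a ~100-token stub."""
    key = store.put(result, embed(result))  # full text + vector live off-context
    return f"[tool result {key}; expandable] {summarise_text(result)}"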

Combining techniques

Most production setups stack three:

  • Sliding window for the most recent turns.
  • Summary chain for the middle.
  • Hierarchical recall for the long tail.

Token pruning runs at every layer; embedding compression runs on tool-result blocks specifically.
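
One way the stack might compose, reusing compress_history() and prune() from the sketches above; the 20/100-turn boundaries are illustrative, and recall_tail() stands in for the hierarchical layer:

def build_context(history, summarise, recall_tail, query):
    recent = history[-20:]                                    # sliding window
    middle = compress_history(history[-100:-20], summarise)   # summary chain
    tail = recall_tail(query, history[:-100])                 # long tail
    pruned = []
    for t in tail + middle + recent:
        text = prune(t["content"])
        if text:                       # drop turns pruned to nothing
            pruned.append({**t, "content": text})
    return pruned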

The structure of a good summary

A summarisation prompt that survives chained compression:

Summarise this conversation segment in 150 tokens. Preserve:
- Decisions made
- Open questions
- User preferences and constraints
- Names and identifiers mentioned

Drop:
- Pleasantries
- Restated information
- Background reasoning that did not lead to a decision

Without explicit "preserve" guidance, summaries lose the high-value bits first.
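
Wired into the Anthropic Python SDK, the summarise() helper used in the sketches above might look like this; the model name and 200-token cap are illustrative choices, not requirements:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

PROMPT = """Summarise this conversation segment in 150 tokens. Preserve:
- Decisions made
- Open questions
- User preferences and constraints
- Names and identifiers mentioned

Drop:
- Pleasantries
- Restated information
- Background reasoning that did not lead to a decision

Segment:
{segment}"""

def summarise(turns):
    segment = "\n".join(f"{t['role']}: {t['content']}" for t in turns)
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # small, cheap summariser
        max_tokens=200,
        messages=[{"role": "user", "content": PROMPT.format(segment=segment)}],
    )
    return response.content[0].text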

Measuring compression quality

Two metrics matter:

  • Information retention — measured by an eval set: questions about the original session, asked of the compressed context.
  • Token savings — input tokens before vs after.

Target: 80% information retention at 30% original size. Below 70% retention, compression is too aggressive.
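
A sketch of both metrics; answer() and grade() stand in for an answering call against the compressed context and an exact-match or LLM-judge verdict:

def retention(questions, answer, grade, compressed_ctx):
    """Fraction of questions about the original session still answered
    correctly from the compressed context alone."""
    hits = sum(grade(q, answer(q, compressed_ctx)) for q in questions)
    return hits / len(questions)

def token_savings(original_tokens, compressed_tokens):
    return 1 - compressed_tokens / original_tokens

# Target from above: retention >= 0.80 with token_savings >= 0.70.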

Cost model

For a 100-turn agent session at 500 tokens per turn (50k uncompressed):

Technique                   Compressed size   Compression cost
Sliding window (last 20)    10k               0
Summary chain (every 20)    12k               1 Haiku call per chunk
Hierarchical recall         6k                2-3 Haiku calls per chunk
Embedding compression       8k                embedding cost only

Summary chains win on cost-quality balance in most cases.

Where embedding compression fits

When tool calls return huge results (a 10k-row SQL dump), embedding the result and storing both the embedding and a 100-token summary lets the model:

  • See the summary in context.
  • Retrieve the full result if it needs to reason over rows.

Saves 90%+ tokens on tool-heavy turns. Pairs well with retrieval-augmented memory.
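
A sketch of the restore path, assuming the store from the embedding-compression sketch and the Anthropic tool-definition shape; the tool name fetch_tool_result is hypothetical:

FETCH_TOOL = {
    "name": "fetch_tool_result",
    "description": "Return the full text of a compressed tool result by its id.",
    "input_schema": {
        "type": "object",
        "properties": {"id": {"type": "string"}},
        "required": ["id"],
    },
}

def fetch_tool_result(tool_id, store):
    # Runs when the model decides the summary is not enough and it
    # needs to reason over the full rows.
    return store.get(tool_id)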

Common mistakes

  • One-shot summarisation at the limit — quality cliff; compress incrementally.
  • No "preserve" guidance — summary loses the important bits.
  • Token pruning that breaks JSON — pruners must respect structure.
  • Sliding window for memory-heavy agents — they forget the user.

Where this is heading

Two trends by 2027: native compression primitives in the Claude Agent SDK (declare a compression policy, the SDK applies it), and longer context windows that move some of these problems into "less critical" territory. Until then, the techniques above are the working toolkit.
