Explainer · 7 min read

Persistent agent memory architecture: a reference design that scales

A reference architecture for agent memory that scales from 10 users to 10 million. Storage layers, retrieval strategy, privacy, and cost.

Cross-session memory is the architectural difference between a demo and a product. Going from "it works for me" to "it works for 10 million users" requires a storage layout, an ingestion path, and a deletion story that are all non-obvious. Here is a reference design we use in production.

The four storage layers

A scalable memory system does not live in a single store. Four layers each do something the others cannot:

  1. Hot working memory: Redis or an in-process map. Holds current conversation state, readable within milliseconds.
  2. Semantic facts: a Postgres table of small key-value facts per user (preferences, handles, long-term context).
  3. Episodic log: session summaries in Postgres with full-text search, indexed by time and topic.
  4. Vector index: pgvector, Pinecone, or Qdrant for semantic search across episodes.

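Each layer serves a different access pattern: key lookup for working memory and facts, time and topic filters for the episodic log, and similarity search for the vector index. Retrieval combines all of them.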
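As a rough illustration, the records held by each layer might look like the following. This is a minimal sketch, not the production schema: every class and field name here is a hypothetical choice made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class WorkingMemory:
    """Hot per-conversation state; a plain list stands in for Redis here."""
    turns: list[str] = field(default_factory=list)


@dataclass
class Fact:
    """One row in the semantic facts layer (hypothetical shape)."""
    tenant_id: str
    user_id: str
    key: str            # e.g. "city"
    value: str          # e.g. "Berlin"
    updated_at: datetime


@dataclass
class Episode:
    """One row in the episodic log: a session summary plus metadata."""
    episode_id: str
    tenant_id: str
    user_id: str
    summary: str
    topics: list[str]
    created_at: datetime


@dataclass
class VectorEntry:
    """One entry in the vector index, keyed by episode id."""
    episode_id: str
    tenant_id: str
    user_id: str
    embedding: list[float]


# Example: a single semantic fact for one user of one tenant.
fact = Fact("acme", "u1", "city", "Berlin", datetime.now(timezone.utc))
```

Keeping tenant_id and user_id on every persistent record is what lets the isolation and deletion requirements later in this article be enforced uniformly.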

The ingestion path

At the end of each session:

  1. A summariser agent compresses the transcript into a summary of roughly 200 tokens plus a JSON facts bundle.
  2. The facts bundle merges into the semantic store with contradiction detection.
  3. The summary goes to the episodic log with a timestamp and topic tags.
  4. The embedding of the summary goes to the vector index, keyed by episode id.
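The four steps above can be sketched with in-memory dictionaries standing in for the real stores. This is an illustration under stated assumptions: the function names, the dictionary layout, and the "newest value wins" contradiction rule are all hypothetical, and the summary, facts, and embedding are assumed to have been produced already by the summariser and an embedding model.

```python
from datetime import datetime, timezone


def merge_facts(store: dict, incoming: dict, now: datetime) -> list:
    """Merge a facts bundle into the semantic store.

    A contradiction is an existing key whose value differs from the incoming
    one. In this sketch the newer value replaces the older one, and the
    conflict is returned so the caller can log or review it.
    """
    contradictions = []
    for key, value in incoming.items():
        existing = store.get(key)
        if existing is not None and existing["value"] != value:
            contradictions.append((key, existing["value"], value))
        store[key] = {"value": value, "updated_at": now}
    return contradictions


def ingest_session(memory: dict, episode_id: str, summary: str,
                   facts: dict, topics: list, embedding: list) -> list:
    """Write one finished session into the three persistent layers."""
    now = datetime.now(timezone.utc)
    conflicts = merge_facts(memory["facts"], facts, now)       # step 2
    memory["episodes"][episode_id] = {                         # step 3
        "summary": summary, "topics": topics, "created_at": now,
    }
    memory["vectors"][episode_id] = embedding                  # step 4
    return conflicts
```

For example, ingesting one session with the fact {"city": "Munich"} and a later one with {"city": "Berlin"} leaves a single city fact in the store and reports the old and new values as a conflict, rather than keeping both.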

Latency target: end-of-session processing under 5 seconds; if you exceed that, move to an async queue.
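One possible way to apply that budget is to try inline processing first and hand the job to a background queue on timeout. The sketch below uses only the standard library; process_with_budget and its parameters are hypothetical names, and a real deployment would more likely use a durable queue than an in-process asyncio.Queue.

```python
import asyncio


async def process_with_budget(job, process, queue: asyncio.Queue,
                              budget_s: float = 5.0) -> str:
    """Run process(job) inline; if it exceeds the budget, enqueue the job.

    asyncio.wait_for cancels the inline attempt on timeout, so the job is
    only processed once the background worker picks it up from the queue.
    """
    try:
        await asyncio.wait_for(process(job), timeout=budget_s)
        return "inline"
    except asyncio.TimeoutError:
        await queue.put(job)
        return "queued"
```

The return value makes it easy to track what fraction of sessions miss the budget, which is the signal for moving all end-of-session work onto the queue.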

The retrieval path

At the start of each turn:

  1. Load working memory synchronously.
  2. Load the user’s semantic facts bundle from Postgres (single query, cached).
  3. Query the vector index with an embedding of the current user message and keep the top 3 to 5 episodes.
  4. Merge into a single "memory context" block prepended to the system prompt.
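A minimal sketch of these retrieval steps, assuming embeddings are plain lists of floats and the stores are in-memory dictionaries. The function names and the layout of the context block are hypothetical; a real vector store would do the similarity ranking itself.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_episodes(query_vec: list, vectors: dict, k: int = 3) -> list:
    """Return the ids of the k episodes most similar to the query vector."""
    ranked = sorted(vectors, key=lambda eid: cosine(query_vec, vectors[eid]),
                    reverse=True)
    return ranked[:k]


def build_memory_context(working: list, facts: dict, summaries: dict,
                         vectors: dict, query_vec: list, k: int = 3) -> str:
    """Merge all layers into one text block for the system prompt."""
    lines = ["Known facts:"]
    lines += [f"- {key}: {value}" for key, value in sorted(facts.items())]
    lines.append("Relevant past sessions:")
    lines += [f"- {summaries[eid]}"
              for eid in top_k_episodes(query_vec, vectors, k)]
    lines.append("Current conversation:")
    lines += working
    return "\n".join(lines)
```

Keeping k small bounds the number of prompt tokens that memory adds to each turn.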

Multi-tenant isolation

Every row carries tenant_id and user_id, and every query is parameterised on both. Enforce row-level security in Postgres and use a namespace per tenant in the vector store. Failure here is not a bug, it is a breach.
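The parameterisation half of this rule can be illustrated with the standard-library sqlite3 module standing in for Postgres. The table layout and the load_facts helper are assumptions made for the example. SQLite has no row-level security, so the Postgres policy itself is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE facts (
        tenant_id TEXT NOT NULL,
        user_id   TEXT NOT NULL,
        key       TEXT NOT NULL,
        value     TEXT NOT NULL,
        PRIMARY KEY (tenant_id, user_id, key)
    )
    """
)
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?, ?)",
    [
        ("acme", "u1", "city", "Berlin"),
        ("globex", "u1", "city", "Paris"),  # same user_id, different tenant
    ],
)


def load_facts(conn: sqlite3.Connection, tenant_id: str, user_id: str) -> dict:
    """Load one user's facts; both identifiers are mandatory bind parameters."""
    if not tenant_id or not user_id:
        raise ValueError("tenant_id and user_id are both required")
    rows = conn.execute(
        "SELECT key, value FROM facts WHERE tenant_id = ? AND user_id = ?",
        (tenant_id, user_id),
    ).fetchall()
    return dict(rows)
```

Because user ids are only unique within a tenant in this sketch, a query that filtered on user_id alone would return both rows, which is exactly the cross-tenant leak the rule is meant to prevent.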

Privacy and deletion

Three guarantees you must be able to offer:

  • Right-to-erasure by user — a single API call removes rows from all four layers.
  • Right-to-access — a dump endpoint that returns everything stored about the user in human-readable form.
  • Purpose scoping — memories tagged with the purpose they were created for; retrieval filters by active purpose.
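The three guarantees can be sketched against in-memory lists of records, one list per layer. All names below are hypothetical, and a real implementation would issue deletes against Redis, Postgres, and the vector store rather than filtering Python lists.

```python
def _owned(record: dict, tenant_id: str, user_id: str) -> bool:
    """True if the record belongs to this user within this tenant."""
    return record["tenant_id"] == tenant_id and record["user_id"] == user_id


def erase_user(layers: dict, tenant_id: str, user_id: str) -> dict:
    """Right to erasure: delete the user's records from every layer.

    Returns the number of records removed per layer, useful as an audit trail.
    """
    removed = {}
    for name, records in layers.items():
        keep = [r for r in records if not _owned(r, tenant_id, user_id)]
        removed[name] = len(records) - len(keep)
        records[:] = keep  # mutate in place so other references see the deletion
    return removed


def export_user(layers: dict, tenant_id: str, user_id: str) -> dict:
    """Right to access: return everything stored about the user, per layer."""
    return {
        name: [r for r in records if _owned(r, tenant_id, user_id)]
        for name, records in layers.items()
    }


def filter_by_purpose(records: list, active_purpose: str) -> list:
    """Purpose scoping: keep only memories created for the active purpose."""
    return [r for r in records if r.get("purpose") == active_purpose]
```

Driving erasure from one loop over all layers makes it harder to add a fifth store later and forget to include it in deletion.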

See our GDPR guide for the legal framing.

Cost model

Rough per-user monthly cost at 100 sessions/month, with 3k-token session transcripts fed to the summariser:

Component               Cost
Summarisation (Haiku)   ~$0.005
Embedding               ~$0.001
Postgres storage        < $0.001
Vector index storage    ~$0.002
Total                   ~$0.01 per user per month
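The total is just the sum of the components, which makes it easy to project across a user base. In the sketch below the figures are copied from the table, with Postgres storage taken at its upper bound of $0.001. The variable and function names are illustrative.

```python
# Per-user monthly cost components in USD, taken from the table above.
components = {
    "summarisation": 0.005,
    "embedding": 0.001,
    "postgres_storage": 0.001,   # table says "< $0.001"; upper bound used
    "vector_index_storage": 0.002,
}

# About 0.009, which the table rounds to ~$0.01 per user per month.
per_user = sum(components.values())


def monthly_cost(users: int) -> float:
    """Projected monthly write-side memory cost for a given number of users."""
    return users * per_user
```

At 10 million users this comes to roughly $90,000 per month, before the read-time prompt tokens discussed next.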

Retrieval cost at read time is dominated by the prompt tokens the memory adds — typically 500-1000 tokens per turn, cached after the first hit.

Common mistakes

  • No contradiction detection on write — user says "actually I moved to Berlin" and both addresses end up in the index.
  • No decay — a three-year-old preference outvotes a three-day-old one in retrieval.
  • Only vector search — "what did I say about pricing last Tuesday" is a metadata query, not a semantic one.
  • Storing full transcripts — bloats the index, dilutes retrieval, inflates GDPR exposure.
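Two of these mistakes have short fixes that can be sketched directly. The first function applies exponential decay to a similarity score. The 90-day half-life is an assumed value for illustration, not a recommendation from this design. The second answers a "last Tuesday" style question with a metadata filter instead of a vector search. All names are hypothetical.

```python
from datetime import datetime


def decayed_score(similarity: float, age_days: float,
                  half_life_days: float = 90.0) -> float:
    """Halve a memory's retrieval score for every half-life of age."""
    return similarity * 0.5 ** (age_days / half_life_days)


def episodes_in_window(episodes: list, topic: str,
                       start: datetime, end: datetime) -> list:
    """Metadata query: episodes tagged with a topic inside [start, end)."""
    return [
        e for e in episodes
        if topic in e["topics"] and start <= e["created_at"] < end
    ]
```

With decay applied, a three-day-old preference with a moderate similarity score outranks a three-year-old one with a higher raw score, because the older score has been halved roughly twelve times.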
