Explainer · 7 min read

Persistent agent memory architecture: a reference design that scales

A reference architecture for agent memory that scales from 10 users to 10 million. Storage layers, retrieval strategy, privacy, and cost.

Cross-session memory is the architectural difference between a demo and a product. Going from "it works for me" to "it works for 10 million users" requires a storage layout, an ingestion path, and a deletion story that are all non-obvious. Here is a reference design we use in production.

The four storage layers

A scalable memory system does not live in a single store. Four layers each do something the others cannot:

  1. Hot working memory — Redis or in-process map. Current conversation state, live within milliseconds.
  2. Semantic facts — Postgres table of small key-value facts per user (preferences, handles, long-term context).
  3. Episodic log — Postgres + text search on session summaries, indexed by time and topic.
  4. Vector index — pgvector, Pinecone or Qdrant for semantic search across episodes.

Each layer has a different access pattern; combining them gives a balanced retrieval stack.
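As a minimal sketch, the four layers can be modelled behind one interface. The in-memory containers below are stand-ins for the real backends (Redis, Postgres, a vector store); all names here are illustrative, not part of the reference design:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryStack:
    """Four-layer memory; each attribute stands in for a real backend."""
    working: dict = field(default_factory=dict)   # hot working memory (Redis in prod)
    facts: dict = field(default_factory=dict)     # semantic key-value facts (Postgres)
    episodes: list = field(default_factory=list)  # episodic log (Postgres + text search)
    vectors: list = field(default_factory=list)   # (episode_id, embedding) pairs (pgvector etc.)

stack = MemoryStack()
stack.facts["timezone"] = "Europe/Berlin"
stack.episodes.append({"id": "ep1", "summary": "Discussed pricing", "topic": "pricing"})
```

In production each attribute becomes a client for the corresponding store; the point is that callers see one stack, not four databases.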

The ingestion path

At the end of each session:

  1. A summariser agent compresses the transcript to 200 tokens plus a JSON facts bundle.
  2. The facts bundle merges into the semantic store with contradiction detection.
  3. The summary goes to the episodic log with a timestamp and topic tags.
  4. The embedding of the summary goes to the vector index, keyed by episode id.

Latency target: end-of-session processing under 5 seconds; if you exceed that, move to an async queue.
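The four steps above can be sketched as one function. The summariser and embedder here are deterministic stand-ins (a real pipeline would call an LLM and an embedding model), and the contradiction handling is deliberately simplified to "newest wins":

```python
import hashlib
import time

def fake_summarise(transcript: str) -> tuple[str, dict]:
    # Stand-in for the summariser agent; real code would call an LLM.
    summary = transcript[:200]
    facts = {"last_topic": "pricing"}  # illustrative extracted facts
    return summary, facts

def fake_embed(text: str) -> list[float]:
    # Deterministic stand-in embedding; use a real model in production.
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:8]]

def ingest_session(stores: dict, user_id: str, transcript: str) -> str:
    summary, facts = fake_summarise(transcript)              # step 1: compress
    user_facts = stores["facts"].setdefault(user_id, {})
    for key, value in facts.items():
        if key in user_facts and user_facts[key] != value:
            pass  # step 2: contradiction detected; here the newest value wins
        user_facts[key] = value
    episode_id = f"{user_id}-{int(time.time())}"
    stores["episodes"].append({"id": episode_id, "user": user_id,
                               "summary": summary, "ts": time.time()})  # step 3
    stores["vectors"].append((episode_id, fake_embed(summary)))          # step 4
    return episode_id
```

Since nothing in this function needs to block the user, the whole call is a natural unit to push onto an async queue once it exceeds the latency budget.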

The retrieval path

At the start of each turn:

  1. Load working memory synchronously.
  2. Load the user’s semantic facts bundle from Postgres (single query, cached).
  3. Hit the vector index with the current user message, top-k = 3-5 episodes.
  4. Merge into a single "memory context" block prepended to the system prompt.

Multi-tenant isolation

Every row carries tenant_id and user_id. Every query is parameterised on both. Row-level security in Postgres; namespace-per-tenant in vector stores. Failure here is not a bug, it is a breach.
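Postgres row-level security is the real backstop; the sketch below (using SQLite purely so it runs anywhere) only demonstrates the query-level discipline of parameterising every read on both IDs, never interpolating them:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (tenant_id TEXT, user_id TEXT, key TEXT, value TEXT)")
conn.executemany("INSERT INTO facts VALUES (?, ?, ?, ?)",
                 [("t1", "u1", "city", "Berlin"),
                  ("t2", "u9", "city", "Lisbon")])

def get_facts(tenant_id: str, user_id: str) -> list[tuple]:
    # Every read is parameterised on BOTH ids; a caller cannot forget either.
    return conn.execute(
        "SELECT key, value FROM facts WHERE tenant_id = ? AND user_id = ?",
        (tenant_id, user_id),
    ).fetchall()
```

Wrapping access in helpers like `get_facts` means a missing tenant filter is a compile-away impossibility rather than a code-review hope.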

Privacy and deletion

Three guarantees you must be able to offer:

  • Right-to-erasure by user — a single API call removes rows from all four layers.
  • Right-to-access — a dump endpoint that returns everything stored about the user in human-readable form.
  • Purpose scoping — memories tagged with the purpose they were created for; retrieval filters by active purpose.

See our GDPR guide for the legal framing.
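The erasure guarantee, sketched against the same kind of in-memory stand-ins as above (the `tenant:user:` episode-id prefix is an illustrative keying convention, not part of the design):

```python
def erase_user(stores: dict, tenant_id: str, user_id: str) -> None:
    """One call removes the user from all four layers."""
    key = (tenant_id, user_id)
    stores["working"].pop(key, None)                      # hot working memory
    stores["facts"].pop(key, None)                        # semantic facts
    stores["episodes"] = [e for e in stores["episodes"]   # episodic log
                          if (e["tenant"], e["user"]) != key]
    stores["vectors"] = [v for v in stores["vectors"]     # vector index
                         if not v[0].startswith(f"{tenant_id}:{user_id}:")]
```

The right-to-access dump endpoint is the same traversal with reads instead of deletes, which is a good argument for routing both through one per-layer interface.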

Cost model

Rough per-user monthly cost at 100 sessions/month, with ~3k-token transcripts summarised per session:

Component               Cost
Summarisation (Haiku)   ~$0.005
Embedding               ~$0.001
Postgres storage        < $0.001
Vector index storage    ~$0.002
Total                   ~$0.01 per user per month

Retrieval cost at read time is dominated by the prompt tokens the memory adds — typically 500-1000 tokens per turn, cached after the first hit.
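The arithmetic behind the table, with the "< $0.001" Postgres line treated as an upper bound; the component sum comes out at $0.009, which the table rounds to ~$0.01:

```python
# Per-user monthly cost components from the table above (USD).
costs = {
    "summarisation": 0.005,   # Haiku summariser
    "embedding": 0.001,
    "postgres_storage": 0.001,  # table says < $0.001; upper bound used here
    "vector_index": 0.002,
}
total = sum(costs.values())
```

At this scale the write path is effectively free; the costs that matter are the 500-1000 memory tokens added to every read.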

Common mistakes

  • No contradiction detection on write — user says "actually I moved to Berlin" and both addresses end up in the index.
  • No decay — a three-year-old preference outvotes a three-day-old one in retrieval.
  • Only vector search — "what did I say about pricing last Tuesday" is a metadata query, not a semantic one.
  • Storing full transcripts — bloats the index, dilutes retrieval, inflates GDPR exposure.
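The decay mistake has a one-line fix at scoring time. A common approach (an illustrative choice here, not the article's prescribed one) is exponential time decay on the similarity score; the half-life parameter is an assumption to tune:

```python
def decayed_score(similarity: float, age_days: float,
                  half_life_days: float = 30.0) -> float:
    # Exponential decay: a memory loses half its weight every half_life_days.
    return similarity * 0.5 ** (age_days / half_life_days)
```

With a 30-day half-life, a three-day-old memory keeps most of its weight while a three-year-old one is effectively silenced, even if its raw similarity is slightly higher.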
© 2026 Loadout. Built on Angular 21 SSR.