Cross-session memory is the architectural difference between a demo and a product. Going from "it works for me" to "it works for 10 million users" requires a storage layout, an ingestion path, and a deletion story that are all non-obvious. Here is a reference design we use in production.
## The four storage layers
A scalable memory system does not live in a single store. Four layers each do something the others cannot:
- Hot working memory — Redis or an in-process map. Current conversation state, readable within milliseconds.
- Semantic facts — Postgres table of small key-value facts per user (preferences, handles, long-term context).
- Episodic log — Postgres + text search on session summaries, indexed by time and topic.
- Vector index — pgvector, Pinecone or Qdrant for semantic search across episodes.
Each layer has a different access pattern; combining them gives a balanced retrieval stack.
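A compact sketch of how the four layers can sit behind one object. The field names and in-memory stand-ins are illustrative only; production would back them with Redis, Postgres, and a vector store:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryStack:
    # In-memory stand-ins for the four layers; each has a different access pattern.
    working: dict = field(default_factory=dict)   # hot state (Redis / in-process)
    facts: dict = field(default_factory=dict)     # semantic facts (Postgres)
    episodes: list = field(default_factory=list)  # episodic log (Postgres + text search)
    vectors: dict = field(default_factory=dict)   # episode_id -> embedding (vector index)

stack = MemoryStack()
stack.working["session:42"] = {"turn": 3}
stack.facts["user:7"] = {"city": "Berlin", "handle": "@anna"}
```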
## The ingestion path
At the end of each session:
- A summariser agent compresses the transcript to roughly 200 tokens plus a JSON facts bundle.
- The facts bundle merges into the semantic store with contradiction detection.
- The summary goes to the episodic log with a timestamp and topic tags.
- The embedding of the summary goes to the vector index, keyed by episode id.
Latency target: end-of-session processing under 5 seconds; if you exceed that, move to an async queue.
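The four ingestion steps can be sketched as one function. `summarise` and `embed` are placeholders for the real model calls, and "newest value wins" stands in for proper contradiction detection:

```python
import time

def summarise(transcript: str) -> tuple[str, dict]:
    # Placeholder for the summariser agent: ~200-token summary + JSON facts bundle.
    return transcript[:200], {"city": "Berlin"}

def embed(text: str) -> list[float]:
    # Placeholder for a real embedding call.
    return [float(len(text))]

def ingest_session(store: dict, episode_id: str, transcript: str, topics: list[str]) -> None:
    summary, facts = summarise(transcript)
    # Merge facts into the semantic store; newest value wins on a key clash.
    store["facts"].update(facts)
    # Episodic log entry, indexed by time and topic.
    store["episodes"].append(
        {"id": episode_id, "summary": summary, "ts": time.time(), "topics": topics}
    )
    # Vector index entry, keyed by episode id.
    store["vectors"].append((episode_id, embed(summary)))

store = {"facts": {}, "episodes": [], "vectors": []}
ingest_session(store, "ep-001", "User mentioned moving to Berlin...", ["relocation"])
```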
## The retrieval path
At the start of each turn:
- Load working memory synchronously.
- Load the user’s semantic facts bundle from Postgres (single query, cached).
- Hit the vector index with the current user message, top-k = 3-5 episodes.
- Merge into a single "memory context" block prepended to the system prompt.
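The merge step might look like the following. The block format and labels are illustrative, not a prescribed prompt layout:

```python
def build_memory_context(facts: dict, episodes: list[dict]) -> str:
    # Merge semantic facts and retrieved episodes into one block
    # that is prepended to the system prompt.
    lines = ["[memory context]"]
    lines += [f"fact: {k} = {v}" for k, v in facts.items()]
    lines += [f"episode ({e['id']}): {e['summary']}" for e in episodes]
    return "\n".join(lines)

context = build_memory_context(
    {"city": "Berlin"},
    [{"id": "ep-001", "summary": "Discussed pricing tiers."}],
)
```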
## Multi-tenant isolation
Every row carries tenant_id and user_id. Every query is parameterised on both. Row-level security in Postgres; namespace-per-tenant in vector stores. Failure here is not a bug, it is a breach.
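A minimal sketch of the query discipline, using SQLite here so the example is self-contained; in production this is Postgres with row-level security as a backstop, and the table and column names are illustrative:

```python
import sqlite3

def fetch_facts(conn, tenant_id: str, user_id: str) -> dict:
    # Every read filters on BOTH tenant_id and user_id, always parameterised.
    # In Postgres, row-level security enforces this even if application code forgets.
    rows = conn.execute(
        "SELECT key, value FROM semantic_facts WHERE tenant_id = ? AND user_id = ?",
        (tenant_id, user_id),
    ).fetchall()
    return dict(rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE semantic_facts (tenant_id TEXT, user_id TEXT, key TEXT, value TEXT)")
conn.execute("INSERT INTO semantic_facts VALUES ('t1', 'u1', 'city', 'Berlin')")
conn.execute("INSERT INTO semantic_facts VALUES ('t2', 'u1', 'city', 'Paris')")
```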
## Privacy and deletion
Three guarantees you must be able to offer:
- Right-to-erasure by user — a single API call removes rows from all four layers.
- Right-to-access — a dump endpoint that returns everything stored about the user in human-readable form.
- Purpose scoping — memories tagged with the purpose they were created for; retrieval filters by active purpose.
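The erasure guarantee in particular is easy to get wrong by forgetting a layer. A sketch with in-memory stand-ins for the four stores (structure and field names are illustrative):

```python
def erase_user(store: dict, tenant_id: str, user_id: str) -> None:
    # Right-to-erasure: one call removes the user from all four layers.
    key = (tenant_id, user_id)
    store["working"].pop(key, None)   # hot working memory
    store["facts"].pop(key, None)     # semantic facts
    store["episodes"] = [e for e in store["episodes"] if e["owner"] != key]
    store["vectors"] = [v for v in store["vectors"] if v["owner"] != key]

store = {
    "working": {("t1", "u1"): {"turn": 3}},
    "facts": {("t1", "u1"): {"city": "Berlin"}},
    "episodes": [{"owner": ("t1", "u1"), "summary": "Discussed pricing."}],
    "vectors": [{"owner": ("t1", "u1"), "embedding": [0.1, 0.2]}],
}
erase_user(store, "t1", "u1")
```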
See our GDPR guide for the legal framing.
## Cost model
Rough per-user monthly cost at 100 sessions/month, 3k-token session transcripts:
| Component | Cost |
|---|---|
| Summarisation (Haiku) | ~$0.005 |
| Embedding | ~$0.001 |
| Postgres storage | < $0.001 |
| Vector index storage | ~$0.002 |
| Total | ~$0.01 per user per month |
Retrieval cost at read time is dominated by the prompt tokens the memory adds — typically 500-1000 tokens per turn, cached after the first hit.
## Common mistakes
- No contradiction detection on write — user says "actually I moved to Berlin" and both addresses end up in the index.
- No decay — a three-year-old preference outvotes a three-day-old one in retrieval.
- Only vector search — "what did I say about pricing last Tuesday" is a metadata query, not a semantic one.
- Storing full transcripts — bloats the index, dilutes retrieval, inflates GDPR exposure.
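The decay mistake can be countered with a simple recency weight at scoring time. The exponential half-life used here is an assumed tuning parameter, not a recommendation from the design above:

```python
def decayed_score(similarity: float, age_days: float, half_life_days: float = 90.0) -> float:
    # Weight raw vector similarity by an exponential recency decay,
    # so a three-year-old preference no longer outvotes a three-day-old one.
    return similarity * 0.5 ** (age_days / half_life_days)

old = decayed_score(0.90, age_days=3 * 365)  # strong match, three years old
new = decayed_score(0.75, age_days=3)        # weaker match, three days old
# With decay applied, the recent episode now ranks higher.
```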