Cross-session memory is the architectural difference between a demo and a product. Going from "it works for me" to "it works for 10 million users" requires a storage layout, an ingestion path, and a deletion story that are all non-obvious. Here is a reference design we use in production.
## The four storage layers
A scalable memory system does not live in a single store. Four layers each do something the others cannot:
- Hot working memory — Redis or an in-process map. Current conversation state, readable within milliseconds.
- Semantic facts — Postgres table of small key-value facts per user (preferences, handles, long-term context).
- Episodic log — Postgres + text search on session summaries, indexed by time and topic.
- Vector index — pgvector, Pinecone or Qdrant for semantic search across episodes.
Each layer has a different access pattern; combining them gives a balanced retrieval stack.
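A compact sketch of how the four layers can sit behind one object. The field names and in-memory stand-ins are illustrative only; production would back them with Redis, Postgres, and a vector store:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryStack:
    # In-memory stand-ins for the four layers; each has a different access pattern.
    working: dict = field(default_factory=dict)   # hot state (Redis / in-process)
    facts: dict = field(default_factory=dict)     # semantic facts (Postgres)
    episodes: list = field(default_factory=list)  # episodic log (Postgres + text search)
    vectors: dict = field(default_factory=dict)   # episode_id -> embedding (vector index)

stack = MemoryStack()
stack.working["session:42"] = {"turn": 3}
stack.facts["user:7"] = {"city": "Berlin", "handle": "@anna"}
```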
## The ingestion path
At the end of each session:
- A summariser agent compresses the transcript to roughly 200 tokens plus a JSON facts bundle.
- The facts bundle merges into the semantic store with contradiction detection.
- The summary goes to the episodic log with a timestamp and topic tags.
- The embedding of the summary goes to the vector index, keyed by episode id.
Latency target: end-of-session processing under 5 seconds; if you exceed that, move to an async queue.
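The four ingestion steps can be sketched as one function. `summarise` and `embed` are placeholders for the real model calls, and "newest value wins" stands in for proper contradiction detection:

```python
import time

def summarise(transcript: str) -> tuple[str, dict]:
    # Placeholder for the summariser agent: ~200-token summary + JSON facts bundle.
    return transcript[:200], {"city": "Berlin"}

def embed(text: str) -> list[float]:
    # Placeholder for a real embedding call.
    return [float(len(text))]

def ingest_session(store: dict, episode_id: str, transcript: str, topics: list[str]) -> None:
    summary, facts = summarise(transcript)
    # Merge facts into the semantic store; newest value wins on a key clash.
    store["facts"].update(facts)
    # Episodic log entry, indexed by time and topic.
    store["episodes"].append(
        {"id": episode_id, "summary": summary, "ts": time.time(), "topics": topics}
    )
    # Vector index entry, keyed by episode id.
    store["vectors"].append((episode_id, embed(summary)))

store = {"facts": {}, "episodes": [], "vectors": []}
ingest_session(store, "ep-001", "User mentioned moving to Berlin...", ["relocation"])
```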
## The retrieval path
At the start of each turn:
- Load working memory synchronously.
- Load the user’s semantic facts bundle from Postgres (single query, cached).
- Hit the vector index with the current user message, top-k = 3-5 episodes.
- Merge into a single "memory context" block prepended to the system prompt.
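The merge step might look like the following. The block format and labels are illustrative, not a prescribed prompt layout:

```python
def build_memory_context(facts: dict, episodes: list[dict]) -> str:
    # Merge semantic facts and retrieved episodes into one block
    # that is prepended to the system prompt.
    lines = ["[memory context]"]
    lines += [f"fact: {k} = {v}" for k, v in facts.items()]
    lines += [f"episode ({e['id']}): {e['summary']}" for e in episodes]
    return "\n".join(lines)

context = build_memory_context(
    {"city": "Berlin"},
    [{"id": "ep-001", "summary": "Discussed pricing tiers."}],
)
```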
## Multi-tenant isolation
Every row carries tenant_id and user_id. Every query is parameterised on both. Row-level security in Postgres; namespace-per-tenant in vector stores. Failure here is not a bug, it is a breach.
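A minimal sketch of the query discipline, using SQLite here so the example is self-contained; in production this is Postgres with row-level security as a backstop, and the table and column names are illustrative:

```python
import sqlite3

def fetch_facts(conn, tenant_id: str, user_id: str) -> dict:
    # Every read filters on BOTH tenant_id and user_id, always parameterised.
    # In Postgres, row-level security enforces this even if application code forgets.
    rows = conn.execute(
        "SELECT key, value FROM semantic_facts WHERE tenant_id = ? AND user_id = ?",
        (tenant_id, user_id),
    ).fetchall()
    return dict(rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE semantic_facts (tenant_id TEXT, user_id TEXT, key TEXT, value TEXT)")
conn.execute("INSERT INTO semantic_facts VALUES ('t1', 'u1', 'city', 'Berlin')")
conn.execute("INSERT INTO semantic_facts VALUES ('t2', 'u1', 'city', 'Paris')")
```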
## Privacy and deletion
Three guarantees you must be able to offer:
- Right-to-erasure by user — a single API call removes rows from all four layers.
- Right-to-access — a dump endpoint that returns everything stored about the user in human-readable form.
- Purpose scoping — memories tagged with the purpose they were created for; retrieval filters by active purpose.
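The erasure guarantee in particular is easy to get wrong by forgetting a layer. A sketch with in-memory stand-ins for the four stores (structure and field names are illustrative):

```python
def erase_user(store: dict, tenant_id: str, user_id: str) -> None:
    # Right-to-erasure: one call removes the user from all four layers.
    key = (tenant_id, user_id)
    store["working"].pop(key, None)   # hot working memory
    store["facts"].pop(key, None)     # semantic facts
    store["episodes"] = [e for e in store["episodes"] if e["owner"] != key]
    store["vectors"] = [v for v in store["vectors"] if v["owner"] != key]

store = {
    "working": {("t1", "u1"): {"turn": 3}},
    "facts": {("t1", "u1"): {"city": "Berlin"}},
    "episodes": [{"owner": ("t1", "u1"), "summary": "Discussed pricing."}],
    "vectors": [{"owner": ("t1", "u1"), "embedding": [0.1, 0.2]}],
}
erase_user(store, "t1", "u1")
```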
See our GDPR guide for the legal framing.
## Cost model
Rough per-user monthly cost at 100 sessions/month, 3k-token session transcripts:
| Component | Cost |
|---|---|
| Summarisation (Haiku) | ~$0.005 |
| Embedding | ~$0.001 |
| Postgres storage | < $0.001 |
| Vector index storage | ~$0.002 |
| Total | ~$0.01 per user per month |
Retrieval cost at read time is dominated by the prompt tokens the memory adds — typically 500-1000 tokens per turn, cached after the first hit.
## Common mistakes
- No contradiction detection on write — user says "actually I moved to Berlin" and both addresses end up in the index.
- No decay — a three-year-old preference outvotes a three-day-old one in retrieval.
- Only vector search — "what did I say about pricing last Tuesday" is a metadata query, not a semantic one.
- Storing full transcripts — bloats the index, dilutes retrieval, inflates GDPR exposure.
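The decay mistake can be countered with a simple recency weight at scoring time. The exponential half-life used here is an assumed tuning parameter, not a recommendation from the design above:

```python
def decayed_score(similarity: float, age_days: float, half_life_days: float = 90.0) -> float:
    # Weight raw vector similarity by an exponential recency decay,
    # so a three-year-old preference no longer outvotes a three-day-old one.
    return similarity * 0.5 ** (age_days / half_life_days)

old = decayed_score(0.90, age_days=3 * 365)  # strong match, three years old
new = decayed_score(0.75, age_days=3)        # weaker match, three days old
# With decay applied, the recent episode now ranks higher.
```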