"Add a vector DB and the agent will remember" — a sentence that has burned more engineering hours than it has solved problems. Vectors are one piece of agent memory, not the whole. This is what a complete memory layer actually looks like.
## What vectors do well
Semantic search. Given a query like "that customer who complained about shipping last month," vectors find the right conversation even when the exact words differ. Cosine-similarity over embeddings is the right tool for fuzzy recall.
Good embedding stores today: Qdrant, Weaviate, Milvus, pgvector for Postgres-native setups, and LanceDB for embedded use. All of them serve approximate-nearest-neighbour queries over millions of vectors in milliseconds.
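The mechanic underneath all of these is the same: rank stored vectors by cosine similarity to the query vector. A minimal sketch with toy 3-dim embeddings (real ones are 768–3072 dims from a model, and a real store replaces the brute-force loop with an ANN index):

```python
from math import sqrt

def cosine(a, b):
    # Cosine similarity: dot product over the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Toy "embeddings" -- hand-picked so related texts point in similar directions.
memories = {
    "customer complained about late shipping": [0.9, 0.1, 0.2],
    "customer asked about pricing tiers":      [0.1, 0.9, 0.1],
    "order delayed at the warehouse":          [0.7, 0.3, 0.4],
}

query = [0.85, 0.15, 0.25]  # e.g. "that shipping complaint last month"
best = max(memories, key=lambda text: cosine(query, memories[text]))
print(best)  # the shipping complaint ranks first despite different wording
```

This is exactly the fuzzy recall vectors are good at: the top hit shares almost no words with the query, only direction in embedding space.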
## What vectors do badly
Four things vectors fail at:
- Exact retrieval. "What is the user's email?" is a lookup, not a search. Vector search returns three plausible emails, not one correct one.
- Relationship traversal. "Find all orders by customers in the region this agent is working on." Needs joins, not similarity.
- Temporal ordering. "What was the last thing we discussed?" A vector DB returns the most similar thing, not the most recent.
- Counting / aggregation. "How many times has the user asked about pricing?" Vectors do not count; they rank.
Each of these calls for a different store. A production memory layer is hybrid by construction.
## The five-layer memory stack
A complete agent memory, from hot to cold:
| Layer | Store | Purpose | Example query |
|---|---|---|---|
| Working | in-memory / context | current turn | "what did the user just say" |
| Episodic | append-only log + index | recent interactions, ordered | "what did we discuss yesterday" |
| Semantic (vector) | Qdrant/pgvector | fuzzy recall of facts | "anything about shipping issues" |
| Relational | Postgres / graph DB | entities and their links | "orders by this user in Q1" |
| Summary | summarised blobs | compressed long-term | "what has this user cared about over months" |
An agent's memory call flow picks from the right layer based on the query type, not a fixed default. Routing logic (a small classifier) decides which store to hit.
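That router can start as nothing more than keyword rules before graduating to a trained classifier. A hypothetical sketch (the layer names match the table above; the patterns are illustrative, not a fixed taxonomy):

```python
import re

# Maps a query to the memory layer that matches its shape.
# Order matters: more specific patterns are checked first.
ROUTES = [
    (r"\b(how many|count|total)\b",         "relational"),  # aggregation -> SQL
    (r"\b(last|latest|yesterday|recent)\b", "episodic"),    # temporal -> ordered log
    (r"\b(email|order id|address|phone)\b", "relational"),  # exact field -> lookup
    (r"\b(just said|just say|current)\b",   "working"),     # this turn -> context
]

def route(query: str) -> str:
    q = query.lower()
    for pattern, layer in ROUTES:
        if re.search(pattern, q):
            return layer
    return "semantic"  # default: fuzzy recall over vectors

print(route("What is the user's email?"))         # relational
print(route("What was the last thing we said?"))  # episodic
print(route("Anything about shipping issues?"))   # semantic
```

Keyword rules like this are brittle, but they make the architecture honest from day one: every query declares which index shape it needs before any store is hit.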
## A concrete hybrid pattern
Combine pgvector (semantic) with a knowledge graph (relational). Here is the rough shape for a customer-support agent:
```sql
-- Facts table: fact text + its embedding
CREATE TABLE facts (
  id BIGSERIAL PRIMARY KEY,
  subject_id UUID NOT NULL,
  predicate TEXT NOT NULL,
  object TEXT NOT NULL,
  text TEXT NOT NULL,
  embedding VECTOR(1536) NOT NULL,
  created_at TIMESTAMPTZ DEFAULT now()
);

CREATE INDEX facts_embedding_idx ON facts USING ivfflat (embedding vector_cosine_ops);
CREATE INDEX facts_subject_idx ON facts (subject_id);
```
A query like "what has this customer said about refunds" becomes:
```sql
-- 1. Find the customer (exact)
-- 2. Semantic search within their facts
SELECT text FROM facts
WHERE subject_id = $customer_id
ORDER BY embedding <=> $query_embedding
LIMIT 5;
```
You get the precision of SQL and the recall of embeddings. One store, two indices.
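The same filter-then-rank shape, sketched in plain Python with an in-memory list standing in for the facts table (toy 2-dim embeddings; the two steps mirror the `WHERE` and `ORDER BY ... <=>` clauses):

```python
from math import sqrt

# (subject_id, text, embedding) rows -- a stand-in for the facts table.
FACTS = [
    ("cust-1", "asked for a refund on order 88", [0.9, 0.1]),
    ("cust-1", "praised the new dashboard",      [0.1, 0.9]),
    ("cust-2", "demanded a refund immediately",  [0.95, 0.05]),
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def recall(subject_id, query_embedding, limit=5):
    # Step 1: exact filter -- the SQL WHERE clause.
    mine = [(text, emb) for sid, text, emb in FACTS if sid == subject_id]
    # Step 2: semantic rank within the filtered set -- the ORDER BY.
    mine.sort(key=lambda row: cosine(query_embedding, row[1]), reverse=True)
    return [text for text, _ in mine[:limit]]

print(recall("cust-1", [0.9, 0.1]))
# cust-1's refund fact ranks first; cust-2's refund fact never appears.
```

The filter step is what vector-only setups lose: without it, cust-2's near-identical refund complaint would outrank cust-1's own history.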
For deeper relational needs, promote facts to a property graph (Neo4j, Memgraph, or the pg_graph extensions). The pattern is the same: use the index that matches the query shape.
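Relationship traversal over promoted facts is just edge-following. A toy adjacency-list sketch of "orders by customers in a region" (node names and edge labels are illustrative, not a schema recommendation):

```python
# Toy property graph: node -> list of (edge_label, target_node).
GRAPH = {
    "region:EU": [("contains", "cust-1"), ("contains", "cust-2")],
    "cust-1":    [("placed", "order-10")],
    "cust-2":    [("placed", "order-11"), ("placed", "order-12")],
}

def traverse(start, *labels):
    """Follow a chain of edge labels outward from a start node."""
    frontier = [start]
    for label in labels:
        frontier = [tgt for node in frontier
                    for lbl, tgt in GRAPH.get(node, [])
                    if lbl == label]
    return frontier

print(traverse("region:EU", "contains", "placed"))
# ['order-10', 'order-11', 'order-12']
```

No similarity score anywhere: the query is answered by structure alone, which is exactly why it belongs in a graph or relational store rather than a vector index.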
## Managed services vs. self-hosted
The managed-memory space is crowded as of April 2026:
- mem0 — opinionated, fact-extraction-first. Good default.
- Zep — temporal knowledge graph. Strong for long-running chat.
- Letta (ex-MemGPT) — memory-first agent runtime.
- Pinecone Assistants — vector-heavy, less graph.
Trade-offs mirror any managed-vs-self debate. If memory is strategic to your product, self-host. If it is a means to ship faster, rent.
For architectural context, pair this with persistent agent memory architecture and cross-session agent memory.
## The embedding choice
Pick embeddings based on three properties:
- Dimensionality vs. cost. 1536-dim (OpenAI text-embedding-3-small) is a common default; text-embedding-3-large is 3072-dim. 1024-dim Cohere v3 embeddings are a third less storage and often give up only a couple of points of recall.
- Domain fit. Code needs code-tuned embeddings (voyage-code-3, OpenAI with a code prefix). Generic embeddings confuse `list` the data structure with `list` the verb.
- Multilingual. If your users write in more than English, test recall across languages before committing.
Re-embedding when you switch models is expensive. Keep raw text alongside embeddings so you can re-embed without re-ingesting.
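Keeping raw text plus a model tag makes the migration a pure map over stored rows. A sketch, with `embed_fn` standing in for whatever embedding API you call (the row shape and model IDs here are illustrative):

```python
def re_embed(rows, new_model_id, embed_fn):
    """Re-embed every row whose embedding came from an older model.

    rows: dicts with 'text', 'embedding', 'model' keys.
    embed_fn: text -> vector; the new model's embedding call (assumed).
    """
    migrated = 0
    for row in rows:
        if row["model"] != new_model_id:
            row["embedding"] = embed_fn(row["text"])  # raw text, no re-ingest
            row["model"] = new_model_id               # tag for the next migration
            migrated += 1
    return migrated

rows = [
    {"text": "refund on order 88",  "embedding": [0.1], "model": "embed-v1"},
    {"text": "likes the dashboard", "embedding": [0.2], "model": "embed-v2"},
]
fake_embed = lambda text: [float(len(text))]   # stand-in for a real API call
print(re_embed(rows, "embed-v2", fake_embed))  # 1 -- only the v1 row migrates
```

Because every row carries its model ID, the migration is idempotent: running it twice touches nothing the second time, and mixed-model rows can never silently coexist in one index.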
## Common failure modes
- Stale embeddings. Model upgrade, no re-embed → silent recall degradation. Tag every embedding with model ID.
- Over-indexing. Storing every conversation turn as a separate vector blows up with noise. Extract facts, don't store transcripts.
- No eviction. A production memory grows without bound. Decide on TTL or importance-based eviction on day one.
- Wrong chunk size. 512-token chunks work for most content, not for legal or code. Test before committing.
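TTL and importance can be folded into one eviction pass by scoring each memory and keeping the top N. A sketch of such a scoring rule (the 7-day half-life and the importance field are arbitrary illustration, not canonical values):

```python
import time

def evict(memories, max_items, now=None):
    """Keep the max_items highest-scoring memories; drop the rest.

    Score decays with age (7-day half-life) and scales with a stored
    importance weight -- both knobs are illustrative.
    """
    now = now or time.time()
    half_life = 7 * 24 * 3600
    def score(m):
        age = now - m["created_at"]
        return m["importance"] * (0.5 ** (age / half_life))
    return sorted(memories, key=score, reverse=True)[:max_items]

now = 1_000_000_000
memories = [
    {"text": "old, trivial",   "created_at": now - 30 * 24 * 3600, "importance": 1},
    {"text": "old, critical",  "created_at": now - 30 * 24 * 3600, "importance": 10},
    {"text": "fresh, trivial", "created_at": now - 3600,           "importance": 1},
]
kept = evict(memories, max_items=2, now=now)
print([m["text"] for m in kept])  # the fresh and the old-but-critical survive
```

Pure TTL would have dropped the old-but-critical memory; pure importance would eventually drown in high-scoring stale facts. Blending both is the usual compromise.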
## Where this is heading
By mid-2027 expect hybrid memory to be a solved product category: drop-in stores that expose one API, route internally, and return blended results. Until then, build the hybrid yourself — the five-layer stack above will port forward cleanly.