Guide · 4 min read

Vector memory for AI agents: why pure embeddings are not enough

Vector search is a building block, not a memory system. Here is how production agents combine vectors with graph, summary, and episodic stores to actually remember.

"Add a vector DB and the agent will remember" — a sentence that has burned more engineering hours than it has solved problems. Vectors are one piece of agent memory, not the whole. This is what a complete memory layer actually looks like.

What vectors do well

Semantic search. Given a query like "that customer who complained about shipping last month," vectors find the right conversation even when the exact words differ. Cosine-similarity over embeddings is the right tool for fuzzy recall.

Good embedding stores today: Qdrant, Weaviate, Milvus, pgvector for Postgres-native setups, and LanceDB for embedded use. All of them serve similarity queries over millions of vectors within hundreds of milliseconds.
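The fuzzy-recall mechanism is simple at its core. A minimal sketch in pure Python with toy 3-dimensional vectors (real embeddings come from a model and have hundreds to thousands of dimensions; the vectors and texts below are invented for illustration):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for three stored memories.
docs = {
    "complaint about late shipping": [0.9, 0.1, 0.0],
    "question about pricing tiers":  [0.1, 0.9, 0.1],
    "praise for support team":       [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # stands in for embedding "delivery took too long"

ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked[0])  # prints: complaint about late shipping
```

The query shares no words with the stored complaint, yet it ranks first because the vectors are close. That is exactly the "fuzzy recall" the stores above index at scale.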

What vectors do badly

Four things vectors fail at:

  1. Exact retrieval. "What is the user's email?" is a lookup, not a search. Vector search returns three plausible emails, not one correct one.
  2. Relationship traversal. "Find all orders by customers in the region this agent is working on." Needs joins, not similarity.
  3. Temporal ordering. "What was the last thing we discussed?" A vector DB returns the most similar thing, not the most recent.
  4. Counting / aggregation. "How many times has the user asked about pricing?" Vectors do not count; they rank.

Each of these calls for a different store. A production memory layer is hybrid by construction.
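To make the contrast concrete, here is a toy sketch (invented data) of failure modes 1 and 4: exact retrieval is a keyed lookup and counting is aggregation over structured fields, and neither is a similarity ranking:

```python
from collections import Counter

# Toy memory: extracted facts, each tagged with a structured topic field.
facts = [
    {"topic": "pricing",  "text": "asked about enterprise pricing"},
    {"topic": "shipping", "text": "complained about late delivery"},
    {"topic": "pricing",  "text": "asked whether pricing includes support"},
]

# Counting is aggregation, not ranking: vectors would return the
# most pricing-like fact, not the number of pricing mentions.
pricing_mentions = Counter(f["topic"] for f in facts)["pricing"]
print(pricing_mentions)  # prints: 2

# Exact retrieval is a keyed lookup: one correct answer,
# not three plausible candidates.
profile = {"email": "user@example.com", "plan": "pro"}
print(profile["email"])
```

The point is not that this code is hard; it is that these query shapes need a store with keys and fields, which a bag of vectors does not have.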

The five-layer memory stack

A complete agent memory, from hot to cold:

| Layer | Store | Purpose | Example query |
| --- | --- | --- | --- |
| Working | in-memory / context | current turn | "what did the user just say" |
| Episodic | append-only log + index | recent interactions, ordered | "what did we discuss yesterday" |
| Semantic (vector) | Qdrant / pgvector | fuzzy recall of facts | "anything about shipping issues" |
| Relational | Postgres / graph DB | entities and their links | "orders by this user in Q1" |
| Summary | summarised blobs | compressed long-term | "what has this user cared about over months" |

An agent's memory call flow picks the right layer for each query type rather than defaulting to one store. Routing logic (a small classifier) decides which store to hit.
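A router can start as something far cruder than a trained classifier. The sketch below is a toy keyword heuristic (all keyword lists and layer names are illustrative, keyed to the table above); a production system would replace it with a small classifier model:

```python
def route(query: str) -> str:
    """Toy keyword router over the five-layer stack.
    Production systems would use a small classifier instead."""
    q = query.lower()
    if any(w in q for w in ("how many", "count", "total")):
        return "relational"   # aggregation -> SQL
    if any(w in q for w in ("last", "yesterday", "recent", "just")):
        return "episodic"     # temporal ordering -> append-only log
    if any(w in q for w in ("orders by", "related to", "linked")):
        return "relational"   # joins -> SQL / graph
    if any(w in q for w in ("over months", "overall", "cared about")):
        return "summary"      # long-horizon -> compressed summaries
    return "semantic"         # default: fuzzy recall -> vectors

print(route("what did we discuss yesterday"))   # prints: episodic
print(route("anything about shipping issues"))  # prints: semantic
```

Even this crude version captures the key design point: the default branch is semantic search, and everything else is carved out by query shape.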

A concrete hybrid pattern

Combine pgvector (semantic) with a knowledge graph (relational). Here is the rough shape for a customer-support agent:

-- Facts table: fact text + its embedding
CREATE TABLE facts (
  id BIGSERIAL PRIMARY KEY,
  subject_id UUID NOT NULL,
  predicate TEXT NOT NULL,
  object TEXT NOT NULL,
  text TEXT NOT NULL,
  embedding VECTOR(1536) NOT NULL,
  created_at TIMESTAMPTZ DEFAULT now()
);
CREATE INDEX facts_embedding_idx ON facts USING ivfflat (embedding vector_cosine_ops);
CREATE INDEX facts_subject_idx  ON facts (subject_id);

A query like "what has this customer said about refunds" becomes:

-- 1. Find the customer (exact)
-- 2. Semantic search within their facts
SELECT text FROM facts
WHERE subject_id = $customer_id
ORDER BY embedding <=> $query_embedding
LIMIT 5;

You get the precision of SQL and the recall of embeddings. One store, two indices.
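The filter-then-rank shape of that SQL query can be sketched in pure Python to show the mechanics (toy in-memory facts and 2-dim vectors, all invented):

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Toy facts table: (subject_id, text, embedding)
facts = [
    ("cust-1", "wants a refund for order 118", [0.90, 0.10]),
    ("cust-1", "asked about the pro plan",     [0.10, 0.90]),
    ("cust-2", "refund issued for order 204",  [0.95, 0.05]),
]

def recall(subject_id, query_embedding, k=5):
    # 1. Exact filter -- the WHERE subject_id clause.
    scoped = [f for f in facts if f[0] == subject_id]
    # 2. Semantic rank within the filtered set -- the ORDER BY <=> clause.
    scoped.sort(key=lambda f: cosine(f[2], query_embedding), reverse=True)
    return [f[1] for f in scoped[:k]]

print(recall("cust-1", [0.85, 0.15]))
```

Note that cust-2's refund fact never surfaces, however similar its vector is: the exact filter runs first, which is precisely what a pure vector store cannot guarantee.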

For deeper relational needs, promote facts to a property graph (Neo4j, Memgraph, or Apache AGE for Postgres). The pattern is the same: use the index that matches the query shape.

Managed services vs. self-hosted

The managed-memory space is crowded as of April 2026:

  • mem0 — opinionated, fact-extraction-first. Good default.
  • Zep — temporal knowledge graph. Strong for long-running chat.
  • Letta (ex-MemGPT) — memory-first agent runtime.
  • Pinecone Assistants — vector-heavy, less graph.

Trade-offs mirror any managed-vs-self debate. If memory is strategic to your product, self-host. If it is a means to ship faster, rent.

For architectural context, pair this with persistent agent memory architecture and cross-session agent memory.

The embedding choice

Pick embeddings based on three properties:

  1. Dimensionality vs. cost. 1536-dim (OpenAI text-embedding-3-small) is a common default. Smaller models, such as Cohere's 1024-dim v3 embeddings, cost a third less storage and often land within a couple of points of recall.
  2. Domain fit. Code needs code-tuned embeddings (e.g. voyage-code-3). Generic embeddings confuse "list" the data structure with "list" the verb.
  3. Multilingual. If your users write in languages other than English, test recall across those languages before committing.

Re-embedding when you switch models is expensive. Keep raw text alongside embeddings so you can re-embed without re-ingesting.
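A minimal sketch of that re-embed path, assuming rows keep raw text and a model tag next to each vector (the row schema and `fake_embed_v2` stand-in are invented for illustration):

```python
# Each row keeps raw text next to its vector, tagged with the model that made it.
rows = [
    {"text": "complained about late delivery", "embedding": [0.9, 0.1], "model": "embed-v1"},
    {"text": "asked about enterprise pricing", "embedding": [0.1, 0.9], "model": "embed-v1"},
]

def fake_embed_v2(text: str) -> list:
    """Stand-in for a real embedding API call."""
    return [len(text) / 100.0, 0.5]

def reembed(rows, embed, model_id):
    """Switch models without re-ingesting: re-read the stored raw text."""
    for row in rows:
        if row["model"] != model_id:
            row["embedding"] = embed(row["text"])
            row["model"] = model_id
    return rows

reembed(rows, fake_embed_v2, "embed-v2")
print(sorted({r["model"] for r in rows}))  # prints: ['embed-v2']
```

Without the raw text column, the same migration means re-running ingestion, extraction, and chunking end to end.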

Common failure modes

  • Stale embeddings. Model upgrade, no re-embed → silent recall degradation. Tag every embedding with model ID.
  • Over-indexing. Storing every conversation turn as a separate vector blows up with noise. Extract facts, don't store transcripts.
  • No eviction. A production memory fills forever. Decide TTL or importance-based eviction on day one.
  • Wrong chunk size. 512-token chunks work for most content, not for legal or code. Test before committing.
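The eviction point deserves a sketch. One reasonable policy (the 90-day TTL and 0.8 importance floor are illustrative assumptions, not recommendations) is TTL with an importance override:

```python
import time

TTL_SECONDS = 90 * 24 * 3600  # assumed 90-day TTL
now = time.time()

# Toy memory rows with a creation time and an importance score in [0, 1].
memories = [
    {"text": "prefers email contact",      "created_at": now - 10 * 24 * 3600,  "importance": 0.90},
    {"text": "asked about the weather",    "created_at": now - 200 * 24 * 3600, "importance": 0.10},
    {"text": "enterprise contract terms",  "created_at": now - 400 * 24 * 3600, "importance": 0.95},
]

def evict(memories, ttl=TTL_SECONDS, keep_above=0.8):
    """Drop expired memories unless their importance clears a floor."""
    cutoff = time.time() - ttl
    return [m for m in memories
            if m["created_at"] >= cutoff or m["importance"] >= keep_above]

kept = evict(memories)
print([m["text"] for m in kept])
```

The expired low-importance memory is dropped; the old but high-importance one survives its TTL. Deciding this on day one is much cheaper than pruning millions of stale vectors later.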

Where this is heading

By mid-2027 expect hybrid memory to be a solved product category: drop-in stores that expose one API, route internally, and return blended results. Until then, build the hybrid yourself — the five-layer stack above will port forward cleanly.
