Classical RAG is one retrieval, one prompt, one answer. Agentic RAG is many retrievals, refined by the agent's own intermediate output, woven into a multi-step plan. The pattern is different enough to deserve its own discipline. Here is what it looks like in production.
The shape of agentic RAG
Three things make agentic RAG different from classical RAG:
- Retrieval is a tool, not a preprocessing step. The agent decides when to call it.
- Queries are refined. First pass is the user prompt; later passes are model-generated based on what came back.
- Retrieved content lives in memory, not just in the current turn — the agent reasons over the cumulative pool.
The result is far more accurate on multi-hop questions, far more expensive on token usage, and far harder to debug.
Reference loop
```
user ask
→ planner: decompose into sub-questions
→ for each sub-question:
    retrieve(sub_q) → top_k chunks
    synthesise partial answer
    score confidence
    if low: refine query, retrieve again
→ merge partial answers
→ final answer with citations
```
The loop is bounded by a step budget (typically 8–12) and a token budget (typically 30k input tokens). When either is exhausted, the agent returns its best partial answer.
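A minimal sketch of that loop in Python, with `decompose`, `retrieve`, `synthesise`, `score_confidence`, `refine_query`, and `merge_with_citations` as hypothetical stand-ins for your planner, retriever, and model calls:

```python
MAX_STEPS = 10             # step budget (typically 8-12)
MAX_INPUT_TOKENS = 30_000  # token budget

def agentic_rag(question: str):
    steps = tokens_used = 0
    partials = []
    for sub_q in decompose(question):            # planner output
        query, answered = sub_q, False
        while not answered:
            if steps >= MAX_STEPS or tokens_used >= MAX_INPUT_TOKENS:
                # Budget hit: return the best partial answer so far.
                return merge_with_citations(question, partials)
            steps += 1
            chunks = retrieve(query, k=5)
            tokens_used += sum(c.token_count for c in chunks)  # assumes chunks carry a count
            partial = synthesise(sub_q, chunks)
            if score_confidence(partial, chunks) >= 0.7:  # threshold is illustrative
                partials.append(partial)
                answered = True
            else:
                query = refine_query(sub_q, chunks)  # low confidence: retry
    return merge_with_citations(question, partials)
```

The budget check sits at the top of the inner loop so a refinement can never start a retrieval it cannot afford.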
Where to put the retriever
Three options, in order of preference:
- As an MCP tool — the agent calls `search_docs(query, k)` like any other tool. Cleanest, replayable, easy to swap backends.
- As a function inside the agent loop — fine for single-purpose agents.
- As a preprocessor — only useful for the first pass; not enough for true agentic RAG.
Most production setups use MCP for the agent-driven part and a preprocessor for warm-start context.
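For the MCP option, here is a minimal server sketch using the Python MCP SDK's FastMCP; `index.hybrid_search` is a placeholder for whatever retrieval backend you run:

```python
# Minimal MCP server exposing retrieval as a tool.
# `index` / `index.hybrid_search` are placeholders for your backend.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("docs-retriever")

@mcp.tool()
def search_docs(query: str, k: int = 5) -> list[dict]:
    """Hybrid search over the document index; returns chunks with ids."""
    hits = index.hybrid_search(query, k=k)
    return [
        {"chunk_id": h.id, "text": h.text, "score": h.score, "source": h.source}
        for h in hits
    ]

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```

Returning explicit `chunk_id` fields here is what makes the citation discipline below enforceable.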
Index design
The index serves two patterns: broad first-pass retrieval and narrow follow-up retrieval. Optimise for both:
- Hybrid embedding + BM25 — handles fuzzy and exact-match queries equally well.
- Chunk size 400–800 tokens with 50–100 token overlap.
- Metadata fields for typed filtering (date, source, author).
- Per-tenant namespaces — non-negotiable for multi-tenant.
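A sketch of the hybrid half, scoped per tenant; `bm25`, `vectors`, and `embed` are placeholders for your lexical index, vector store, and embedder, and the two rankings are merged with reciprocal rank fusion:

```python
# Hybrid retrieval sketch: BM25 + embedding search merged via
# reciprocal rank fusion (RRF). Both searches are assumed to return
# ranked lists of chunk ids, already filtered to the tenant namespace.
def hybrid_search(query: str, tenant: str, k: int = 5) -> list[str]:
    lexical = bm25.search(query, k=50, namespace=tenant)             # exact-match strength
    semantic = vectors.search(embed(query), k=50, namespace=tenant)  # fuzzy strength

    scores: dict[str, float] = {}
    for ranking in (lexical, semantic):
        for rank, chunk_id in enumerate(ranking):
            # RRF: 1 / (60 + rank); 60 is the conventional damping constant.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (60 + rank)

    return sorted(scores, key=scores.get, reverse=True)[:k]
```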
Query refinement
The model writes follow-up queries badly without help. A small prompt template does the heavy lifting:
```
You retrieved these chunks. They do not answer: <sub-question>.
Write 1-3 alternative search queries that would. Different angles only.
```
Constrained queries beat free-form. Examples beat instructions.
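A sketch of that refinement step around a generic `llm.complete` placeholder — the one-query-per-line contract and the example outputs do the constraining:

```python
# Query refinement sketch. `llm.complete` stands in for your model call.
REFINE_TEMPLATE = """You retrieved these chunks. They do not answer: {sub_q}.
Write 1-3 alternative search queries that would. Different angles only.
One query per line, nothing else.

Chunks:
{chunks}

Example output:
term-specific rewrite of the question
broader query dropping one constraint
synonym-based variant"""

def refine_queries(sub_q: str, chunks: list[str]) -> list[str]:
    prompt = REFINE_TEMPLATE.format(sub_q=sub_q, chunks="\n---\n".join(chunks))
    raw = llm.complete(prompt)
    # Keep at most 3 non-empty lines; drop anything that echoes the original.
    queries = [q.strip() for q in raw.splitlines() if q.strip()]
    return [q for q in queries if q.lower() != sub_q.lower()][:3]
```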
Citation discipline
Every claim in the final answer must point to a chunk id from the retrieval set. Two reasons:
- Auditability — see audit trails.
- Hallucination guardrail — a fabricated source cannot survive verification against the retrieval set.
Implement as a post-processing pass: extract claim/citation pairs, verify each citation exists in the retrieval set, drop unverifiable claims with a warning.
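A sketch of that pass, assuming the synthesiser tags claims with bracketed chunk ids like `[chunk:abc123]` (the tag format is an assumption, not a standard):

```python
import re

# Citation verification sketch. Assumes the final answer tags claims
# with bracketed chunk ids, e.g. "Latency fell 40% [chunk:abc123]."
CITATION = re.compile(r"\[chunk:([\w-]+)\]")

def verify_citations(answer: str, retrieved_ids: set[str]) -> tuple[str, list[str]]:
    kept, warnings = [], []
    for sentence in answer.split(". "):  # crude sentence split, fine for a sketch
        cited = CITATION.findall(sentence)
        if cited and all(cid in retrieved_ids for cid in cited):
            kept.append(sentence)        # every citation checks out
        elif cited:
            warnings.append(f"dropped claim with unknown citation: {sentence!r}")
        else:
            warnings.append(f"dropped uncited claim: {sentence!r}")
    return ". ".join(kept), warnings
```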
Cost reality
A single agentic RAG turn typically costs 5–20x a classical RAG turn. Keeping that multiplier in check comes down to:
- Caching retrievals — same query in the same window returns from cache (sketched below). See MCP call caching.
- Routing by complexity — only hard questions enter the full loop. See agent model routing.
- Tight budgets — fail fast on long-running loops.
Without these, agentic RAG quietly becomes the most expensive line in your bill.
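The caching piece in sketch form — key on the normalised query plus a time window, so repeated refinements within one session never hit the index twice (names are illustrative):

```python
import hashlib
import time

# Retrieval cache sketch: the same query within the same window returns
# the cached chunks instead of hitting the index again.
_CACHE: dict[str, tuple[float, list]] = {}
WINDOW_SECONDS = 300  # illustrative: roughly one agent session

def cached_retrieve(query: str, k: int = 5):
    key = hashlib.sha256(f"{query.strip().lower()}|{k}".encode()).hexdigest()
    now = time.time()
    hit = _CACHE.get(key)
    if hit and now - hit[0] < WINDOW_SECONDS:
        return hit[1]                      # cache hit: zero retrieval cost
    chunks = retrieve(query, k=k)          # `retrieve` is your real backend call
    _CACHE[key] = (now, chunks)
    return chunks
```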
Failure modes
- Query collapse — refinements all return the same chunks. Fix: enforce diversity in the refinement prompt.
- Citation drift — the model cites a chunk that does not say what it claims. Fix: post-pass verification.
- Budget overrun — agent loops until killed. Fix: strict step + token budget.
- Multi-tenant leak — a query slips past the tenant filter. Fix: enforce in the retriever, not the prompt.
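The tenant fix deserves a sketch: bind the tenant id from the authenticated session at construction time, below the agent, so no prompt content can widen the scope (names are hypothetical):

```python
# Tenant isolation sketch: the tenant id is fixed when the retriever is
# built from the authenticated session, never taken from model output.
class TenantScopedRetriever:
    def __init__(self, index, tenant_id: str):
        self._index = index
        self._tenant_id = tenant_id  # from auth, not from the prompt

    def search(self, query: str, k: int = 5):
        # The namespace filter is applied here, below the agent, so a
        # prompt-injected "search all tenants" cannot bypass it.
        return self._index.hybrid_search(query, k=k, namespace=self._tenant_id)
```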
Where this is heading
Two shifts expected: native agentic RAG primitives in the Claude Agent SDK (planner+retriever+synthesiser as one block), and standardised retrieval interfaces in MCP itself (resources/search instead of bespoke tools). Both raise the floor; the patterns above will still apply.