Overview

RAG and memory systems are how AI agents work with knowledge that exceeds their context window. Done well: agents give accurate, grounded answers. Done poorly: context overflow, hallucination from stale retrieval, and performance degradation.

This skill covers the design principles and failure modes of RAG and memory architectures for production AI systems.

When to Use

Building any AI system that needs to access external knowledge
When agent context windows are being exceeded
When agents need to remember information across sessions
When building Q&A, document analysis, or knowledge base systems

Process

Step 1: Choose the Right Memory Architecture

Identify what the agent needs to remember:
- Ephemeral: Within a single session (use in-context memory)
- Session-persistent: Across a user's sessions (use external key-value store)
- Knowledge base: Organizational or domain knowledge (use vector DB + RAG)
- Procedural: How to do tasks (encode in SKILL.md / system prompt)
Match the memory type to the store:

Memory Type	Recommended Store
In-session facts	Context window (summarized)
User preferences	Key-value store (Redis, DynamoDB)
Document corpus	Vector database (Pinecone, Weaviate, pgvector)
Long-term facts	Structured DB + caching

Verify: Each type of information the agent needs has a defined storage mechanism.

Step 2: Design the RAG Pipeline

Chunking strategy: Break documents into chunks at semantic boundaries (paragraphs, sections) — not arbitrary character counts.
Embedding model: Match the embedding model to your query type. Use the same model for indexing and retrieval.
Retrieval: Retrieve top-K most semantically similar chunks. K = 3–7 is usually optimal.
Re-ranking: After retrieval, re-rank by relevance using a cross-encoder. Top K becomes top 3–5 for the prompt.
Context injection: Inject retrieved chunks into the prompt with clear source citations.

Verify: Retrieved chunks are genuinely relevant to the query before injecting into context.

Step 3: Prevent Context Bloat

Summarize, don't accumulate: For long sessions, summarize previous turns rather than appending them indefinitely.
Retrieve, don't pre-load: Only load context relevant to the current query. Don't pre-load everything.
Set context budgets: Define maximum token allocations for: system prompt, retrieved context, conversation history, user message.
Compress before injecting: Summarize long retrieved documents to extract the relevant portion only.

Verify: Total prompt length is within model limits with buffer. Retrieved context is relevant to current query.

Step 4: Handle Retrieval Failures Gracefully

If retrieval returns no relevant results: say so — do not hallucinate an answer.
If retrieved documents are outdated: surface the document date to the user.
If confidence is low: present the retrieved source and let the user evaluate.
Design for "no relevant information found" as a first-class outcome.

Verify: System has defined behavior for failed/empty retrieval.

Step 5: Measure and Optimize

Track retrieval quality:
- Precision: Are retrieved chunks relevant to the query?
- Recall: Are relevant chunks being retrieved at all?
Track answer quality: Use RAGAS or similar evaluation framework.
Monitor: context length per query, retrieval latency, hallucination rate.

Verify: Baseline metrics established. Retrieval precision > 80%.

Common Rationalizations (and Rebuttals)

Excuse	Rebuttal
"Let's just put everything in the context"	Context bloat degrades quality and costs money. Retrieve what's needed.
"The model knows this from training"	Training knowledge is stale. Use RAG for current information.
"Vector search is good enough without re-ranking"	Re-ranking improves precision significantly. It's a small cost for large quality gain.
"We'll fix retrieval quality later"	Poor retrieval quality compounds into poor answer quality. Fix it now.

Red Flags

Entire document corpus pre-loaded into every prompt
Retrieval returning chunks from unrelated documents
No defined behavior for empty retrieval results
Context window regularly at 90%+ capacity
Agent answering from "training knowledge" instead of retrieved documents
No source citations for retrieved information

Verification

Memory architecture matches the type of information needed
RAG pipeline: chunk → embed → retrieve → re-rank → inject
Context budgets defined for all prompt sections
Empty retrieval has a defined graceful fallback
Retrieval precision measured and > 80%
Source citations included in AI responses

rag-and-memory

How to add

Drop this on your repo README

Related skills

template-skill

slack-gif-creator

baoyu-compress-image

zzz-one-dragon-player

Get new Outros skills every Monday