The core idea
Retrieval-augmented generation (RAG) separates knowledge from reasoning. The LLM handles language; a vector store handles lookup. Updating the indexed documents keeps answers current without retraining the model.
The retrieval layer
- Embed documents at ingest (OpenAI, Cohere, or local via llama.cpp)
- Store vectors in Pinecone, pgvector, or Cloudflare Vectorize
- At query time: embed the question with the same model used at ingest → nearest-neighbour search → inject the top-k chunks into the prompt
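
The three steps above can be sketched end to end in plain Python. This is a minimal illustration, not a production pipeline: `embed` is a toy word-count stand-in for a real embedding model (OpenAI, Cohere, or llama.cpp), the in-memory list stands in for a vector store, and all function names and the prompt wording are assumptions made for this example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks):
    """Ingest step: embed every chunk once and keep (chunk, vector) pairs."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, question, k=2):
    """Query step: embed the question, rank chunks by similarity, keep top-k."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question, chunks):
    """Inject the retrieved chunks into the prompt ahead of the question."""
    context = "\n".join(f"- {c}" for c in chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


docs = [
    "Refunds are processed within 14 days of the return.",
    "Our office is closed on public holidays.",
    "Shipping is free for orders over 50 euros.",
]
question = "How long do refunds take?"
top_chunks = retrieve(build_index(docs), question, k=1)
print(build_prompt(question, top_chunks))
```

Swapping in a real embedding model and a real store changes `embed` and `build_index`; the query path keeps the same shape.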
When RAG underperforms
- Source documents are low quality or contradictory
- Retrieval granularity is wrong (whole pages vs paragraphs)
- The model ignores or contradicts the retrieved context (retrieval reduces hallucination but does not eliminate it)
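
The granularity point is the easiest of these to act on in code. As a rough sketch, a page can be split on blank lines and its paragraphs packed into chunks up to a size limit. The function name `chunk_paragraphs` and the `max_chars` default are assumptions for illustration; a real pipeline would more likely count tokens than characters.

```python
import re


def chunk_paragraphs(page, max_chars=500):
    """Split a page on blank lines, then pack paragraphs into chunks.

    A chunk grows until adding the next paragraph would exceed max_chars.
    A single paragraph longer than max_chars is kept whole, not cut.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", page) if p.strip()]
    chunks = []
    current = ""
    for p in paragraphs:
        candidate = f"{current}\n\n{p}" if current else p
        if not current or len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = p
    if current:
        chunks.append(current)
    return chunks


page = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
for chunk in chunk_paragraphs(page, max_chars=40):
    print(repr(chunk))
```

Lowering `max_chars` moves retrieval toward paragraph-level chunks; raising it moves toward whole pages.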
Takeaway
RAG is not magic. It moves the quality problem from the model to your retrieval pipeline. Invest in chunking strategy and retrieval evaluation before scaling.
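
One way to start on eval is recall@k over a small hand-labelled set of questions. The sketch below assumes each question is paired with the ids of the chunks that should be retrieved for it, and that `retrieve_fn` returns a ranked list of chunk ids; both names and the data shape are hypothetical, chosen for this example.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def mean_recall_at_k(eval_set, retrieve_fn, k=3):
    """Average recall@k over (question, relevant_ids) pairs."""
    scores = [recall_at_k(retrieve_fn(q), rel, k) for q, rel in eval_set]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical labelled set and a fake retriever, for illustration only.
eval_set = [
    ("How long do refunds take?", {"refunds-1"}),
    ("Is shipping free?", {"shipping-1", "shipping-2"}),
]


def fake_retrieve(question):
    return ["refunds-1", "shipping-1", "hours-1"]


print(mean_recall_at_k(eval_set, fake_retrieve, k=2))
```

Re-running the same metric after each chunking change shows whether retrieval improved, before any LLM call is involved.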