Interactive Guide to RAG Techniques: From Vanilla to Agentic

Arnab Mondal · 45 min read

Overview

Ever wondered how ChatGPT-like systems can answer questions about your own documents? The answer is RAG (Retrieval-Augmented Generation). In this interactive guide, you won't just read about RAG—you'll experience it firsthand by chatting with different RAG implementations, all powered by my blog posts and portfolio content. We'll work through the major RAG techniques, from the simplest to the most agentic.

What is RAG?

RAG (Retrieval-Augmented Generation) is a game-changing pattern that solves one of the biggest limitations of large language models (LLMs): they don't know about your private data.

Think about it—ChatGPT was trained on internet data up to a certain date. It doesn't know about:

  • Your company's internal documents
  • Recent blog posts you've written
  • The specific codebase of your project

RAG bridges this gap by retrieving relevant context from your own data and feeding it to the LLM along with the user's question.

The RAG Pipeline (High-Level)

[Diagram: the RAG pipeline. Ingestion pipeline: Documents (blog posts, pages) → Chunking (smaller pieces) → Embedding model (text to vectors) → Vector database (Pinecone, e.g. [0.2, 0.8, ...]). Query pipeline: User query ("What projects...") → Embed query (same model) → Similarity search (find top-k chunks) → Augment prompt (add context) → LLM generation (Gemini 2.5 Flash) → informed response.]

What is a Vector Database?

Before diving into RAG techniques, let's understand the backbone of any RAG system: the vector database.

Traditional databases are great at exact matches:

  • SELECT * FROM posts WHERE title = "React hooks"
  • Find user with email = "arnab@example.com"

But what if someone asks: "What did Arnab write about frontend frameworks?"

The word "React" might not appear anywhere, but your content about React, Vue, or Next.js is absolutely relevant. This is where comes in.

[Diagram: keyword search vs. semantic search. Keyword search: the query "frontend frameworks" runs LIKE '%...%' matches against a SQL database and finds nothing, because the exact phrase never appears (even though you have posts about React, Vue, and Next.js). Semantic search: the same query is embedded and compared against document vectors (cosine similarity), returning relevant results such as "React Hooks", "Vue Tutorial", and "Next.js 15"—no exact match needed.]

Embeddings: Words as Numbers

Embeddings convert text into numerical vectors (arrays of numbers). The magic? Similar meanings produce similar vectors.

```ts
// Example embeddings (simplified to 3D for visualization)
"I love programming"   → [0.8, 0.2, 0.5]
"Coding is my passion" → [0.75, 0.25, 0.48] // Very similar!
"I hate vegetables"    → [0.1, 0.9, 0.3]    // Very different!
```

In reality, embeddings have 768 to 3072 dimensions, capturing nuanced semantic relationships.

Vector Search: Finding Similar Meanings

A vector database like Pinecone stores these embeddings and enables lightning-fast similarity searches using metrics like:

  • Cosine similarity: Measures the angle between vectors
  • Euclidean distance: Measures straight-line distance
  • Dot product: Combines magnitude and direction

When you query "frontend frameworks", the database finds vectors closest to your query's embedding—even if the stored documents say "React", "Vue", or "Next.js".
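To make the comparison concrete, here is a minimal cosine similarity implementation in plain TypeScript, using the simplified 3D embeddings from above:

```ts
// Minimal sketch: cosine similarity between two embedding vectors.
// Real vector databases use approximate nearest-neighbour indexes,
// but the underlying comparison is this same idea.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([0.8, 0.2, 0.5], [0.75, 0.25, 0.48])); // ≈ 0.998 (very similar)
console.log(cosineSimilarity([0.8, 0.2, 0.5], [0.1, 0.9, 0.3]));    // ≈ 0.45  (much less similar)
```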


Naive RAG (Vanilla RAG)

Now let's dive into the first and simplest RAG technique: Naive RAG (also called Vanilla RAG). In simple terms, Naive RAG is like an open-book exam. When you ask a question, the system searches your documents for the most relevant parts. It then gives those parts to the AI as "cheat sheets" so it can answer based on your specific data instead of just guessing.

How Vanilla RAG Works

[Diagram: Vanilla RAG flow. User query ("hackathons...") → (1) embed the query (text to vector) → (2) similarity search in Pinecone (find top-k chunks) → (3) augment the prompt (stuff context + query into the prompt) → (4) LLM generation with Gemini 2.5 Flash → informed response.]

The Algorithm

```ts
// Pseudocode: Naive RAG
async function naiveRAG(userQuery) {
  // 1. EMBED QUERY
  // Convert the user's question into a mathematical vector
  const queryEmbedding = await embeddingModel.embed(userQuery);

  // 2. RETRIEVE
  // Find the top 'k' most similar pieces of content from our database
  const relevantChunks = await vectorDatabase.search(queryEmbedding, {
    limit: 5,
  });

  // 3. AUGMENT
  // Combine the retrieved content into a single context block
  const context = relevantChunks.map((chunk) => chunk.text).join('\n');

  // Create a prompt that includes both the context and the question
  const prompt = buildPrompt(userQuery, context);

  // 4. GENERATE
  // Feed everything to the LLM to get the final answer
  return await llm.generate(prompt);
}
```

view the full source code here
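The buildPrompt helper isn't shown in the pseudocode; a minimal sketch of what such a template might look like (the demo's actual wording may differ):

```ts
// Hypothetical helper: stuff retrieved context and the question into one prompt.
// The exact template used in the demo may differ.
function buildPrompt(userQuery: string, context: string): string {
  return [
    'You are a helpful assistant. Answer the question using ONLY the context below.',
    'If the context does not contain the answer, say you do not know.',
    '',
    'Context:',
    context,
    '',
    `Question: ${userQuery}`,
  ].join('\n');
}
```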

Pros and Cons

| Pros | Cons |
| --- | --- |
| Simple to implement | No query understanding |
| Fast retrieval | Fixed chunk boundaries |
| Works well for straightforward questions | May miss relevant context |
| Low computational overhead | Struggles with specific terms and jargon |
| Great starting baseline | No semantic ranking |

When to Use Vanilla RAG

  • Prototyping: Quick proof-of-concept
  • Simple Q&A: Direct questions with clear answers in documents
  • Resource constraints: When you need minimal latency and cost
  • Baseline comparison: Before trying advanced techniques

Try Vanilla RAG Yourself

Vanilla RAG Chat

Ask anything about Arnab's portfolio


Powered by Gemini 2.5 Flash + Pinecone DB. Searches across blog posts and portfolio content.


Hybrid RAG

Now let's explore a more powerful technique: Hybrid RAG. This approach addresses a fundamental limitation of Vanilla RAG by combining the best of both worlds—semantic understanding AND exact keyword matching.

The Problem with Vanilla RAG

Remember how Vanilla RAG uses only dense embeddings for retrieval? While semantic search is great at understanding meaning, it has a critical weakness:

It struggles with specific terms, names, and technical jargon.

Consider this query: "What projects use Next.js 14?"

  • Semantic search might find content about "modern React frameworks" or "server-side rendering"—semantically related, but missing the exact version.
  • A document mentioning "Utilizes Next.js 14 app router" might rank lower than "Introduction to React frameworks" because the embedding model doesn't weight "14" as heavily as the semantic meaning.

This is where Hybrid RAG shines.

What is Hybrid RAG?

Hybrid RAG combines dense retrieval (semantic embeddings) with sparse retrieval (keyword matching like BM25) to overcome the limitations of using either approach alone.

Think of it like having two search experts:

  • Expert A (Dense): Understands the meaning and intent behind your question
  • Expert B (Sparse): Excellent at finding exact matches for specific terms

Hybrid RAG asks both experts and intelligently combines their answers.

Want to understand how dense retrieval works under the hood? Check out my deep dive on Dense Passage Retrieval (DPR).

How Hybrid RAG Works

[Diagram: Hybrid RAG flow. User query ("React hooks...") → (1a) dense retrieval (vector similarity, semantic meaning) and (1b) sparse retrieval (BM25/TF-IDF keyword matching) run in parallel, each returning top-k results with scores → (2) score fusion (RRF or weighted) combines and ranks them → (3) LLM generation gets the best of both worlds → informed response.]

The Algorithm

To combine the results, Hybrid RAG typically uses Reciprocal Rank Fusion (RRF). This algorithm, detailed in the paper Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods, works by ranking documents based on their position in the search results rather than their raw confidence scores. This ensures a fair comparison between the disparate scoring scales of dense and sparse retrieval.

```ts
// Pseudocode: Hybrid RAG
async function hybridRAG(userQuery) {
  // 1. PARALLEL RETRIEVAL
  // Run two different search strategies simultaneously
  const [denseResults, sparseResults] = await Promise.all([
    // Strategy A: Understands meaning (Vector Search)
    vectorDB.search(userQuery),
    // Strategy B: Matches exact keywords (BM25)
    keywordDB.search(userQuery),
  ]);

  // 2. FUSION (Reciprocal Rank Fusion)
  // Combine results by respecting the ranking order of each system
  // This evens out the playing field between different scoring methods
  const fusedResults = reciprocalRankFusion(denseResults, sparseResults);

  // 3. GENERATE
  // Use the best results from the combined list to answer
  const topResults = fusedResults.slice(0, 5);
  return await llm.generateWithContext(userQuery, topResults);
}

// Helper: RRF Ranking
// Simply put: "If a document is ranked high in BOTH lists, it wins."
function reciprocalRankFusion(listA, listB) {
  // ...implementation details...
  return combinedRankedList;
}
```

view the full source code here
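The reciprocalRankFusion helper is elided above, so here is a minimal sketch of the standard formula (each document scores the sum of 1 / (k + rank) across lists, with k ≈ 60 as in the original paper). The RankedDoc shape and its id field are assumptions about how results are represented:

```ts
// Minimal sketch of Reciprocal Rank Fusion.
// Each input list is assumed to be ordered best-first, with a stable `id` per document.
interface RankedDoc {
  id: string;
  text: string;
}

function reciprocalRankFusion(...lists: RankedDoc[][]): RankedDoc[] {
  const k = 60; // smoothing constant from the original RRF paper
  const scores = new Map<string, { doc: RankedDoc; score: number }>();

  for (const list of lists) {
    list.forEach((doc, index) => {
      const entry = scores.get(doc.id) ?? { doc, score: 0 };
      entry.score += 1 / (k + index + 1); // documents ranked high in several lists accumulate more score
      scores.set(doc.id, entry);
    });
  }

  return [...scores.values()]
    .sort((a, b) => b.score - a.score)
    .map((entry) => entry.doc);
}
```

The same helper generalizes to more than two lists, which is exactly how it gets reused later for Multi-Query RAG.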

Pros and Cons

| Pros | Cons |
| --- | --- |
| Best of both semantic and keyword search | More complex infrastructure |
| Handles specific terms and jargon well | Two search indices to maintain |
| More robust to vocabulary mismatch | Slightly higher latency (parallelism helps) |
| Better for diverse query types | Fusion algorithm tuning required |
| Industry-proven approach | Higher resource usage |

When to Use Hybrid RAG

  • Technical documentation: Exact API names, version numbers, function names matter
  • Legal/medical content: Precise terminology is critical
  • E-commerce: Product codes, model numbers, brand names
  • Enterprise search: Internal jargon, project names, acronyms
  • When Vanilla RAG isn't cutting it: Upgrade path when semantic-only fails

Try Hybrid RAG Yourself

Hybrid RAG Chat

Hybrid RAG: Semantic + Keyword Search

Great for specific terms, versions, and technical queries

Combines dense (semantic) + sparse (BM25) search. Click the settings icon to adjust the balance between keyword precision and semantic understanding.


Re-ranking RAG

Now let's level up with Re-ranking RAG—a technique that adds a precision layer on top of your retrieval. This is often the single biggest improvement you can make to RAG quality without changing your data or embeddings.

The Problem with Single-Stage Retrieval

Both Vanilla and Hybrid RAG use bi-encoders for retrieval—models that encode queries and documents independently:

```ts
// Bi-Encoder approach: Independent encoding
const queryVector = await embed(query);    // Query → [0.1, 0.2, ...]
const docVector = await embed(document);   // Doc   → [0.3, 0.4, ...]
const score = cosineSimilarity(queryVector, docVector); // Compare vectors
```

The fundamental limitation: There's no direct interaction between query and document during encoding. The model can't understand nuanced relationships like:

  • "What Arnab did NOT do" vs "What Arnab did"
  • "AWS certification" vs "uses AWS for deployment"
  • "internship at AI company" vs "worked on AI projects"

This leads to false positives—documents that are semantically similar but not actually relevant to the specific question.

What is Re-ranking RAG?

Re-ranking RAG adds a second-stage precision filter using a cross-encoder:

In simple words, it's like Googling something and then carefully reading the top 50 results to pick the absolute best 5 answers, instead of just trusting the search engine's top 5 blindly. First, you retrieve a lot of results fast (Retrieval), then you re-score them carefully (Re-ranking).

| Stage | Model Type | Speed | Precision | Purpose |
| --- | --- | --- | --- | --- |
| Stage 1 | Bi-Encoder | ⚡ Fast | Medium | Cast wide net (high recall) |
| Stage 2 | Cross-Encoder | 🐢 Slower | High | Precision filtering |

The key insight: Cross-encoders process query and document together:

```ts
// Cross-Encoder approach: Joint encoding
// [CLS] query tokens [SEP] document tokens [SEP] → Transformer → relevance score
const score = await crossEncoder.score(query, document); // 0.0 to 1.0
```

  • [CLS] (Classification): Added at the very beginning. The model uses the final state of this token to represent the entire input pair (Query + Document) and predict the relevance score.
  • [SEP] (Separator): Added between the Query and the Document to tell the model where one ends and the other begins (and also at the end).

With full cross-attention between query and document tokens, cross-encoders capture semantic relationships, negations, and context that bi-encoders miss.

How Re-ranking RAG Works

[Diagram: Re-ranking RAG flow. Stage 1 (recall, fast): the user query ("AWS certs?") goes through a bi-encoder vector search—embed(query) and embed(doc) independently—producing a top-K candidate set (20-50) with high recall. Stage 2 (precision, accurate): a cross-encoder re-ranks candidates using joint attention over [CLS] query [SEP] document [SEP], keeping the top-N (5-10) best matches. Stage 3: the LLM generates from this precise context for a better answer.]

The Algorithm

```ts
// Pseudocode: Re-ranking RAG
async function rerankRAG(userQuery) {
  // 1. INITIAL RETRIEVAL (Recall Phase)
  // Cast a wide net to get many potential candidates (e.g., top 50)
  // Fast, but might include irrelevant results
  const candidates = await vectorDB.search(userQuery, { k: 50 });

  // 2. RE-RANKING (Precision Phase)
  // Use a smarter, slower model to score each candidate against the query
  // "How relevant is this SPECIFIC document to this SPECIFIC question?"
  const rerankedResults = await crossEncoder.rerank({
    query: userQuery,
    documents: candidates,
    top: 5,
  });

  // 3. GENERATE
  // Feed only the high-quality, re-ranked results to the LLM
  return await llm.generateWithContext(userQuery, rerankedResults);
}
```

view the full source code here

Which Re-ranker Model?

This implementation uses Cohere's re-ranker, chosen for:

  • Low latency (~100-200ms) suitable for real-time applications
  • High quality comparable to larger models
  • Multilingual support for diverse content
  • 128K context handles long documents
  • Free tier (1000 calls/month) for experimentation

Other options include Cohere's rerank-v4.0-pro (higher quality, higher latency), Jina AI Re-ranker, Voyage AI rerank-2, or self-hosted models like cross-encoder/ms-marco-MiniLM-L-6-v2.

Pros and Cons

| Pros | Cons |
| --- | --- |
| Significantly higher precision | Additional latency (100-300ms) |
| Understands negations and context | Extra API cost per query |
| Works on top of existing retrieval | Limited by candidate quality |
| No re-indexing required | More complex pipeline |
| Often the biggest single improvement | Diminishing returns past ~50 candidates |

Advanced Optimizations

The implementation in this demo includes several production-ready optimizations:

Score Fusion: Combines the original retrieval score with the re-rank score for nuanced ranking:

```ts
// Weighted combination of bi-encoder and cross-encoder scores
const finalScore = 0.7 * rerankScore + 0.3 * originalScore;
```

Adaptive Re-ranking: Skips re-ranking when the initial retrieval confidence is already very high (e.g. top result score > 0.95 with clear separation).
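As a rough illustration of that adaptive check, here is a sketch where the 0.95 threshold and 0.1 separation margin are stand-in values rather than the demo's exact numbers:

```ts
// Sketch: skip the cross-encoder when initial retrieval already looks confident.
// Thresholds are illustrative, not the demo's exact values.
interface Candidate {
  text: string;
  score: number; // bi-encoder similarity score (higher = more similar)
}

// Assumed re-ranker interface, mirroring the pseudocode above.
declare const crossEncoder: {
  rerank(args: { query: string; documents: Candidate[]; top: number }): Promise<Candidate[]>;
};

function shouldSkipReranking(candidates: Candidate[]): boolean {
  if (candidates.length < 2) return true;
  const [top, second] = candidates;
  return top.score > 0.95 && top.score - second.score > 0.1; // confident AND clearly separated
}

async function adaptiveRerank(query: string, candidates: Candidate[]): Promise<Candidate[]> {
  if (shouldSkipReranking(candidates)) {
    return candidates.slice(0, 5); // already confident: trust the initial ranking, save a re-rank call
  }
  return crossEncoder.rerank({ query, documents: candidates, top: 5 });
}
```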

Query-Aware Over-retrieval: Retrieves more candidates for complex/vague queries and fewer for specific ones to balance recall and latency.

Other Potential Optimizations

While not implemented in this specific demo, production systems often include:

Deduplication: Removing near-duplicate chunks before re-ranking to maximize candidate diversity.

Caching: Caching re-ranking results for repeated queries to reduce API costs and latency.

When to Use Re-ranking RAG

  • Precision-critical applications: When wrong answers are costly
  • Complex queries: Multi-faceted questions requiring nuanced understanding
  • After optimizing retrieval: When you've maxed out embedding/chunking improvements
  • Technical content: Where subtle differences in wording matter greatly
  • Production RAG systems: The standard for quality-focused deployments

Try Re-ranking RAG Yourself

Re-ranking RAG Chat

Re-ranking RAG: Two-Stage Precision Retrieval

Cross-encoder re-ranking for higher precision answers

Two-stage retrieval: Click the settings icon to tune candidate count and final results. Uses Cohere cross-encoder for precision re-ranking.


Multi-Query RAG

Now let's explore Multi-Query RAG—a technique that dramatically improves recall by reformulating your question into multiple diverse queries. This addresses a fundamental limitation: a single query often misses relevant documents due to vocabulary mismatch.

The Problem with Single-Query Retrieval

Consider a user asking: "What machine learning projects has Arnab worked on?"

Even with the best embeddings, a single query might miss:

  • Documents using "AI" instead of "machine learning"
  • Content about "neural networks" or "deep learning"
  • Projects mentioning "TensorFlow" or "PyTorch" without saying "ML"
  • Experience with "NLP" or "computer vision"

The embedding model does its best, but semantic similarity has limits—especially with technical jargon, acronyms, and domain-specific terms.

What is Multi-Query RAG?

Multi-Query RAG uses an LLM to generate alternative phrasings of the user's question, then retrieves documents for all variations in parallel. The results are combined using Reciprocal Rank Fusion (RRF) to create a comprehensive, diverse result set.

Think of it like asking the same question in multiple ways to different search engines, then combining the best answers from each.
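A minimal sketch of the expansion step. The prompt wording and the llm.complete client are assumptions, not the demo's exact implementation:

```ts
// Sketch: ask an LLM for alternative phrasings of the user's question.
// `llm.complete` is a placeholder for whatever chat/completion client you use.
declare const llm: { complete(prompt: string): Promise<string> };

async function generateQueryVariations(originalQuery: string, n = 4): Promise<string[]> {
  const prompt = [
    `Generate ${n} alternative phrasings of the question below.`,
    'Use different vocabulary and synonyms so documents with other terminology can be found.',
    'Return one phrasing per line, with no numbering.',
    '',
    `Question: ${originalQuery}`,
  ].join('\n');

  const response = await llm.complete(prompt);
  const variations = response
    .split('\n')
    .map((line) => line.trim())
    .filter(Boolean);

  // Always keep the original query as well
  return [originalQuery, ...variations.slice(0, n)];
}
```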

How Multi-Query RAG Works

[Diagram: Multi-Query RAG flow. Stage 1 (query expansion): the user query ("ML projects?") is sent to an LLM query generator that produces 4-5 variations (e.g. "ML projects", "AI experience", "deep learning", "neural nets"). Stage 2 (parallel retrieval): each variation is embedded and searched independently (k=5 each). Stage 3 (fusion and generation): Reciprocal Rank Fusion (1/(k + rank)) deduplicates and sums scores, so documents "voted" by multiple queries rank higher; the top-N fused results are passed to the LLM for generation with context.]

The Algorithm

```ts
// Pseudocode: Multi-Query RAG
async function multiQueryRAG(originalQuery) {
  // 1. QUERY EXPANSION
  // Ask an LLM to generate 4-5 different ways to ask the same question
  // This helps catch documents that use different terminology
  const queryVariations = await llm.generateVariations(originalQuery);

  // 2. PARALLEL RETRIEVAL
  // Search for ALL variations at once
  const allResults = await Promise.all(queryVariations.map((query) => vectorDB.search(query)));

  // 3. FUSION
  // Combine results. Documents that appear in multiple searches get a higher score.
  // (We use Reciprocal Rank Fusion here)
  const fusedResults = reciprocalRankFusion(allResults);

  // 4. GENERATE
  // Answer using the diverse, deduplicated set of documents
  return await llm.generateWithContext(originalQuery, fusedResults);
}
```

view the full source code here

Pros and Cons

| Pros | Cons |
| --- | --- |
| Significantly higher recall | LLM call adds latency (~100-300ms) |
| Handles vocabulary mismatch | Multiplies embedding calls (4-5x) |
| Documents "voted" by multiple queries rank higher | More API costs per query |
| Captures different perspectives of the question | Query generation quality matters |
| Works with any base retrieval method | Diminishing returns past 5-6 variations |

Advanced Optimizations

The implementation in this demo includes several production-ready optimizations:

Deduplication by Similarity: Removes near-duplicate chunks that appear across different query results. Adjacent chunks from the same document are merged to maximize diversity:

```ts
// Skip chunks that are adjacent in the same document
if (existing.slug === candidate.slug && Math.abs(existing.chunkIndex - candidate.chunkIndex) <= 1) {
  // Likely overlapping content - skip
}
```

Weighted Score Fusion: An alternative to pure RRF that considers the original similarity scores. Useful when you trust the embedding quality:

```ts
// Weight original scores instead of just ranks
const finalScore = weight * originalScore;
```

Hybrid Base Retrieval: Can use either vanilla (dense-only) or hybrid (dense + sparse) retrieval as the base retriever for each query variation, combining the benefits of both approaches.

Re-ranking After Fusion: Optionally apply cross-encoder re-ranking to the fused results. This combines the high recall of multi-query with the precision of re-ranking, and is especially powerful when fused results contain many "voted" documents that need precision ranking.

Other Potential Optimizations

While not implemented in this specific demo, production systems often include:

Query Caching: Cache generated query variations for repeated or similar queries to reduce LLM calls.

Adaptive Query Count: Generate fewer variations for specific queries and more for vague or exploratory ones.

Query Quality Filtering: Score generated queries and drop low-quality variations before retrieval.

When to Use Multi-Query RAG

  • Exploratory queries: Broad questions that could match many document types
  • Domain-specific content: Technical jargon where synonyms matter
  • Comprehensive answers: When you need to cover all angles of a topic
  • Recall-critical applications: When missing relevant documents is costly
  • Before re-ranking: Generate diverse candidates, then use cross-encoder to precision-rank

Try Multi-Query RAG Yourself

Multi-Query RAG Chat

Multi-Query RAG: Diverse Query Expansion

LLM generates query variations for comprehensive coverage

Click the settings icon to configure query variations, retrieval method, and re-ranking. Results are fused with Reciprocal Rank Fusion (RRF).


Parent-Child RAG (Hierarchical Chunking)

Now let's explore Parent-Child RAG—a technique that solves the fundamental tension between retrieval precision and generation quality by using two levels of chunks: small chunks for searching, large chunks for answering.

The Problem with Fixed Chunk Sizes

Consider how we chunk documents in standard RAG:

| Chunk Size | Retrieval Quality | Generation Quality |
| --- | --- | --- |
| Small (200-300 tokens) | High precision - focused embeddings | Poor - lost context, fragmented info |
| Large (1000+ tokens) | Low precision - diluted embeddings | Good - rich context for LLM |

The dilemma: Small chunks give precise retrieval but poor context. Large chunks give rich context but imprecise retrieval. You can't win with a single chunk size.

What is Parent-Child RAG?

Parent-Child RAG uses two chunk sizes to get the best of both worlds:

In simple words, you search using small detailed snippets to find specific matches, but you give the AI a big, complete chunk of text (the parent) so it understands the full context and can answer correctly.

| Level | Size | Purpose | Storage |
| --- | --- | --- | --- |
| Child | 200-400 tokens | Retrieval (precise vector matching) | Vector index (searchable) |
| Parent | 800-1500 tokens | Generation (context for LLM) | Metadata (returned, not searched) |

The key insight: search on small, focused chunks but return large, context-rich chunks to the LLM.

How Parent-Child RAG Works

[Diagram: Parent-Child RAG. Stage 1 (ingestion, hierarchical chunking): a document (e.g. an Einstein biography) is split into parent chunks of ~1000 tokens (rich context), and each parent is split into child chunks of ~200 tokens ("Born 1879...", "1905 theory...", "Nobel 1921...", "Princeton...") tagged with a parentId. Stage 2 (query): the user query ("Einstein 1905?") is embedded and matched against the child index via cosine similarity; the matched child's parent is looked up and the full ~1000-token parent chunk is handed to the LLM. Key insight: search small, return large—child chunks give a precise match, parent chunks give the LLM rich context.]

The architecture has two distinct phases:

Ingestion Time (Steps 1-3):

  1. Split document into large parent chunks (~1000 tokens)
  2. Split each parent into small child chunks (~200 tokens)
  3. Embed only child chunks, storing parentId and parentContent in metadata

Query Time (Steps 4-5):

  4. Search child vectors, find matches, look up parent via parentId
  5. Return deduplicated parent chunks to LLM for generation
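Here is a hedged sketch of the ingestion side (steps 1-3). The splitIntoChunks and embed helpers and the upsert record shape are assumptions, not the exact pipeline code:

```ts
// Sketch: hierarchical chunking at ingestion time.
// `splitIntoChunks` and `embed` are assumed helpers, not the demo's exact API.
declare function splitIntoChunks(text: string, maxTokens: number): string[];
declare function embed(text: string): Promise<number[]>;
declare const vectorDB: { upsert(records: object[]): Promise<void> };

async function ingestDocument(docId: string, content: string) {
  const parents = splitIntoChunks(content, 1000); // ~1000-token parent chunks

  for (const [p, parentContent] of parents.entries()) {
    const parentId = `parent-${docId}-${p}`;
    const children = splitIntoChunks(parentContent, 200); // ~200-token child chunks

    const records = await Promise.all(
      children.map(async (childContent, c) => ({
        id: `child-${docId}-${p}-${c}`,
        values: await embed(childContent), // only children are embedded
        metadata: { content: childContent, parentId, parentContent },
      }))
    );

    await vectorDB.upsert(records); // only child vectors become searchable
  }
}
```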

The Algorithm

```ts
// Pseudocode: Parent-Child RAG
async function parentChildRAG(userQuery) {
  // 1. SEARCH SMALL (Precision)
  // Search against small "child" chunks (e.g., 200 words)
  // These are great for matching specific details in the query
  const childMatches = await vectorDB.search(userQuery);

  // 2. RETRIEVE LARGE (Context)
  // Instead of using the small chunk, fetch its "parent" chunk (e.g., 1000 words)
  // This provides the LLM with the full surrounding context
  const parentChunks = matchesToParents(childMatches);

  // 3. DEDUPLICATE
  // If multiple child chunks point to the same parent, only use the parent once
  const uniqueParents = unique(parentChunks);

  // 4. GENERATE
  // Answer using the rich, full-context parent chunks
  return await llm.generateWithContext(userQuery, uniqueParents);
}
```

Example: Why Parent Context Matters

Query: "Einstein 1905 theory"

Child Match (200 tokens):

```text
He published relativity in 1905...
```

Parent Returned (1000 tokens):

```text
Einstein Biography - Chapter 3

Albert Einstein was born in Ulm, Germany in 1879. After struggling
in traditional schooling, he found his passion in physics and
mathematics at the Swiss Federal Polytechnic.

In 1905, his 'Annus Mirabilis' (miracle year), Einstein published
four groundbreaking papers on the photoelectric effect, Brownian
motion, special relativity, and mass-energy equivalence.

The special relativity paper introduced E=mc² and fundamentally
changed our understanding of space and time...
```

The LLM now has full biographical context—not just the 1905 mention, but Einstein's background, education, and the significance of his discoveries.

Pros and Cons

| Pros | Cons |
| --- | --- |
| Precise retrieval with focused embeddings | Increased storage (parent content in metadata) |
| Rich context for coherent LLM responses | Slightly more complex pipeline |
| No mid-sentence cuts - parents are semantically complete | Metadata size limits (Pinecone: 40KB) |
| Reduced hallucination - more context = less guessing | Fixed parent boundaries may split related content |
| Deduplication handles multiple child matches naturally | More embedding calls during ingestion |

Advanced Optimizations

The implementation includes several production-ready optimizations:

Metadata-Based Storage: Parent content is stored directly in child chunk metadata, eliminating the need for a separate storage system or secondary lookups:

```ts
// Each child vector includes its parent in metadata
{
  id: "child-blog-post-0-2",
  vector: [...childEmbedding],
  metadata: {
    content: "...child content...",
    parentId: "parent-blog-post-0",
    parentContent: "...full parent content (~1000 tokens)...",
  }
}
```

Parent Deduplication: When multiple children from the same parent match the query, the parent is returned only once with the best child's score. This prevents context duplication and maximizes diversity.
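A minimal sketch of that deduplication step, assuming each child match carries parentId, parentContent, and a similarity score:

```ts
// Sketch: collapse multiple child matches into unique parents,
// keeping the best child's score for each parent.
interface ChildMatch {
  score: number;
  metadata: { parentId: string; parentContent: string };
}

interface ParentResult {
  parentId: string;
  content: string;
  score: number;
}

function deduplicateParents(childMatches: ChildMatch[]): ParentResult[] {
  const byParent = new Map<string, ParentResult>();

  for (const match of childMatches) {
    const { parentId, parentContent } = match.metadata;
    const existing = byParent.get(parentId);
    if (!existing || match.score > existing.score) {
      byParent.set(parentId, { parentId, content: parentContent, score: match.score });
    }
  }

  return [...byParent.values()].sort((a, b) => b.score - a.score);
}
```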

Optional Re-ranking: After parent deduplication, a cross-encoder can re-rank the parent chunks for even higher precision—combining the high recall of hierarchical retrieval with the accuracy of re-ranking.

Other Potential Optimizations

While not implemented in the current version, these optimizations could further improve performance:

Overlapping Parents: Create parents with overlap to ensure related content at boundaries isn't split awkwardly.

Dynamic Parent Sizing: Adjust parent boundaries based on content structure (e.g., respect section headers).

Multi-Level Hierarchy: For very long documents, add a grandparent level (Grandparent → Parent → Child).

When to Use Parent-Child RAG

  • Long-form content: Blog posts, documentation, technical articles
  • Context-dependent answers: When surrounding information is critical
  • Reduced hallucination: When accuracy matters more than speed
  • After vanilla RAG struggles: Natural upgrade path for better context
  • Code documentation: Where function context needs surrounding module info

Try Parent-Child RAG Yourself

Parent-Child RAG Chat

Parent-Child RAG: Hierarchical Chunking

Search small chunks, return large context-rich parents

Click the settings icon to configure child search and parent re-ranking. Best for documentation and long-form content.

The LLM can now provide a complete, actionable answer instead of a fragment.


Corrective RAG (CRAG)

Now let's explore Corrective RAG—a self-correcting technique that evaluates retrieved documents for relevance and takes corrective actions when retrieval quality is insufficient. This addresses a critical limitation: standard RAG blindly trusts whatever documents are retrieved.

The Problem with Blind Trust

Consider what happens in standard RAG when someone asks: "What is Arnab's experience with Kubernetes?"

If your vector database doesn't contain K8s-specific content, it will still return the "closest" documents—perhaps posts about Docker, cloud infrastructure, or general DevOps. Standard RAG will then confidently generate an answer based on irrelevant context, potentially fabricating K8s experience that doesn't exist.

The fundamental issue: RAG systems have no way to know when retrieval fails.

What is Corrective RAG?

Corrective RAG (CRAG) adds a relevance evaluation layer between retrieval and generation. Based on the paper Corrective Retrieval Augmented Generation (Yan et al., 2024), it introduces:

  1. Relevance Evaluation: Grade each document as CORRECT, AMBIGUOUS, or INCORRECT
  2. Confidence Assessment: Calculate overall retrieval confidence (HIGH/MEDIUM/LOW)
  3. Corrective Actions: Take different actions based on confidence level
  4. Confidence-Aware Generation: Adapt the LLM prompt based on retrieval quality

In simple words, imagine a strict teacher grading the search results. If the results are good (Correct), the teacher hands them to the student (LLM). If they are mixed (Ambiguous), the teacher filters out the bad parts. If they are completely wrong (Incorrect), the teacher tells the student to look elsewhere (web search) or admit they don't know, instead of making things up.

How Corrective RAG Works

[Diagram: Corrective RAG. Stage 1 (retrieval): the user query ("K8s exp?") goes through vector search, returning top-K docs of unknown relevance. Stage 2 (evaluation): an LLM relevance evaluator grades each document as CORRECT (relevant), AMBIGUOUS (partial), or INCORRECT (irrelevant), and overall confidence is assessed as HIGH (use directly), MEDIUM (refine), or LOW (communicate uncertainty). Stage 3 (corrective action): USE_DIRECT passes CORRECT docs to the LLM, REFINE extracts knowledge strips, COMMUNICATE_UNCERTAINTY gives an honest "I don't know". Stage 4 (generation): the prompt adapts to the confidence level—HIGH: answer confidently, cite sources; MEDIUM: acknowledge limitations, focus on relevant parts; LOW: be honest, avoid speculation.]

The architecture has four distinct stages:

Stage 1 (Retrieval): Standard vector search retrieves documents—same as Vanilla RAG.

Stage 2 (Evaluation): An LLM-as-judge evaluates each document's relevance to the query:

  • CORRECT: Document directly answers the query
  • AMBIGUOUS: Related but insufficient
  • INCORRECT: Irrelevant to the query

Stage 3 (Corrective Action): Based on the overall confidence score:

  • HIGH → Use CORRECT documents directly
  • MEDIUM → Refine knowledge by extracting relevant strips
  • LOW → Communicate uncertainty honestly

Stage 4 (Generation): LLM generates with a confidence-aware prompt that adapts behavior based on retrieval quality.
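Before looking at the full algorithm, here is an illustrative sketch of how per-document grades might be mapped to an overall confidence level. The thresholds are assumptions, not necessarily what the paper or the demo uses:

```ts
// Sketch: map per-document grades to an overall confidence level.
// Thresholds are illustrative, not the demo's exact values.
type Grade = 'CORRECT' | 'AMBIGUOUS' | 'INCORRECT';
type Confidence = 'HIGH' | 'MEDIUM' | 'LOW';

function assessConfidence(grades: Grade[]): Confidence {
  if (grades.length === 0) return 'LOW';

  const correct = grades.filter((g) => g === 'CORRECT').length;
  const ambiguous = grades.filter((g) => g === 'AMBIGUOUS').length;

  if (correct >= 2 || correct / grades.length >= 0.5) return 'HIGH';
  if (correct >= 1 || ambiguous >= 2) return 'MEDIUM';
  return 'LOW';
}
```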

The Algorithm

```ts
// Pseudocode: Corrective RAG (CRAG)
async function correctiveRAG(userQuery) {
  // 1. RETRIEVE
  // Get initial documents just like Vanilla RAG
  const documents = await vectorDB.search(userQuery);

  // 2. EVALUATE (The "Self-Correction" Step)
  // Ask an LLM to grade each document: "Is this ACTUALLY relevant?"
  const grades = await llm.evaluateRelevance(userQuery, documents);

  // 3. DECIDE & ACT
  if (grades.confidence === 'HIGH') {
    // If results are good, just use them to answer
    return await llm.generate(userQuery, documents);
  } else if (grades.confidence === 'MEDIUM') {
    // If results are mixed, filter out the noise and keep only relevant facts
    const refinedDocs = refineKnowledge(documents, grades);
    return await llm.generate(userQuery, refinedDocs);
  } else {
    // If results are poor (LOW confidence), be honest
    // We can either say "I don't know" or trigger a web search fallback
    return "I couldn't find relevant information in the knowledge base.";
  }
}
```

view the full source code here

Pros and Cons

| Pros | Cons |
| --- | --- |
| Detects and filters irrelevant documents | Additional LLM calls for evaluation (latency) |
| Reduces hallucinations from bad retrieval | Higher token usage per query |
| Provides honest uncertainty when knowledge is lacking | Evaluation quality depends on prompt engineering |
| Confidence-aware prompts improve response quality | May be overly conservative (false negatives) |
| Works on top of any base retrieval method | More complex pipeline to maintain |

Advanced Optimizations

The implementation includes several production-ready optimizations:

Knowledge Refinement: For MEDIUM confidence, documents are decomposed into knowledge strips, each classified by type (fact, context, example, definition) and relevance score. Only relevant strips are recomposed into the final context:

```ts
// Knowledge strip structure
interface KnowledgeStrip {
  content: string;
  type: 'fact' | 'context' | 'example' | 'definition';
  relevanceScore: number;
  sourceChunkId: string;
}
```
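Building on that structure, the refinement step can be sketched roughly like this (the 0.5 relevance cutoff is an illustrative assumption, not the demo's exact threshold):

```ts
// Sketch: keep only relevant strips and recompose them into a compact context.
// Uses the KnowledgeStrip interface defined above.
function refineKnowledge(strips: KnowledgeStrip[], minRelevance = 0.5): string {
  return strips
    .filter((strip) => strip.relevanceScore >= minRelevance)
    .sort((a, b) => b.relevanceScore - a.relevanceScore)
    .map((strip) => `[${strip.type}] ${strip.content}`)
    .join('\n');
}
```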

Confidence-Aware Prompting: The system prompt adapts to retrieval quality:

  • HIGH: "Answer confidently, cite sources"
  • MEDIUM: "Acknowledge limitations, focus on relevant parts"
  • LOW: "Be honest about limitations, avoid speculation"

Quick Refine Mode: For faster processing, a simplified refinement path extracts only the relevant excerpts identified during evaluation, skipping full decomposition.

Other Potential Optimizations

While not implemented in this specific version, production systems often include:

Hybrid Evaluation: Use reranker scores for fast pre-filtering, then LLM evaluation only for ambiguous cases (saves ~50% of evaluation calls).

Evaluation Caching: Cache evaluation results for repeated queries to reduce LLM calls.

Adaptive Confidence Thresholds: Adjust HIGH/MEDIUM/LOW thresholds based on query complexity or domain.

When to Use Corrective RAG

  • High-stakes applications: When wrong answers are costly (medical, legal, financial)
  • Knowledge-gap detection: When you need to know if information exists
  • User trust: When users need to know the system's confidence level
  • Heterogeneous content: When document quality varies significantly
  • After other techniques fail: When you've optimized retrieval but still see hallucinations

Try Corrective RAG

Corrective RAG Chat

Corrective RAG: Self-Correcting Retrieval

Evaluates document relevance and adapts based on confidence

Click settings to configure retrieval and refinement. Uses LLM-as-judge to evaluate document relevance and take corrective action.


Graph RAG

Now let's explore Graph RAG—a technique that extends traditional RAG with knowledge graphs to capture entities, relationships, and multi-hop connections that flat document retrieval misses. This approach excels at understanding how concepts relate to each other.

The Problem with Flat Retrieval

Standard RAG treats documents as isolated chunks. When you ask: "What technologies did Arnab use in projects that won hackathons?"

A chunk-based approach struggles because:

  • The hackathon win is mentioned in one chunk
  • The project name might be in that chunk or another
  • The technologies used are described elsewhere

There's no explicit connection between these pieces. The embedding model hopes semantic similarity bridges the gap, but complex multi-hop queries often fail.

What is Graph RAG?

combines knowledge graph traversal with vector similarity search for comprehensive retrieval:

In simple words, standard vector search finds things that "sound similar" (like finding a book by its cover description), while Graph RAG finds things that are "connected" (like finding a book because the author also wrote your favorite novel). It connects the dots between different pieces of information that might not use the same words but are related.

| Component | What It Captures | Query Strength | Storage |
| --- | --- | --- | --- |
| Knowledge Graph | Entities, relationships, structure | Multi-hop, relational queries | Neo4j (graph database) |
| Vector Store | Semantic similarity, content | Concept matching, fuzzy search | Pinecone (vector database) |

The key insight: traverse relationships when entities are explicitly connected, fall back to vectors when semantic similarity is needed, and fuse both for comprehensive coverage.

Relationship to Existing RAG

Graph RAG in this implementation builds on and reuses components from the standard RAG pipeline:

Shared Components (from normal RAG):

  • Embedding generation for vector search
  • Vector similarity search (Pinecone)
  • Cross-encoder re-ranking for precision
  • Reciprocal Rank Fusion for combining results

Graph-Specific Additions:

  • LLM-based entity extraction from queries
  • Graph node matching and relationship traversal
  • Graph context formatting (entities and relationships) for the LLM

This modular design means Graph RAG enhances rather than replaces your existing RAG infrastructure.

How Graph RAG Works

[Diagram: Graph RAG flow. User query ("AI projects...") → (1) entity extraction with the Gemini LLM → (2a) graph lookup (Neo4j query, match nodes) and (2b) n-hop traversal to find relationships, in parallel with (2c) vector search in Pinecone for semantic matches → (3) RRF fusion plus re-ranking to merge context → (4) graph-aware LLM response.]

The architecture has four main stages:

Stage 1 (Entity Extraction): An LLM extracts entities from the user's query:

  • "What AI projects did Arnab build?"[Arnab Mondal (Person), AI (Concept)]

Stage 2a (Graph Lookup & Traversal): Query Neo4j to find matching nodes, then traverse relationships up to N hops:

```cypher
// Find Arnab and traverse to projects
MATCH (p:Person {name: "Arnab Mondal"})-[:BUILT]->(proj:Project)
WHERE (proj)-[:IMPLEMENTS]->(:Concept {name: "AI"})
RETURN proj
```

Stage 2b (Vector Search): In parallel, run standard vector similarity search for semantic matches.

Stage 3 (Fusion): Combine graph results (converted to pseudo-chunks) with vector results using Reciprocal Rank Fusion, then optionally re-rank.

Stage 4 (Generation): LLM generates with both graph context (entities, relationships) and document content.

The Algorithm

```ts
// Pseudocode: Graph RAG
async function graphRAG(userQuery) {
  // 1. ENTITY EXTRACTION
  // Use an LLM to identify entities in the query
  const entities = await llm.extractEntities(userQuery);

  // 2. PARALLEL RETRIEVAL
  const [graphContext, vectorChunks] = await Promise.all([
    // 2a. Graph: Find entities, traverse relationships
    graphDB.retrieveSubgraph(entities, { maxHops: 2 }),
    // 2b. Vector: Standard semantic search
    vectorDB.search(userQuery),
  ]);

  // 3. FUSION
  // Convert graph context to chunks, fuse with vector results
  const graphChunks = graphContextToChunks(graphContext);
  let fusedResults = reciprocalRankFusion([graphChunks, vectorChunks]);

  // Optional: Re-rank for precision
  fusedResults = await reranker.rerank(userQuery, fusedResults);

  // 4. GENERATE
  // LLM receives both relationship context and document content
  return await llm.generate(userQuery, {
    entities: graphContext.entities,
    relationships: graphContext.relationships,
    documents: fusedResults,
  });
}
```

view the full source code here
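The graphContextToChunks helper above turns entities and relationships into text "pseudo-chunks" so RRF can rank them alongside vector hits. A minimal sketch, with the entity and relationship shapes assumed rather than taken from the actual schema:

```ts
// Sketch: turn graph entities/relationships into text chunks that RRF can rank.
// The entity and relationship shapes here are assumptions about the graph context.
interface GraphEntity {
  name: string;
  type: string; // e.g. "Person", "Project", "Technology"
  description?: string;
}

interface GraphRelationship {
  source: string;
  type: string; // e.g. "BUILT", "IMPLEMENTS"
  target: string;
}

interface Chunk {
  id: string;
  text: string;
}

function graphContextToChunks(context: {
  entities: GraphEntity[];
  relationships: GraphRelationship[];
}): Chunk[] {
  const entityChunks = context.entities.map((e, i) => ({
    id: `graph-entity-${i}`,
    text: `${e.name} (${e.type})${e.description ? `: ${e.description}` : ''}`,
  }));

  const relationshipChunks = context.relationships.map((r, i) => ({
    id: `graph-rel-${i}`,
    text: `${r.source} -[${r.type}]-> ${r.target}`,
  }));

  return [...entityChunks, ...relationshipChunks];
}
```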

Pros and Cons

| Pros | Cons |
| --- | --- |
| Captures explicit relationships between entities | Requires graph database (Neo4j) infrastructure |
| Handles multi-hop queries naturally | Entity extraction adds latency (~200ms) |
| Structured entity context for LLM | Graph ingestion pipeline needed |
| Parallel graph + vector for coverage | Graph schema design requires domain knowledge |
| Reuses existing RAG components | Cold start: graph must be populated first |

Advanced Optimizations

The implementation includes several production-ready optimizations:

Configurable Traversal Depth: Limit graph expansion with maxHops parameter (default: 2) to balance coverage and performance:

```ts
// Traverse up to 2 hops from matched entities
const context = await retrieveFromGraph(query, { maxHops: 2 });
```

Parallel Retrieval: Graph and vector searches run concurrently to minimize latency:

```ts
const [graphContext, vectorChunks] = await Promise.allSettled([
  retrieveFromGraph(query, graphOptions),
  vectorRetrieval(query, topK),
]);
```

Graceful Degradation: If Neo4j is unavailable, falls back to vector-only search. If Pinecone fails, uses graph-only results.
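A rough sketch of that fallback handling, assuming the retrieveFromGraph and vectorRetrieval helpers from the snippet above:

```ts
// Sketch: degrade gracefully when either backend fails.
// `retrieveFromGraph` and `vectorRetrieval` mirror the snippet above (assumed shapes).
declare function retrieveFromGraph(query: string, options: { maxHops: number }): Promise<object>;
declare function vectorRetrieval(query: string, topK: number): Promise<object[]>;

async function retrieveWithFallback(query: string, topK = 5) {
  const [graphResult, vectorResult] = await Promise.allSettled([
    retrieveFromGraph(query, { maxHops: 2 }),
    vectorRetrieval(query, topK),
  ]);

  // If Neo4j is down, graphContext is null and we fall back to vector-only results.
  const graphContext = graphResult.status === 'fulfilled' ? graphResult.value : null;
  // If Pinecone fails, vectorChunks is empty and we rely on graph-only results.
  const vectorChunks = vectorResult.status === 'fulfilled' ? vectorResult.value : [];

  if (!graphContext && vectorChunks.length === 0) {
    throw new Error('Both graph and vector retrieval failed');
  }
  return { graphContext, vectorChunks };
}
```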

Entity Type Filtering: The graph schema supports 17 entity types (Person, Project, Technology, Company, Skill, etc.) and 50+ relationship types, enabling precise queries.

Other Potential Optimizations

While not implemented in the current version, these could further improve performance:

Entity Caching: Cache entity extraction results for repeated query patterns.

Precomputed Subgraphs: For common query types, precompute relevant subgraphs during ingestion.

Hybrid Entity Matching: Combine exact name matching with fuzzy/embedding-based entity resolution.

When to Use Graph RAG

  • Knowledge-intensive domains: Where entities and relationships are well-defined
  • Multi-hop queries: Questions requiring traversal between connected concepts
  • Portfolio/resume applications: Projects, skills, companies, and their connections
  • Technical documentation: Dependencies, APIs, and their relationships
  • After building a knowledge graph: When you have structured entity data to leverage

Try Graph RAG

Graph RAG Chat

Graph RAG: Knowledge Graph + Vector Search

Great for relationship queries, entity connections, and multi-hop reasoning

Combines knowledge graph traversal (Neo4j) with vector search (Pinecone). Click settings to adjust graph traversal depth.


Agentic RAG

Now let's explore the most advanced RAG technique in this guide: Agentic RAG. This approach transforms RAG from a static pipeline into an autonomous agent that reasons about retrieval strategy, iterates when needed, and self-corrects its answers.

The Problem with Static RAG Pipelines

All the techniques we've covered so far share a common limitation: they follow a fixed execution path. Whether you're using Vanilla, Hybrid, or even Corrective RAG, the system executes the same pipeline for every query.

Consider these different questions:

  • "What's Arnab's name?" → Doesn't need retrieval at all
  • "What certifications does Arnab have?" → Needs precise keyword matching
  • "Compare frontend and backend experience" → Needs into sub-questions
  • "Tell me about AI/ML projects" → Needs broad, exploratory search

A static pipeline treats all these the same way. Agentic RAG solves this by giving the system agency—the ability to reason, plan, and adapt.

What is Agentic RAG?

Agentic RAG combines the retrieval capabilities of traditional RAG with the reasoning and planning abilities of LLM agents. Instead of a fixed pipeline, an agent:

  1. Analyzes the query to determine the best approach
  2. Selects from multiple retrieval tools based on query characteristics
  3. Iterates when initial results are insufficient
  4. Decomposes complex questions into manageable sub-questions
  5. Verifies that answers are grounded in retrieved context
  6. Remembers conversation context across turns

This is one of my favorite RAG approaches, and I previously implemented it during my internship at Codemate AI for their web search and codebase context feature: Read about my Agentic RAG implementation at Codemate AI

Industry Adoption

Agentic RAG-style techniques are now widely used by modern coding agents and developer tools:

  • Cursor and other AI-powered code editors use agent loops to select retrieval strategies
  • VS Code coding agents employ tool selection for codebase search
  • Autonomous coding assistants use iterative retrieval to find relevant context

The pattern has proven effective for complex, multi-step reasoning tasks where a single retrieval pass often isn't enough.

How Agentic RAG Works

[Diagram: Agentic RAG — autonomous retrieval with tool selection. Stage 1 (query analysis): an LLM-powered analyzer classifies the query (e.g. "Compare React vs Vue" → type: comparative, decompose: true, strategy: multi-query), with conversation memory tracking prior turns and entities. Stage 2 (tool selection): the agent autonomously picks from 10+ retrieval strategies (vanilla_search, hybrid_search, rerank_search, multi_query, graph_traversal, entity_lookup, ...). Stage 3 (iterative retrieval): an agent loop (max 3 iterations) executes the tool, evaluates whether results are sufficient and relevant, and retries with a different tool or reformulated query if not. Stage 4 (self-verification): checks that the answer is grounded, flags knowledge gaps, and assesses confidence to prevent hallucination. Stage 5 (generation): the Gemini LLM streams a grounded answer from the accumulated context and verification results.]

The architecture has five distinct stages:

Stage 1 (Query Analysis): The agent analyzes the incoming query to determine:

  • Query type (simple, factual, comparative, exploratory)
  • Whether decomposition into sub-questions would help
  • Recommended retrieval strategy
  • Relevant conversation history

Stage 2 (Tool Selection): Based on the analysis, the agent autonomously selects from the available retrieval tools (a simplified mapping is sketched after the table):

| Query Pattern | Selected Tool |
| --- | --- |
| Conceptual questions | vanilla_search |
| Specific terms/names | hybrid_search |
| Precision-critical | rerank_search |
| Broad/exploratory | multi_query_search |
| Relationship queries | graph_traversal_search |
| Entity lookups | entity_lookup |
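Here is that decision matrix expressed as a simplified lookup. Treat it as an illustration only: the actual agent makes this choice through LLM reasoning guided by its system prompt, not a hard-coded map:

```ts
// Illustrative only: the demo's agent selects tools via LLM reasoning,
// not a static lookup like this.
type QueryType = 'conceptual' | 'specific' | 'precision' | 'exploratory' | 'relational' | 'entity';

const toolForQueryType: Record<QueryType, string> = {
  conceptual: 'vanilla_search',
  specific: 'hybrid_search',
  precision: 'rerank_search',
  exploratory: 'multi_query_search',
  relational: 'graph_traversal_search',
  entity: 'entity_lookup',
};

function selectTool(queryType: QueryType): string {
  return toolForQueryType[queryType];
}
```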

Stage 3 (Iterative Retrieval): The agent executes a retrieval loop:

  1. Execute the selected tool
  2. Evaluate results: Are they sufficient? Relevant?
  3. If insufficient, try a different tool or reformulate the query
  4. Repeat up to 3 iterations

Stage 4 (Verification): Before generating, the agent verifies:

  • Is the answer in retrieved context?
  • Are there knowledge gaps to acknowledge?
  • What confidence level is appropriate?

Stage 5 (Generation): Finally, the LLM generates a response using the accumulated context and verification results.

The Algorithm

```ts
// Pseudocode: Agentic RAG
async function agenticRAG(userQuery, conversationHistory) {
  // 1. ANALYZE
  // The agent first reasons about the query
  const analysis = await agent.analyzeQuery(userQuery);

  // 2. PLAN
  // Decide on retrieval strategy (or skip if not needed)
  if (analysis.queryType === 'simple') {
    return await llm.generate(userQuery); // No retrieval needed
  }

  // 3. ITERATIVE RETRIEVAL LOOP
  let context = [];
  for (let i = 0; i < MAX_ITERATIONS; i++) {
    // Agent autonomously selects and calls the best tool
    const toolResult = await agent.selectAndExecuteTool(userQuery, analysis);
    context.push(...toolResult.chunks);

    // Evaluate: Do we have enough?
    if (agent.isResultSufficient(context, userQuery)) break;

    // If not, agent decides: try different tool or reformulate
    analysis.strategy = agent.suggestNextStrategy();
  }

  // 4. VERIFY
  // Self-check before answering
  const verification = await agent.verifyGroundedness(context, userQuery);

  // 5. GENERATE
  // Answer with confidence-aware response
  return await llm.generate(userQuery, context, verification);
}
```

view the full source code here

Pros and Cons

| Pros | Cons |
| --- | --- |
| Autonomous strategy selection per query | Higher latency (multiple LLM calls) |
| Handles diverse query types optimally | More complex to implement and debug |
| Self-corrects when retrieval fails | Higher token usage |
| Decomposes complex questions | Agent reasoning can be unpredictable |
| Reduces hallucination via verification | Requires careful prompt engineering |
| Multi-turn conversation awareness | Harder to optimize for specific use cases |

Advanced Optimizations

The implementation includes several production-ready optimizations:

Tool Budgeting: Limits maximum iterations (default: 3) to prevent runaway agent loops and control costs:

```ts
stopWhen: stepCountIs(maxIterations + 2); // Allow tool calls + final response
```

Query-Aware Tool Selection: The agent system prompt includes a decision matrix mapping query patterns to optimal tools, reducing trial-and-error.

Conversation Memory: Uses the AI SDK's built-in message handling for multi-turn context, enabling queries like "What technologies did he use for it?" after discussing a specific project.

Streaming Response: Streams the final answer while tool calls execute in the background, improving perceived latency.

Other Potential Optimizations

While not implemented in the current version, these could further improve performance:

Parallel Tool Execution: For decomposed queries, execute sub-question retrievals in parallel.

Tool Result Caching: Cache results for repeated or similar queries within a session.

Confidence-Based Early Termination: Skip additional iterations when first result has very high relevance scores.

Query Classification Caching: Cache query analysis for similar query patterns.

When to Use Agentic RAG

  • Complex applications: Where query types vary significantly
  • Coding assistants: Multi-hop reasoning about codebases
  • Multi-turn conversations: When context matters across turns
  • High-stakes answers: Where self-verification is valuable
  • Exploratory interfaces: When users ask diverse question types
  • After other techniques plateau: When you've optimized retrieval but need smarter orchestration

Try Agentic RAG

Agentic RAG Chat

Agentic RAG: Autonomous Retrieval

The agent selects tools, iterates, and self-corrects

Click settings to enable/disable retrieval tools. The agent autonomously selects, iterates, and self-corrects. Demo may timeout due to long retrieval times and Vercel free tier function limits.


Conclusion

RAG has evolved from simple vector search to sophisticated autonomous systems. Each technique we explored builds on the previous:

  • Vanilla RAG establishes the foundation with semantic similarity
  • Hybrid RAG combines dense and sparse retrieval for better coverage
  • Re-ranking RAG adds precision through cross-encoder scoring
  • Multi-Query RAG improves recall through query expansion
  • Parent-Child RAG balances retrieval precision with context richness
  • Corrective RAG introduces self-evaluation and confidence awareness
  • Graph RAG captures entity relationships through knowledge graph traversal
  • Agentic RAG brings autonomous reasoning and adaptive tool selection

The best RAG system for your use case depends on your specific requirements: latency constraints, accuracy needs, query diversity, and available infrastructure. Start with Vanilla RAG to establish a baseline, then layer in techniques based on where you see gaps.

Once you have a RAG system in production, keeping it fresh becomes the next challenge. Check out my guide on keeping RAG up-to-date with Change Data Capture (CDC) to learn how to stream changes instead of batch re-indexing.

Thank you for reading! I hope these interactive demos helped you understand the nuances of each approach. If you have questions, want to discuss RAG techniques further, or are interested in hiring or collaboration, feel free to reach out at hire@codewarnab.in.
