Interactive Guide to RAG Techniques: From Vanilla to Agentic

Arnab Mondal · 45 min read

Overview

Ever wondered how ChatGPT-like systems can answer questions about your own documents? The answer is RAG (Retrieval-Augmented Generation). In this interactive guide, you won't just read about RAG—you'll experience it firsthand by chatting with different RAG implementations, all powered by my blog posts and portfolio content. We'll work through the major RAG techniques, from the simplest to the most agentic.

What is RAG?

RAG (Retrieval-Augmented Generation) is a game-changing pattern that solves one of the biggest limitations of large language models (LLMs): they don't know about your private data.

Think about it—ChatGPT was trained on internet data up to a certain date. It doesn't know about:

  • Your company's internal documents
  • Recent blog posts you've written
  • The specific codebase of your project

RAG bridges this gap by retrieving relevant context from your own data and feeding it to the LLM along with the user's question.

The RAG Pipeline (High-Level)

[Diagram: the RAG pipeline. Ingestion pipeline: Documents (blog posts, pages) → Chunking (smaller pieces) → Embedding model (text to vectors) → Vector database (Pinecone, e.g. [0.2, 0.8, ...]). Query pipeline: User query ("What projects...") → Embed query (same model) → Similarity search (find top-k chunks) → Augment prompt (add context) → LLM generation (Gemini 2.5 Flash) → informed response.]

What is a Vector Database?

Before diving into RAG techniques, let's understand the backbone of any RAG system: the vector database.

Traditional databases are great at exact matches:

  • SELECT * FROM posts WHERE title = "React hooks"
  • Find user with email = "arnab@example.com"

But what if someone asks: "What did Arnab write about frontend frameworks?"

The word "React" might not appear anywhere, but your content about React, Vue, or Next.js is absolutely relevant. This is where comes in.

[Diagram: keyword search vs. semantic search. Keyword search: the query "frontend frameworks" runs LIKE '%...%' matches against a SQL database and finds nothing, because the exact phrase never appears (even though you have posts about React, Vue, and Next.js). Semantic search: the same query is embedded and compared against document vectors (cosine similarity), returning relevant results such as "React Hooks", "Vue Tutorial", and "Next.js 15"—no exact match needed.]

Embeddings: Words as Numbers

Embeddings convert text into numerical vectors (arrays of numbers). The magic? Similar meanings produce similar vectors.

```ts
// Example embeddings (simplified to 3D for visualization)
"I love programming"   → [0.8, 0.2, 0.5]
"Coding is my passion" → [0.75, 0.25, 0.48] // Very similar!
"I hate vegetables"    → [0.1, 0.9, 0.3]    // Very different!
```

In reality, embeddings have 768 to 3072 dimensions, capturing nuanced semantic relationships.

Vector Search: Finding Similar Meanings

A vector database like Pinecone stores these embeddings and enables lightning-fast similarity searches using metrics like:

  • Cosine similarity: Measures the angle between vectors
  • Euclidean distance: Measures straight-line distance
  • Dot product: Combines magnitude and direction

When you query "frontend frameworks", the database finds vectors closest to your query's embedding—even if the stored documents say "React", "Vue", or "Next.js".
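To make the comparison concrete, here is a minimal cosine similarity implementation in plain TypeScript, using the simplified 3D embeddings from above:

```ts
// Minimal sketch: cosine similarity between two embedding vectors.
// Real vector databases use approximate nearest-neighbour indexes,
// but the underlying comparison is this same idea.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([0.8, 0.2, 0.5], [0.75, 0.25, 0.48])); // ≈ 0.998 (very similar)
console.log(cosineSimilarity([0.8, 0.2, 0.5], [0.1, 0.9, 0.3]));    // ≈ 0.45  (much less similar)
```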


Naive RAG (Vanilla RAG)

Now let's dive into the first and simplest RAG technique: Naive RAG (also called Vanilla RAG). In simple terms, Naive RAG is like an open-book exam. When you ask a question, the system searches your documents for the most relevant parts. It then gives those parts to the AI as "cheat sheets" so it can answer based on your specific data instead of just guessing.

How Vanilla RAG Works

[Diagram: Vanilla RAG flow. User query ("hackathons...") → (1) embed the query (text to vector) → (2) similarity search in Pinecone (find top-k chunks) → (3) augment the prompt (stuff context + query into the prompt) → (4) LLM generation with Gemini 2.5 Flash → informed response.]

The Algorithm

```ts
// Pseudocode: Naive RAG
async function naiveRAG(userQuery) {
  // 1. EMBED QUERY
  // Convert the user's question into a mathematical vector
  const queryEmbedding = await embeddingModel.embed(userQuery);

  // 2. RETRIEVE
  // Find the top 'k' most similar pieces of content from our database
  const relevantChunks = await vectorDatabase.search(queryEmbedding, {
    limit: 5,
  });

  // 3. AUGMENT
  // Combine the retrieved content into a single context block
  const context = relevantChunks.map((chunk) => chunk.text).join('\n');

  // Create a prompt that includes both the context and the question
  const prompt = buildPrompt(userQuery, context);

  // 4. GENERATE
  // Feed everything to the LLM to get the final answer
  return await llm.generate(prompt);
}
```

view the full source code here
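The buildPrompt helper isn't shown in the pseudocode; a minimal sketch of what such a template might look like (the demo's actual wording may differ):

```ts
// Hypothetical helper: stuff retrieved context and the question into one prompt.
// The exact template used in the demo may differ.
function buildPrompt(userQuery: string, context: string): string {
  return [
    'You are a helpful assistant. Answer the question using ONLY the context below.',
    'If the context does not contain the answer, say you do not know.',
    '',
    'Context:',
    context,
    '',
    `Question: ${userQuery}`,
  ].join('\n');
}
```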

Pros and Cons

| Pros | Cons |
| --- | --- |
| Simple to implement | No query understanding |
| Fast retrieval | Fixed chunk boundaries |
| Works well for straightforward questions | May miss relevant context |
| Low computational overhead | Struggles with specific terms and jargon |
| Great starting baseline | No semantic ranking |

When to Use Vanilla RAG

  • Prototyping: Quick proof-of-concept
  • Simple Q&A: Direct questions with clear answers in documents
  • Resource constraints: When you need minimal latency and cost
  • Baseline comparison: Before trying advanced techniques

Try Vanilla RAG Yourself

Vanilla RAG Chat

Ask anything about Arnab's portfolio


Powered by Gemini 2.5 Flash + Pinecone DB. Searches across blog posts and portfolio content.


Hybrid RAG

Now let's explore a more powerful technique: Hybrid RAG. This approach addresses a fundamental limitation of Vanilla RAG by combining the best of both worlds—semantic understanding AND exact keyword matching.

The Problem with Vanilla RAG

Remember how Vanilla RAG uses only dense embeddings for retrieval? While semantic search is great at understanding meaning, it has a critical weakness:

It struggles with specific terms, names, and technical jargon.

Consider this query: "What projects use Next.js 14?"

  • Semantic search might find content about "modern React frameworks" or "server-side rendering"—semantically related, but missing the exact version.
  • A document mentioning "Utilizes Next.js 14 app router" might rank lower than "Introduction to React frameworks" because the embedding model doesn't weight "14" as heavily as the semantic meaning.

This is where Hybrid RAG shines.

What is Hybrid RAG?

Hybrid RAG combines dense retrieval (semantic embeddings) with sparse retrieval (keyword matching like BM25) to overcome the limitations of using either approach alone.

Think of it like having two search experts:

  • Expert A (Dense): Understands the meaning and intent behind your question
  • Expert B (Sparse): Excellent at finding exact matches for specific terms

Hybrid RAG asks both experts and intelligently combines their answers.

Want to understand how dense retrieval works under the hood? Check out my deep dive on Dense Passage Retrieval (DPR).

How Hybrid RAG Works

[Diagram: Hybrid RAG flow. User query ("React hooks...") → (1a) dense retrieval (vector similarity, semantic meaning) and (1b) sparse retrieval (BM25/TF-IDF keyword matching) run in parallel, each returning top-k results with scores → (2) score fusion (RRF or weighted) combines and ranks them → (3) LLM generation gets the best of both worlds → informed response.]

The Algorithm

To combine the results, Hybrid RAG typically uses Reciprocal Rank Fusion (RRF). This algorithm, detailed in the paper Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods, works by ranking documents based on their position in the search results rather than their raw confidence scores. This ensures a fair comparison between the disparate scoring scales of dense and sparse retrieval.

```ts
// Pseudocode: Hybrid RAG
async function hybridRAG(userQuery) {
  // 1. PARALLEL RETRIEVAL
  // Run two different search strategies simultaneously
  const [denseResults, sparseResults] = await Promise.all([
    // Strategy A: Understands meaning (Vector Search)
    vectorDB.search(userQuery),
    // Strategy B: Matches exact keywords (BM25)
    keywordDB.search(userQuery),
  ]);

  // 2. FUSION (Reciprocal Rank Fusion)
  // Combine results by respecting the ranking order of each system
  // This evens out the playing field between different scoring methods
  const fusedResults = reciprocalRankFusion(denseResults, sparseResults);

  // 3. GENERATE
  // Use the best results from the combined list to answer
  const topResults = fusedResults.slice(0, 5);
  return await llm.generateWithContext(userQuery, topResults);
}

// Helper: RRF Ranking
// Simply put: "If a document is ranked high in BOTH lists, it wins."
function reciprocalRankFusion(listA, listB) {
  // ...implementation details...
  return combinedRankedList;
}
```

view the full source code here
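The reciprocalRankFusion helper is elided above, so here is a minimal sketch of the standard formula (each document scores the sum of 1 / (k + rank) across lists, with k ≈ 60 as in the original paper). The RankedDoc shape and its id field are assumptions about how results are represented:

```ts
// Minimal sketch of Reciprocal Rank Fusion.
// Each input list is assumed to be ordered best-first, with a stable `id` per document.
interface RankedDoc {
  id: string;
  text: string;
}

function reciprocalRankFusion(...lists: RankedDoc[][]): RankedDoc[] {
  const k = 60; // smoothing constant from the original RRF paper
  const scores = new Map<string, { doc: RankedDoc; score: number }>();

  for (const list of lists) {
    list.forEach((doc, index) => {
      const entry = scores.get(doc.id) ?? { doc, score: 0 };
      entry.score += 1 / (k + index + 1); // documents ranked high in several lists accumulate more score
      scores.set(doc.id, entry);
    });
  }

  return [...scores.values()]
    .sort((a, b) => b.score - a.score)
    .map((entry) => entry.doc);
}
```

The same helper generalizes to more than two lists, which is exactly how it gets reused later for Multi-Query RAG.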

Pros and Cons

| Pros | Cons |
| --- | --- |
| Best of both semantic and keyword search | More complex infrastructure |
| Handles specific terms and jargon well | Two search indices to maintain |
| More robust to vocabulary mismatch | Slightly higher latency (parallelism helps) |
| Better for diverse query types | Fusion algorithm tuning required |
| Industry-proven approach | Higher resource usage |

When to Use Hybrid RAG

  • Technical documentation: Exact API names, version numbers, function names matter
  • Legal/medical content: Precise terminology is critical
  • E-commerce: Product codes, model numbers, brand names
  • Enterprise search: Internal jargon, project names, acronyms
  • When Vanilla RAG isn't cutting it: Upgrade path when semantic-only fails

Try Hybrid RAG Yourself

Hybrid RAG Chat

Hybrid RAG: Semantic + Keyword Search

Great for specific terms, versions, and technical queries

Combines dense (semantic) + sparse (BM25) search. Click the settings icon to adjust the balance between keyword precision and semantic understanding.


Re-ranking RAG

Now let's level up with Re-ranking RAG—a technique that adds a precision layer on top of your retrieval. This is often the single biggest improvement you can make to RAG quality without changing your data or embeddings.

The Problem with Single-Stage Retrieval

Both Vanilla and Hybrid RAG use bi-encoders for retrieval—models that encode queries and documents independently:

```ts
// Bi-Encoder approach: Independent encoding
const queryVector = await embed(query);    // Query → [0.1, 0.2, ...]
const docVector = await embed(document);   // Doc   → [0.3, 0.4, ...]
const score = cosineSimilarity(queryVector, docVector); // Compare vectors
```

The fundamental limitation: There's no direct interaction between query and document during encoding. The model can't understand nuanced relationships like:

  • "What Arnab did NOT do" vs "What Arnab did"
  • "AWS certification" vs "uses AWS for deployment"
  • "internship at AI company" vs "worked on AI projects"

This leads to false positives—documents that are semantically similar but not actually relevant to the specific question.

What is Re-ranking RAG?

Re-ranking RAG adds a second-stage precision filter using a cross-encoder:

In simple words, it's like Googling something and then carefully reading the top 50 results to pick the absolute best 5 answers, instead of just trusting the search engine's top 5 blindly. First, you retrieve a lot of results fast (Retrieval), then you re-score them carefully (Re-ranking).

| Stage | Model Type | Speed | Precision | Purpose |
| --- | --- | --- | --- | --- |
| Stage 1 | Bi-Encoder | ⚡ Fast | Medium | Cast wide net (high recall) |
| Stage 2 | Cross-Encoder | 🐢 Slower | High | Precision filtering |

The key insight: Cross-encoders process query and document together:

```ts
// Cross-Encoder approach: Joint encoding
// [CLS] query tokens [SEP] document tokens [SEP] → Transformer → relevance score
const score = await crossEncoder.score(query, document); // 0.0 to 1.0
```

  • [CLS] (Classification): Added at the very beginning. The model uses the final state of this token to represent the entire input pair (Query + Document) and predict the relevance score.
  • [SEP] (Separator): Added between the Query and the Document to tell the model where one ends and the other begins (and also at the end).

With full cross-attention between query and document tokens, cross-encoders capture semantic relationships, negations, and context that bi-encoders miss.

How Re-ranking RAG Works

[Diagram: Re-ranking RAG flow. Stage 1 (recall, fast): the user query ("AWS certs?") goes through a bi-encoder vector search—embed(query) and embed(doc) independently—producing a top-K candidate set (20-50) with high recall. Stage 2 (precision, accurate): a cross-encoder re-ranks candidates using joint attention over [CLS] query [SEP] document [SEP], keeping the top-N (5-10) best matches. Stage 3: the LLM generates from this precise context for a better answer.]

The Algorithm

```ts
// Pseudocode: Re-ranking RAG
async function rerankRAG(userQuery) {
  // 1. INITIAL RETRIEVAL (Recall Phase)
  // Cast a wide net to get many potential candidates (e.g., top 50)
  // Fast, but might include irrelevant results
  const candidates = await vectorDB.search(userQuery, { k: 50 });

  // 2. RE-RANKING (Precision Phase)
  // Use a smarter, slower model to score each candidate against the query
  // "How relevant is this SPECIFIC document to this SPECIFIC question?"
  const rerankedResults = await crossEncoder.rerank({
    query: userQuery,
    documents: candidates,
    top: 5,
  });

  // 3. GENERATE
  // Feed only the high-quality, re-ranked results to the LLM
  return await llm.generateWithContext(userQuery, rerankedResults);
}
```

view the full source code here

Which Re-ranker Model?

This implementation uses Cohere's re-ranker, chosen for:

  • Low latency (~100-200ms) suitable for real-time applications
  • High quality comparable to larger models
  • Multilingual support for diverse content
  • 128K context handles long documents
  • Free tier (1000 calls/month) for experimentation

Other options include Cohere's rerank-v4.0-pro (higher quality, higher latency), Jina AI Re-ranker, Voyage AI rerank-2, or self-hosted models like cross-encoder/ms-marco-MiniLM-L-6-v2.

Pros and Cons

| Pros | Cons |
| --- | --- |
| Significantly higher precision | Additional latency (100-300ms) |
| Understands negations and context | Extra API cost per query |
| Works on top of existing retrieval | Limited by candidate quality |
| No re-indexing required | More complex pipeline |
| Often the biggest single improvement | Diminishing returns past ~50 candidates |

Advanced Optimizations

The implementation in this demo includes several production-ready optimizations:

Score Fusion: Combines the original retrieval score with the re-rank score for nuanced ranking:

```ts
// Weighted combination of bi-encoder and cross-encoder scores
const finalScore = 0.7 * rerankScore + 0.3 * originalScore;
```

Adaptive Re-ranking: Skips re-ranking when the initial retrieval confidence is already very high (e.g. top result score > 0.95 with clear separation).
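As a rough illustration of that adaptive check, here is a sketch where the 0.95 threshold and 0.1 separation margin are stand-in values rather than the demo's exact numbers:

```ts
// Sketch: skip the cross-encoder when initial retrieval already looks confident.
// Thresholds are illustrative, not the demo's exact values.
interface Candidate {
  text: string;
  score: number; // bi-encoder similarity score (higher = more similar)
}

// Assumed re-ranker interface, mirroring the pseudocode above.
declare const crossEncoder: {
  rerank(args: { query: string; documents: Candidate[]; top: number }): Promise<Candidate[]>;
};

function shouldSkipReranking(candidates: Candidate[]): boolean {
  if (candidates.length < 2) return true;
  const [top, second] = candidates;
  return top.score > 0.95 && top.score - second.score > 0.1; // confident AND clearly separated
}

async function adaptiveRerank(query: string, candidates: Candidate[]): Promise<Candidate[]> {
  if (shouldSkipReranking(candidates)) {
    return candidates.slice(0, 5); // already confident: trust the initial ranking, save a re-rank call
  }
  return crossEncoder.rerank({ query, documents: candidates, top: 5 });
}
```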

Query-Aware Over-retrieval: Retrieves more candidates for complex/vague queries and fewer for specific ones to balance recall and latency.

Other Potential Optimizations

While not implemented in this specific demo, production systems often include:

Deduplication: Removing near-duplicate chunks before re-ranking to maximize candidate diversity.

Caching: Caching re-ranking results for repeated queries to reduce API costs and latency.

When to Use Re-ranking RAG

  • Precision-critical applications: When wrong answers are costly
  • Complex queries: Multi-faceted questions requiring nuanced understanding
  • After optimizing retrieval: When you've maxed out embedding/chunking improvements
  • Technical content: Where subtle differences in wording matter greatly
  • Production RAG systems: The standard for quality-focused deployments

Try Re-ranking RAG Yourself

Re-ranking RAG Chat

Re-ranking RAG: Two-Stage Precision Retrieval

Cross-encoder re-ranking for higher precision answers

Two-stage retrieval: Click the settings icon to tune candidate count and final results. Uses Cohere cross-encoder for precision re-ranking.


Multi-Query RAG

Now let's explore Multi-Query RAG—a technique that dramatically improves recall by reformulating your question into multiple diverse queries. This addresses a fundamental limitation: a single query often misses relevant documents due to vocabulary mismatch.

The Problem with Single-Query Retrieval

Consider a user asking: "What machine learning projects has Arnab worked on?"

Even with the best embeddings, a single query might miss:

  • Documents using "AI" instead of "machine learning"
  • Content about "neural networks" or "deep learning"
  • Projects mentioning "TensorFlow" or "PyTorch" without saying "ML"
  • Experience with "NLP" or "computer vision"

The embedding model does its best, but semantic similarity has limits—especially with technical jargon, acronyms, and domain-specific terms.

What is Multi-Query RAG?

Multi-Query RAG uses an LLM to generate alternative phrasings of the user's question, then retrieves documents for all variations in parallel. The results are combined using Reciprocal Rank Fusion (RRF) to create a comprehensive, diverse result set.

Think of it like asking the same question in multiple ways to different search engines, then combining the best answers from each.
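A minimal sketch of the expansion step. The prompt wording and the llm.complete client are assumptions, not the demo's exact implementation:

```ts
// Sketch: ask an LLM for alternative phrasings of the user's question.
// `llm.complete` is a placeholder for whatever chat/completion client you use.
declare const llm: { complete(prompt: string): Promise<string> };

async function generateQueryVariations(originalQuery: string, n = 4): Promise<string[]> {
  const prompt = [
    `Generate ${n} alternative phrasings of the question below.`,
    'Use different vocabulary and synonyms so documents with other terminology can be found.',
    'Return one phrasing per line, with no numbering.',
    '',
    `Question: ${originalQuery}`,
  ].join('\n');

  const response = await llm.complete(prompt);
  const variations = response
    .split('\n')
    .map((line) => line.trim())
    .filter(Boolean);

  // Always keep the original query as well
  return [originalQuery, ...variations.slice(0, n)];
}
```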

How Multi-Query RAG Works

[Diagram: Multi-Query RAG flow. Stage 1 (query expansion): the user query ("ML projects?") is sent to an LLM query generator that produces 4-5 variations (e.g. "ML projects", "AI experience", "deep learning", "neural nets"). Stage 2 (parallel retrieval): each variation is embedded and searched independently (k=5 each). Stage 3 (fusion and generation): Reciprocal Rank Fusion (1/(k + rank)) deduplicates and sums scores, so documents "voted" by multiple queries rank higher; the top-N fused results are passed to the LLM for generation with context.]

The Algorithm

```ts
// Pseudocode: Multi-Query RAG
async function multiQueryRAG(originalQuery) {
  // 1. QUERY EXPANSION
  // Ask an LLM to generate 4-5 different ways to ask the same question
  // This helps catch documents that use different terminology
  const queryVariations = await llm.generateVariations(originalQuery);

  // 2. PARALLEL RETRIEVAL
  // Search for ALL variations at once
  const allResults = await Promise.all(queryVariations.map((query) => vectorDB.search(query)));

  // 3. FUSION
  // Combine results. Documents that appear in multiple searches get a higher score.
  // (We use Reciprocal Rank Fusion here)
  const fusedResults = reciprocalRankFusion(allResults);

  // 4. GENERATE
  // Answer using the diverse, deduplicated set of documents
  return await llm.generateWithContext(originalQuery, fusedResults);
}
```

view the full source code here

Pros and Cons

| Pros | Cons |
| --- | --- |
| Significantly higher recall | LLM call adds latency (~100-300ms) |
| Handles vocabulary mismatch | Multiplies embedding calls (4-5x) |
| Documents "voted" by multiple queries rank higher | More API costs per query |
| Captures different perspectives of the question | Query generation quality matters |
| Works with any base retrieval method | Diminishing returns past 5-6 variations |

Advanced Optimizations

The implementation in this demo includes several production-ready optimizations:

Deduplication by Similarity: Removes near-duplicate chunks that appear across different query results. Adjacent chunks from the same document are merged to maximize diversity:

```ts
// Skip chunks that are adjacent in the same document
if (existing.slug === candidate.slug && Math.abs(existing.chunkIndex - candidate.chunkIndex) <= 1) {
  // Likely overlapping content - skip
}
```

Weighted Score Fusion: An alternative to pure RRF that considers the original similarity scores. Useful when you trust the embedding quality:

```ts
// Weight original scores instead of just ranks
const finalScore = weight * originalScore;
```

Hybrid Base Retrieval: Can use either vanilla (dense-only) or hybrid (dense + sparse) retrieval as the base retriever for each query variation, combining the benefits of both approaches.

Re-ranking After Fusion: Optionally apply cross-encoder re-ranking to the fused results. This combines the high recall of multi-query with the precision of re-ranking, and is especially powerful when fused results contain many "voted" documents that need precision ranking.

Other Potential Optimizations

While not implemented in this specific demo, production systems often include:

Query Caching: Cache generated query variations for repeated or similar queries to reduce LLM calls.

Adaptive Query Count: Generate fewer variations for specific queries and more for vague or exploratory ones.

Query Quality Filtering: Score generated queries and drop low-quality variations before retrieval.

When to Use Multi-Query RAG

  • Exploratory queries: Broad questions that could match many document types
  • Domain-specific content: Technical jargon where synonyms matter
  • Comprehensive answers: When you need to cover all angles of a topic
  • Recall-critical applications: When missing relevant documents is costly
  • Before re-ranking: Generate diverse candidates, then use cross-encoder to precision-rank

Try Multi-Query RAG Yourself

Multi-Query RAG Chat

Multi-Query RAG: Diverse Query Expansion

LLM generates query variations for comprehensive coverage

Click the settings icon to configure query variations, retrieval method, and re-ranking. Results are fused with Reciprocal Rank Fusion (RRF).


Parent-Child RAG (Hierarchical Chunking)

Now let's explore Parent-Child RAG—a technique that solves the fundamental tension between retrieval precision and generation quality by using two levels of chunks: small chunks for searching, large chunks for answering.

The Problem with Fixed Chunk Sizes

Consider how we chunk documents in standard RAG:

| Chunk Size | Retrieval Quality | Generation Quality |
| --- | --- | --- |
| Small (200-300 tokens) | High precision - focused embeddings | Poor - lost context, fragmented info |
| Large (1000+ tokens) | Low precision - diluted embeddings | Good - rich context for LLM |

The dilemma: Small chunks give precise retrieval but poor context. Large chunks give rich context but imprecise retrieval. You can't win with a single chunk size.

What is Parent-Child RAG?

Parent-Child RAG uses two chunk sizes to get the best of both worlds:

In simple words, you search using small detailed snippets to find specific matches, but you give the AI a big, complete chunk of text (the parent) so it understands the full context and can answer correctly.

| Level | Size | Purpose | Storage |
| --- | --- | --- | --- |
| Child | 200-400 tokens | Retrieval (precise vector matching) | Vector index (searchable) |
| Parent | 800-1500 tokens | Generation (context for LLM) | Metadata (returned, not searched) |

The key insight: search on small, focused chunks but return large, context-rich chunks to the LLM.

How Parent-Child RAG Works

[Diagram: Parent-Child RAG. Stage 1 (ingestion, hierarchical chunking): a document (e.g. an Einstein biography) is split into parent chunks of ~1000 tokens (rich context), and each parent is split into child chunks of ~200 tokens ("Born 1879...", "1905 theory...", "Nobel 1921...", "Princeton...") tagged with a parentId. Stage 2 (query): the user query ("Einstein 1905?") is embedded and matched against the child index via cosine similarity; the matched child's parent is looked up and the full ~1000-token parent chunk is handed to the LLM. Key insight: search small, return large—child chunks give a precise match, parent chunks give the LLM rich context.]

The architecture has two distinct phases:

Ingestion Time (Steps 1-3):

  1. Split document into large parent chunks (~1000 tokens)
  2. Split each parent into small child chunks (~200 tokens)
  3. Embed only child chunks, storing parentId and parentContent in metadata

Query Time (Steps 4-5):

  4. Search child vectors, find matches, look up parent via parentId
  5. Return deduplicated parent chunks to LLM for generation
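Here is a hedged sketch of the ingestion side (steps 1-3). The splitIntoChunks and embed helpers and the upsert record shape are assumptions, not the exact pipeline code:

```ts
// Sketch: hierarchical chunking at ingestion time.
// `splitIntoChunks` and `embed` are assumed helpers, not the demo's exact API.
declare function splitIntoChunks(text: string, maxTokens: number): string[];
declare function embed(text: string): Promise<number[]>;
declare const vectorDB: { upsert(records: object[]): Promise<void> };

async function ingestDocument(docId: string, content: string) {
  const parents = splitIntoChunks(content, 1000); // ~1000-token parent chunks

  for (const [p, parentContent] of parents.entries()) {
    const parentId = `parent-${docId}-${p}`;
    const children = splitIntoChunks(parentContent, 200); // ~200-token child chunks

    const records = await Promise.all(
      children.map(async (childContent, c) => ({
        id: `child-${docId}-${p}-${c}`,
        values: await embed(childContent), // only children are embedded
        metadata: { content: childContent, parentId, parentContent },
      }))
    );

    await vectorDB.upsert(records); // only child vectors become searchable
  }
}
```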

The Algorithm

```ts
// Pseudocode: Parent-Child RAG
async function parentChildRAG(userQuery) {
  // 1. SEARCH SMALL (Precision)
  // Search against small "child" chunks (e.g., 200 words)
  // These are great for matching specific details in the query
  const childMatches = await vectorDB.search(userQuery);

  // 2. RETRIEVE LARGE (Context)
  // Instead of using the small chunk, fetch its "parent" chunk (e.g., 1000 words)
  // This provides the LLM with the full surrounding context
  const parentChunks = matchesToParents(childMatches);

  // 3. DEDUPLICATE
  // If multiple child chunks point to the same parent, only use the parent once
  const uniqueParents = unique(parentChunks);

  // 4. GENERATE
  // Answer using the rich, full-context parent chunks
  return await llm.generateWithContext(userQuery, uniqueParents);
}
```

Example: Why Parent Context Matters

Query: "Einstein 1905 theory"

Child Match (200 tokens):

```text
He published relativity in 1905...
```

Parent Returned (1000 tokens):

```text
Einstein Biography - Chapter 3

Albert Einstein was born in Ulm, Germany in 1879. After struggling
in traditional schooling, he found his passion in physics and
mathematics at the Swiss Federal Polytechnic.

In 1905, his 'Annus Mirabilis' (miracle year), Einstein published
four groundbreaking papers on the photoelectric effect, Brownian
motion, special relativity, and mass-energy equivalence.

The special relativity paper introduced E=mc² and fundamentally
changed our understanding of space and time...
```

The LLM now has full biographical context—not just the 1905 mention, but Einstein's background, education, and the significance of his discoveries.

Pros and Cons

| Pros | Cons |
| --- | --- |
| Precise retrieval with focused embeddings | Increased storage (parent content in metadata) |
| Rich context for coherent LLM responses | Slightly more complex pipeline |
| No mid-sentence cuts - parents are semantically complete | Metadata size limits (Pinecone: 40KB) |
| Reduced hallucination - more context = less guessing | Fixed parent boundaries may split related content |
| Deduplication handles multiple child matches naturally | More embedding calls during ingestion |

Advanced Optimizations

The implementation includes several production-ready optimizations:

Metadata-Based Storage: Parent content is stored directly in child chunk metadata, eliminating the need for a separate storage system or secondary lookups:

```ts
// Each child vector includes its parent in metadata
{
  id: "child-blog-post-0-2",
  vector: [...childEmbedding],
  metadata: {
    content: "...child content...",
    parentId: "parent-blog-post-0",
    parentContent: "...full parent content (~1000 tokens)...",
  }
}
```

Parent Deduplication: When multiple children from the same parent match the query, the parent is returned only once with the best child's score. This prevents context duplication and maximizes diversity.
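A minimal sketch of that deduplication step, assuming each child match carries parentId, parentContent, and a similarity score:

```ts
// Sketch: collapse multiple child matches into unique parents,
// keeping the best child's score for each parent.
interface ChildMatch {
  score: number;
  metadata: { parentId: string; parentContent: string };
}

interface ParentResult {
  parentId: string;
  content: string;
  score: number;
}

function deduplicateParents(childMatches: ChildMatch[]): ParentResult[] {
  const byParent = new Map<string, ParentResult>();

  for (const match of childMatches) {
    const { parentId, parentContent } = match.metadata;
    const existing = byParent.get(parentId);
    if (!existing || match.score > existing.score) {
      byParent.set(parentId, { parentId, content: parentContent, score: match.score });
    }
  }

  return [...byParent.values()].sort((a, b) => b.score - a.score);
}
```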

Optional Re-ranking: After parent deduplication, a cross-encoder can re-rank the parent chunks for even higher precision—combining the high recall of hierarchical retrieval with the accuracy of re-ranking.

Other Potential Optimizations

While not implemented in the current version, these optimizations could further improve performance:

Overlapping Parents: Create parents with overlap to ensure related content at boundaries isn't split awkwardly.

Dynamic Parent Sizing: Adjust parent boundaries based on content structure (e.g., respect section headers).

Multi-Level Hierarchy: For very long documents, add a grandparent level (Grandparent → Parent → Child).

When to Use Parent-Child RAG

  • Long-form content: Blog posts, documentation, technical articles
  • Context-dependent answers: When surrounding information is critical
  • Reduced hallucination: When accuracy matters more than speed
  • After vanilla RAG struggles: Natural upgrade path for better context
  • Code documentation: Where function context needs surrounding module info

Try Parent-Child RAG Yourself

Parent-Child RAG Chat

Parent-Child RAG: Hierarchical Chunking

Search small chunks, return large context-rich parents

Click the settings icon to configure child search and parent re-ranking. Best for documentation and long-form content.

The LLM can now provide a complete, actionable answer instead of a fragment.


Corrective RAG (CRAG)

Now let's explore Corrective RAG—a self-correcting technique that evaluates retrieved documents for relevance and takes corrective actions when retrieval quality is insufficient. This addresses a critical limitation: standard RAG blindly trusts whatever documents are retrieved.

The Problem with Blind Trust

Consider what happens in standard RAG when someone asks: "What is Arnab's experience with Kubernetes?"

If your vector database doesn't contain K8s-specific content, it will still return the "closest" documents—perhaps posts about Docker, cloud infrastructure, or general DevOps. Standard RAG will then confidently generate an answer based on irrelevant context, potentially fabricating K8s experience that doesn't exist.

The fundamental issue: RAG systems have no way to know when retrieval fails.

What is Corrective RAG?

Corrective RAG (CRAG) adds a relevance evaluation layer between retrieval and generation. Based on the paper Corrective Retrieval Augmented Generation (Yan et al., 2024), it introduces:

  1. Relevance Evaluation: Grade each document as CORRECT, AMBIGUOUS, or INCORRECT
  2. Confidence Assessment: Calculate overall retrieval confidence (HIGH/MEDIUM/LOW)
  3. Corrective Actions: Take different actions based on confidence level
  4. Confidence-Aware Generation: Adapt the LLM prompt based on retrieval quality

In simple words, imagine a strict teacher grading the search results. If the results are good (Correct), the teacher hands them to the student (LLM). If they are mixed (Ambiguous), the teacher filters out the bad parts. If they are completely wrong (Incorrect), the teacher tells the student to look elsewhere (web search) or admit they don't know, instead of making things up.

How Corrective RAG Works

[Diagram: Corrective RAG. Stage 1 (retrieval): the user query ("K8s exp?") goes through vector search, returning top-K docs of unknown relevance. Stage 2 (evaluation): an LLM relevance evaluator grades each document as CORRECT (relevant), AMBIGUOUS (partial), or INCORRECT (irrelevant), and overall confidence is assessed as HIGH (use directly), MEDIUM (refine), or LOW (communicate uncertainty). Stage 3 (corrective action): USE_DIRECT passes CORRECT docs to the LLM, REFINE extracts knowledge strips, COMMUNICATE_UNCERTAINTY gives an honest "I don't know". Stage 4 (generation): the prompt adapts to the confidence level—HIGH: answer confidently, cite sources; MEDIUM: acknowledge limitations, focus on relevant parts; LOW: be honest, avoid speculation.]

The architecture has four distinct stages:

Stage 1 (Retrieval): Standard vector search retrieves documents—same as Vanilla RAG.

Stage 2 (Evaluation): An LLM-as-judge evaluates each document's relevance to the query:

  • CORRECT: Document directly answers the query
  • AMBIGUOUS: Related but insufficient
  • INCORRECT: Irrelevant to the query

Stage 3 (Corrective Action): Based on the overall confidence score:

  • HIGH → Use CORRECT documents directly
  • MEDIUM → Refine knowledge by extracting relevant strips
  • LOW → Communicate uncertainty honestly

Stage 4 (Generation): LLM generates with a confidence-aware prompt that adapts behavior based on retrieval quality.
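Before looking at the full algorithm, here is an illustrative sketch of how per-document grades might be mapped to an overall confidence level. The thresholds are assumptions, not necessarily what the paper or the demo uses:

```ts
// Sketch: map per-document grades to an overall confidence level.
// Thresholds are illustrative, not the demo's exact values.
type Grade = 'CORRECT' | 'AMBIGUOUS' | 'INCORRECT';
type Confidence = 'HIGH' | 'MEDIUM' | 'LOW';

function assessConfidence(grades: Grade[]): Confidence {
  if (grades.length === 0) return 'LOW';

  const correct = grades.filter((g) => g === 'CORRECT').length;
  const ambiguous = grades.filter((g) => g === 'AMBIGUOUS').length;

  if (correct >= 2 || correct / grades.length >= 0.5) return 'HIGH';
  if (correct >= 1 || ambiguous >= 2) return 'MEDIUM';
  return 'LOW';
}
```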

The Algorithm

```ts
// Pseudocode: Corrective RAG (CRAG)
async function correctiveRAG(userQuery) {
  // 1. RETRIEVE
  // Get initial documents just like Vanilla RAG
  const documents = await vectorDB.search(userQuery);

  // 2. EVALUATE (The "Self-Correction" Step)
  // Ask an LLM to grade each document: "Is this ACTUALLY relevant?"
  const grades = await llm.evaluateRelevance(userQuery, documents);

  // 3. DECIDE & ACT
  if (grades.confidence === 'HIGH') {
    // If results are good, just use them to answer
    return await llm.generate(userQuery, documents);
  } else if (grades.confidence === 'MEDIUM') {
    // If results are mixed, filter out the noise and keep only relevant facts
    const refinedDocs = refineKnowledge(documents, grades);
    return await llm.generate(userQuery, refinedDocs);
  } else {
    // If results are poor (LOW confidence), be honest
    // We can either say "I don't know" or trigger a web search fallback
    return "I couldn't find relevant information in the knowledge base.";
  }
}
```

view the full source code here

Pros and Cons

| Pros | Cons |
| --- | --- |
| Detects and filters irrelevant documents | Additional LLM calls for evaluation (latency) |
| Reduces hallucinations from bad retrieval | Higher token usage per query |
| Provides honest uncertainty when knowledge is lacking | Evaluation quality depends on prompt engineering |
| Confidence-aware prompts improve response quality | May be overly conservative (false negatives) |
| Works on top of any base retrieval method | More complex pipeline to maintain |

Advanced Optimizations

The implementation includes several production-ready optimizations:

Knowledge Refinement: For MEDIUM confidence, documents are decomposed into knowledge strips, each classified by type (fact, context, example, definition) and relevance score. Only relevant strips are recomposed into the final context:

```ts
// Knowledge strip structure
interface KnowledgeStrip {
  content: string;
  type: 'fact' | 'context' | 'example' | 'definition';
  relevanceScore: number;
  sourceChunkId: string;
}
```
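Building on that structure, the refinement step can be sketched roughly like this (the 0.5 relevance cutoff is an illustrative assumption, not the demo's exact threshold):

```ts
// Sketch: keep only relevant strips and recompose them into a compact context.
// Uses the KnowledgeStrip interface defined above.
function refineKnowledge(strips: KnowledgeStrip[], minRelevance = 0.5): string {
  return strips
    .filter((strip) => strip.relevanceScore >= minRelevance)
    .sort((a, b) => b.relevanceScore - a.relevanceScore)
    .map((strip) => `[${strip.type}] ${strip.content}`)
    .join('\n');
}
```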

Confidence-Aware Prompting: The system prompt adapts to retrieval quality:

  • HIGH: "Answer confidently, cite sources"
  • MEDIUM: "Acknowledge limitations, focus on relevant parts"
  • LOW: "Be honest about limitations, avoid speculation"

Quick Refine Mode: For faster processing, a simplified refinement path extracts only the relevant excerpts identified during evaluation, skipping full decomposition.

Other Potential Optimizations

While not implemented in this specific version, production systems often include:

Hybrid Evaluation: Use reranker scores for fast pre-filtering, then LLM evaluation only for ambiguous cases (saves ~50% of evaluation calls).

Evaluation Caching: Cache evaluation results for repeated queries to reduce LLM calls.

Adaptive Confidence Thresholds: Adjust HIGH/MEDIUM/LOW thresholds based on query complexity or domain.

When to Use Corrective RAG

  • High-stakes applications: When wrong answers are costly (medical, legal, financial)
  • Knowledge-gap detection: When you need to know if information exists
  • User trust: When users need to know the system's confidence level
  • Heterogeneous content: When document quality varies significantly
  • After other techniques fail: When you've optimized retrieval but still see hallucinations

Try Corrective RAG

Corrective RAG Chat

Corrective RAG: Self-Correcting Retrieval

Evaluates document relevance and adapts based on confidence

Click settings to configure retrieval and refinement. Uses LLM-as-judge to evaluate document relevance and take corrective action.


Graph RAG

Now let's explore Graph RAG—a technique that extends traditional RAG with knowledge graphs to capture entities, relationships, and multi-hop connections that flat document retrieval misses. This approach excels at understanding how concepts relate to each other.

The Problem with Flat Retrieval

Standard RAG treats documents as isolated chunks. When you ask: "What technologies did Arnab use in projects that won hackathons?"

A chunk-based approach struggles because:

  • The hackathon win is mentioned in one chunk
  • The project name might be in that chunk or another
  • The technologies used are described elsewhere

There's no explicit connection between these pieces. The embedding model hopes semantic similarity bridges the gap, but complex multi-hop queries often fail.

What is Graph RAG?

combines knowledge graph traversal with vector similarity search for comprehensive retrieval:

In simple words, standard vector search finds things that "sound similar" (like finding a book by its cover description), while Graph RAG finds things that are "connected" (like finding a book because the author also wrote your favorite novel). It connects the dots between different pieces of information that might not use the same words but are related.

| Component | What It Captures | Query Strength | Storage |
| --- | --- | --- | --- |
| Knowledge Graph | Entities, relationships, structure | Multi-hop, relational queries | Neo4j (graph database) |
| Vector Store | Semantic similarity, content | Concept matching, fuzzy search | Pinecone (vector database) |

The key insight: traverse relationships when entities are explicitly connected, fall back to vectors when semantic similarity is needed, and fuse both for comprehensive coverage.

Relationship to Existing RAG

Graph RAG in this implementation builds on and reuses components from the standard RAG pipeline:

Shared Components (from normal RAG):

  • Embedding generation for vector search
  • Vector similarity search (Pinecone)
  • Cross-encoder re-ranking for precision
  • Reciprocal Rank Fusion for combining results

Graph-Specific Additions:

  • LLM-based entity extraction from queries
  • Graph node matching and relationship traversal
  • Graph context formatting (entities and relationships) for the LLM

This modular design means Graph RAG enhances rather than replaces your existing RAG infrastructure.

How Graph RAG Works

[Diagram: Graph RAG flow. User query ("AI projects...") → (1) entity extraction with the Gemini LLM → (2a) graph lookup (Neo4j query, match nodes) and (2b) n-hop traversal to find relationships, in parallel with (2c) vector search in Pinecone for semantic matches → (3) RRF fusion plus re-ranking to merge context → (4) graph-aware LLM response.]

The architecture has four main stages:

Stage 1 (Entity Extraction): An LLM extracts entities from the user's query:

  • "What AI projects did Arnab build?"[Arnab Mondal (Person), AI (Concept)]

Stage 2a (Graph Lookup & Traversal): Query Neo4j to find matching nodes, then traverse relationships up to N hops:

```cypher
// Find Arnab and traverse to projects
MATCH (p:Person {name: "Arnab Mondal"})-[:BUILT]->(proj:Project)
WHERE (proj)-[:IMPLEMENTS]->(:Concept {name: "AI"})
RETURN proj
```

Stage 2b (Vector Search): In parallel, run standard vector similarity search for semantic matches.

Stage 3 (Fusion): Combine graph results (converted to pseudo-chunks) with vector results using Reciprocal Rank Fusion, then optionally re-rank.

Stage 4 (Generation): LLM generates with both graph context (entities, relationships) and document content.

The Algorithm

```ts
// Pseudocode: Graph RAG
async function graphRAG(userQuery) {
  // 1. ENTITY EXTRACTION
  // Use an LLM to identify entities in the query
  const entities = await llm.extractEntities(userQuery);

  // 2. PARALLEL RETRIEVAL
  const [graphContext, vectorChunks] = await Promise.all([
    // 2a. Graph: Find entities, traverse relationships
    graphDB.retrieveSubgraph(entities, { maxHops: 2 }),
    // 2b. Vector: Standard semantic search
    vectorDB.search(userQuery),
  ]);

  // 3. FUSION
  // Convert graph context to chunks, fuse with vector results
  const graphChunks = graphContextToChunks(graphContext);
  let fusedResults = reciprocalRankFusion([graphChunks, vectorChunks]);

  // Optional: Re-rank for precision
  fusedResults = await reranker.rerank(userQuery, fusedResults);

  // 4. GENERATE
  // LLM receives both relationship context and document content
  return await llm.generate(userQuery, {
    entities: graphContext.entities,
    relationships: graphContext.relationships,
    documents: fusedResults,
  });
}
```

view the full source code here
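The graphContextToChunks helper above turns entities and relationships into text "pseudo-chunks" so RRF can rank them alongside vector hits. A minimal sketch, with the entity and relationship shapes assumed rather than taken from the actual schema:

```ts
// Sketch: turn graph entities/relationships into text chunks that RRF can rank.
// The entity and relationship shapes here are assumptions about the graph context.
interface GraphEntity {
  name: string;
  type: string; // e.g. "Person", "Project", "Technology"
  description?: string;
}

interface GraphRelationship {
  source: string;
  type: string; // e.g. "BUILT", "IMPLEMENTS"
  target: string;
}

interface Chunk {
  id: string;
  text: string;
}

function graphContextToChunks(context: {
  entities: GraphEntity[];
  relationships: GraphRelationship[];
}): Chunk[] {
  const entityChunks = context.entities.map((e, i) => ({
    id: `graph-entity-${i}`,
    text: `${e.name} (${e.type})${e.description ? `: ${e.description}` : ''}`,
  }));

  const relationshipChunks = context.relationships.map((r, i) => ({
    id: `graph-rel-${i}`,
    text: `${r.source} -[${r.type}]-> ${r.target}`,
  }));

  return [...entityChunks, ...relationshipChunks];
}
```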

Pros and Cons

| Pros | Cons |
| --- | --- |
| Captures explicit relationships between entities | Requires graph database (Neo4j) infrastructure |
| Handles multi-hop queries naturally | Entity extraction adds latency (~200ms) |
| Structured entity context for LLM | Graph ingestion pipeline needed |
| Parallel graph + vector for coverage | Graph schema design requires domain knowledge |
| Reuses existing RAG components | Cold start: graph must be populated first |

Advanced Optimizations

The implementation includes several production-ready optimizations:

Configurable Traversal Depth: Limit graph expansion with maxHops parameter (default: 2) to balance coverage and performance:

```ts
// Traverse up to 2 hops from matched entities
const context = await retrieveFromGraph(query, { maxHops: 2 });
```

Parallel Retrieval: Graph and vector searches run concurrently to minimize latency:

```ts
const [graphContext, vectorChunks] = await Promise.allSettled([
  retrieveFromGraph(query, graphOptions),
  vectorRetrieval(query, topK),
]);
```

Graceful Degradation: If Neo4j is unavailable, falls back to vector-only search. If Pinecone fails, uses graph-only results.
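A rough sketch of that fallback handling, assuming the retrieveFromGraph and vectorRetrieval helpers from the snippet above:

```ts
// Sketch: degrade gracefully when either backend fails.
// `retrieveFromGraph` and `vectorRetrieval` mirror the snippet above (assumed shapes).
declare function retrieveFromGraph(query: string, options: { maxHops: number }): Promise<object>;
declare function vectorRetrieval(query: string, topK: number): Promise<object[]>;

async function retrieveWithFallback(query: string, topK = 5) {
  const [graphResult, vectorResult] = await Promise.allSettled([
    retrieveFromGraph(query, { maxHops: 2 }),
    vectorRetrieval(query, topK),
  ]);

  // If Neo4j is down, graphContext is null and we fall back to vector-only results.
  const graphContext = graphResult.status === 'fulfilled' ? graphResult.value : null;
  // If Pinecone fails, vectorChunks is empty and we rely on graph-only results.
  const vectorChunks = vectorResult.status === 'fulfilled' ? vectorResult.value : [];

  if (!graphContext && vectorChunks.length === 0) {
    throw new Error('Both graph and vector retrieval failed');
  }
  return { graphContext, vectorChunks };
}
```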

Entity Type Filtering: The graph schema supports 17 entity types (Person, Project, Technology, Company, Skill, etc.) and 50+ relationship types, enabling precise queries.

Other Potential Optimizations

While not implemented in the current version, these could further improve performance:

Entity Caching: Cache entity extraction results for repeated query patterns.

Precomputed Subgraphs: For common query types, precompute relevant subgraphs during ingestion.

Hybrid Entity Matching: Combine exact name matching with fuzzy/embedding-based entity resolution.

When to Use Graph RAG

  • Knowledge-intensive domains: Where entities and relationships are well-defined
  • Multi-hop queries: Questions requiring traversal between connected concepts
  • Portfolio/resume applications: Projects, skills, companies, and their connections
  • Technical documentation: Dependencies, APIs, and their relationships
  • After building a knowledge graph: When you have structured entity data to leverage

Try Graph RAG

Graph RAG Chat

Graph RAG: Knowledge Graph + Vector Search

Great for relationship queries, entity connections, and multi-hop reasoning

Combines knowledge graph traversal (Neo4j) with vector search (Pinecone). Click settings to adjust graph traversal depth.


Agentic RAG

Now let's explore the most advanced RAG technique in this guide: Agentic RAG. This approach transforms RAG from a static pipeline into an autonomous agent that reasons about retrieval strategy, iterates when needed, and self-corrects its answers.

The Problem with Static RAG Pipelines

All the techniques we've covered so far share a common limitation: they follow a fixed execution path. Whether you're using Vanilla, Hybrid, or even Corrective RAG, the system executes the same pipeline for every query.

Consider these different questions:

  • "What's Arnab's name?" → Doesn't need retrieval at all
  • "What certifications does Arnab have?" → Needs precise keyword matching
  • "Compare frontend and backend experience" → Needs into sub-questions
  • "Tell me about AI/ML projects" → Needs broad, exploratory search

A static pipeline treats all these the same way. Agentic RAG solves this by giving the system agency—the ability to reason, plan, and adapt.

What is Agentic RAG?

Agentic RAG combines the retrieval capabilities of traditional RAG with the reasoning and planning abilities of LLM agents. Instead of a fixed pipeline, an agent:

  1. Analyzes the query to determine the best approach
  2. Selects from multiple retrieval tools based on query characteristics
  3. Iterates when initial results are insufficient
  4. Decomposes complex questions into manageable sub-questions
  5. Verifies that answers are grounded in retrieved context
  6. Remembers conversation context across turns

This is one of my favorite RAG approaches, and I previously implemented it during my internship at Codemate AI for their web search and codebase context feature: Read about my Agentic RAG implementation at Codemate AI

Industry Adoption

Agentic RAG-style techniques are now widely used by modern coding agents and developer tools:

  • Cursor and other AI-powered code editors use agent loops to select retrieval strategies
  • VS Code coding agents employ tool selection for codebase search
  • Autonomous coding assistants use iterative retrieval to find relevant context

The pattern has proven effective for complex, multi-step reasoning tasks where a single retrieval pass often isn't enough.

How Agentic RAG Works

[Diagram: Agentic RAG — autonomous retrieval with tool selection. Stage 1 (query analysis): an LLM-powered analyzer classifies the query (e.g. "Compare React vs Vue" → type: comparative, decompose: true, strategy: multi-query), with conversation memory tracking prior turns and entities. Stage 2 (tool selection): the agent autonomously picks from 10+ retrieval strategies (vanilla_search, hybrid_search, rerank_search, multi_query, graph_traversal, entity_lookup, ...). Stage 3 (iterative retrieval): an agent loop (max 3 iterations) executes the tool, evaluates whether results are sufficient and relevant, and retries with a different tool or reformulated query if not. Stage 4 (self-verification): checks that the answer is grounded, flags knowledge gaps, and assesses confidence to prevent hallucination. Stage 5 (generation): the Gemini LLM streams a grounded answer from the accumulated context and verification results.]

The architecture has five distinct stages:

Stage 1 (Query Analysis): The agent analyzes the incoming query to determine:

  • Query type (simple, factual, comparative, exploratory)
  • Whether decomposition into sub-questions would help
  • Recommended retrieval strategy
  • Relevant conversation history

Stage 2 (Tool Selection): Based on the analysis, the agent autonomously selects from the available retrieval tools (a simplified mapping is sketched after the table):

| Query Pattern | Selected Tool |
| --- | --- |
| Conceptual questions | vanilla_search |
| Specific terms/names | hybrid_search |
| Precision-critical | rerank_search |
| Broad/exploratory | multi_query_search |
| Relationship queries | graph_traversal_search |
| Entity lookups | entity_lookup |
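Here is that decision matrix expressed as a simplified lookup. Treat it as an illustration only: the actual agent makes this choice through LLM reasoning guided by its system prompt, not a hard-coded map:

```ts
// Illustrative only: the demo's agent selects tools via LLM reasoning,
// not a static lookup like this.
type QueryType = 'conceptual' | 'specific' | 'precision' | 'exploratory' | 'relational' | 'entity';

const toolForQueryType: Record<QueryType, string> = {
  conceptual: 'vanilla_search',
  specific: 'hybrid_search',
  precision: 'rerank_search',
  exploratory: 'multi_query_search',
  relational: 'graph_traversal_search',
  entity: 'entity_lookup',
};

function selectTool(queryType: QueryType): string {
  return toolForQueryType[queryType];
}
```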

Stage 3 (Iterative Retrieval): The agent executes a retrieval loop:

  1. Execute the selected tool
  2. Evaluate results: Are they sufficient? Relevant?
  3. If insufficient, try a different tool or reformulate the query
  4. Repeat up to 3 iterations

Stage 4 (Verification): Before generating, the agent verifies:

  • Is the answer in retrieved context?
  • Are there knowledge gaps to acknowledge?
  • What confidence level is appropriate?

Stage 5 (Generation): Finally, the LLM generates a response using the accumulated context and verification results.

The Algorithm

```ts
// Pseudocode: Agentic RAG
async function agenticRAG(userQuery, conversationHistory) {
  // 1. ANALYZE
  // The agent first reasons about the query
  const analysis = await agent.analyzeQuery(userQuery);

  // 2. PLAN
  // Decide on retrieval strategy (or skip if not needed)
  if (analysis.queryType === 'simple') {
    return await llm.generate(userQuery); // No retrieval needed
  }

  // 3. ITERATIVE RETRIEVAL LOOP
  let context = [];
  for (let i = 0; i < MAX_ITERATIONS; i++) {
    // Agent autonomously selects and calls the best tool
    const toolResult = await agent.selectAndExecuteTool(userQuery, analysis);
    context.push(...toolResult.chunks);

    // Evaluate: Do we have enough?
    if (agent.isResultSufficient(context, userQuery)) break;

    // If not, agent decides: try different tool or reformulate
    analysis.strategy = agent.suggestNextStrategy();
  }

  // 4. VERIFY
  // Self-check before answering
  const verification = await agent.verifyGroundedness(context, userQuery);

  // 5. GENERATE
  // Answer with confidence-aware response
  return await llm.generate(userQuery, context, verification);
}
```

view the full source code here

Pros and Cons

| Pros | Cons |
| --- | --- |
| Autonomous strategy selection per query | Higher latency (multiple LLM calls) |
| Handles diverse query types optimally | More complex to implement and debug |
| Self-corrects when retrieval fails | Higher token usage |
| Decomposes complex questions | Agent reasoning can be unpredictable |
| Reduces hallucination via verification | Requires careful prompt engineering |
| Multi-turn conversation awareness | Harder to optimize for specific use cases |

Advanced Optimizations

The implementation includes several production-ready optimizations:

Tool Budgeting: Limits maximum iterations (default: 3) to prevent runaway agent loops and control costs:

```ts
stopWhen: stepCountIs(maxIterations + 2); // Allow tool calls + final response
```

Query-Aware Tool Selection: The agent system prompt includes a decision matrix mapping query patterns to optimal tools, reducing trial-and-error.

Conversation Memory: Uses the AI SDK's built-in message handling for multi-turn context, enabling queries like "What technologies did he use for it?" after discussing a specific project.

Streaming Response: Streams the final answer while tool calls execute in the background, improving perceived latency.

Other Potential Optimizations

While not implemented in the current version, these could further improve performance:

Parallel Tool Execution: For decomposed queries, execute sub-question retrievals in parallel.

Tool Result Caching: Cache results for repeated or similar queries within a session.

Confidence-Based Early Termination: Skip additional iterations when first result has very high relevance scores.

Query Classification Caching: Cache query analysis for similar query patterns.

When to Use Agentic RAG

  • Complex applications: Where query types vary significantly
  • Coding assistants: Multi-hop reasoning about codebases
  • Multi-turn conversations: When context matters across turns
  • High-stakes answers: Where self-verification is valuable
  • Exploratory interfaces: When users ask diverse question types
  • After other techniques plateau: When you've optimized retrieval but need smarter orchestration

Try Agentic RAG

Agentic RAG Chat

Agentic RAG: Autonomous Retrieval

The agent selects tools, iterates, and self-corrects

Click settings to enable/disable retrieval tools. The agent autonomously selects, iterates, and self-corrects. Demo may timeout due to long retrieval times and Vercel free tier function limits.


Conclusion

RAG has evolved from simple vector search to sophisticated autonomous systems. Each technique we explored builds on the previous:

  • Vanilla RAG establishes the foundation with semantic similarity
  • Hybrid RAG combines dense and sparse retrieval for better coverage
  • Re-ranking RAG adds precision through cross-encoder scoring
  • Multi-Query RAG improves recall through query expansion
  • Parent-Child RAG balances retrieval precision with context richness
  • Corrective RAG introduces self-evaluation and confidence awareness
  • Graph RAG captures entity relationships through knowledge graph traversal
  • Agentic RAG brings autonomous reasoning and adaptive tool selection

The best RAG system for your use case depends on your specific requirements: latency constraints, accuracy needs, query diversity, and available infrastructure. Start with Vanilla RAG to establish a baseline, then layer in techniques based on where you see gaps.

Once you have a RAG system in production, keeping it fresh becomes the next challenge. Check out my guide on keeping RAG up-to-date with Change Data Capture (CDC) to learn how to stream changes instead of batch re-indexing.

Thank you for reading! I hope these interactive demos helped you understand the nuances of each approach. If you have questions, want to discuss RAG techniques further, or are interested in hiring or collaboration, feel free to reach out at hire@codewarnab.in.
