Interactive Guide to RAG Techniques: From Vanilla to Agentic
- Arnab Mondal · 45 min read
Overview
- What is RAG?
- What is a Vector Database?
- Naive RAG (Vanilla RAG)
- Hybrid RAG
- Re-ranking RAG
- Multi-Query RAG
- Parent-Child RAG (Hierarchical Chunking)
- Corrective RAG (CRAG)
- Graph RAG
- Agentic RAG
- Conclusion
Ever wondered how ChatGPT-like systems can answer questions about your own documents? The answer is RAG (Retrieval-Augmented Generation). In this interactive guide, you won't just read about RAG—you'll experience it firsthand by chatting with different RAG implementations, all powered by my blog posts and portfolio content. We'll work through each technique in turn, from the simplest pipeline to fully agentic retrieval.
What is RAG?
RAG (Retrieval-Augmented Generation) is a game-changing pattern that solves one of the biggest limitations of Large Language Models (LLMs): they don't know about your private data.
Think about it—ChatGPT was trained on internet data up to a certain date. It doesn't know about:
- Your company's internal documents
- Recent blog posts you've written
- The specific codebase of your project
RAG bridges this gap by retrieving relevant context from your own data and feeding it to the LLM along with the user's question.
The RAG Pipeline (High-Level)
What is a Vector Database?
Before diving into RAG techniques, let's understand the backbone of any RAG system: Vector Databases.
The Problem with Traditional Search
Traditional databases are great at exact matches:
- SELECT * FROM posts WHERE title = "React hooks"
- Find the user with email = "arnab@example.com"
But what if someone asks: "What did Arnab write about frontend frameworks?"
The word "React" might not appear anywhere, but your content about React, Vue, or Next.js is absolutely relevant. This is where semantic search comes in.
Embeddings: Words as Numbers
Embeddings convert text into numerical vectors (arrays of numbers). The magic? Similar meanings produce similar vectors.
In practice, embedding models typically produce vectors with 768 to 3,072 dimensions, capturing nuanced semantic relationships.
Vector Search: Finding Similar Meanings
A vector database like Pinecone stores these embeddings and enables lightning-fast similarity search using metrics such as:
- Cosine Similarity: Measures the angle between vectors
- Euclidean Distance: Measures straight-line distance
- Dot Product: Combines magnitude and direction
When you query "frontend frameworks", the database finds vectors closest to your query's embedding—even if the stored documents say "React", "Vue", or "Next.js".
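To make the similarity part concrete, here is a minimal TypeScript sketch of cosine-similarity ranking, independent of any particular vector database. Real systems use approximate nearest-neighbour indexes rather than this brute-force scan, and real embeddings have hundreds to thousands of dimensions.

```typescript
// Minimal sketch: rank stored embeddings against a query embedding by
// cosine similarity. This is a brute-force scan for illustration only.
type EmbeddedDoc = { id: string; vector: number[] };

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function topK(queryVector: number[], docs: EmbeddedDoc[], k = 3): EmbeddedDoc[] {
  return [...docs]
    .sort(
      (a, b) =>
        cosineSimilarity(queryVector, b.vector) -
        cosineSimilarity(queryVector, a.vector)
    )
    .slice(0, k);
}
```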
Naive RAG (Vanilla RAG)
Now let's dive into the first and simplest RAG technique: Naive RAG (also called Vanilla RAG). In simple terms, Naive RAG is like an open-book exam. When you ask a question, the system searches your documents for the most relevant parts. It then gives those parts to the AI as "cheat sheets" so it can answer based on your specific data instead of just guessing.
How Vanilla RAG Works
The Algorithm
view the full source code here
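As a rough sketch of the whole Vanilla RAG loop (not the exact code behind this demo): embed the question, query the vector index, put the top chunks into the prompt, and generate. The embedQuery, vectorIndex, and llm helpers below are hypothetical placeholders for your embedding model, vector database, and LLM client.

```typescript
// Minimal Vanilla RAG sketch. All three helpers are hypothetical placeholders.
declare function embedQuery(text: string): Promise<number[]>;
declare const vectorIndex: {
  query(args: { vector: number[]; topK: number }): Promise<{ text: string; score: number }[]>;
};
declare const llm: { generate(prompt: string): Promise<string> };

async function vanillaRag(question: string): Promise<string> {
  // 1. Embed the user's question
  const queryVector = await embedQuery(question);

  // 2. Retrieve the top-K most similar chunks
  const chunks = await vectorIndex.query({ vector: queryVector, topK: 5 });

  // 3. Ground the LLM in the retrieved context
  const context = chunks.map((c, i) => `[${i + 1}] ${c.text}`).join("\n\n");
  const prompt =
    `Answer the question using only the context below.\n\n` +
    `Context:\n${context}\n\nQuestion: ${question}`;

  // 4. Generate the answer
  return llm.generate(prompt);
}
```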
Pros and Cons
| Pros | Cons |
|---|---|
| Simple to implement | No query understanding |
| Fast retrieval | Fixed chunk boundaries |
| Works well for straightforward questions | May miss relevant context |
| Low computational overhead | Struggles with multi-hop reasoning |
| Great starting baseline | No second-stage re-ranking |
When to Use Vanilla RAG
- Prototyping: Quick proof-of-concept
- Simple Q&A: Direct questions with clear answers in documents
- Resource constraints: When you need minimal latency and cost
- Baseline comparison: Before trying advanced techniques
Try Vanilla RAG Yourself
Vanilla RAG Chat
Ask anything about Arnab's portfolio
Try clicking one of the suggestions below
Powered by Gemini 2.5 Flash + Pinecone DB. Searches across blog posts and portfolio content.
Hybrid RAG
Now let's explore a more powerful technique: Hybrid RAG. This approach addresses a fundamental limitation of Vanilla RAG by combining the best of both worlds—semantic understanding AND exact keyword matching.
The Problem with Vanilla RAG
Remember how Vanilla RAG uses only dense embeddings for retrieval? While semantic search is great at understanding meaning, it has a critical weakness:
It struggles with specific terms, names, and technical jargon.
Consider this query: "What projects use Next.js 14?"
- Dense retrieval might find content about "modern React frameworks" or "server-side rendering"—semantically related, but missing the exact version.
- A document mentioning "Utilizes Next.js 14 app router" might rank lower than "Introduction to React frameworks" because the embedding model doesn't weight "14" as heavily as the semantic meaning.
This is where Hybrid RAG shines.
What is Hybrid RAG?
Hybrid RAG combines dense retrieval (semantic embeddings) with sparse retrieval (keyword matching like BM25) to overcome the limitations of using either approach alone.
Think of it like having two search experts:
- Expert A (Dense): Understands the meaning and intent behind your question
- Expert B (Sparse): Excellent at finding exact matches for specific terms
Hybrid RAG asks both experts and intelligently combines their answers.
Want to understand how dense retrieval works under the hood? Check out my deep dive on Dense Passage Retrieval (DPR).
How Hybrid RAG Works
The Algorithm
To combine the results, Hybrid RAG typically uses Reciprocal Rank Fusion (RRF). This algorithm, detailed in the paper Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods, works by ranking documents based on their position in the search results rather than their raw confidence scores. This ensures a fair comparison between the disparate scoring scales of dense and sparse retrieval.
view the full source code here
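Here is a hedged sketch of RRF itself: each document's fused score is the sum of 1 / (k + rank) over every result list it appears in. The constant k = 60 comes from the original paper and is a common default.

```typescript
// Reciprocal Rank Fusion sketch: fuse dense and sparse result lists by rank
// position rather than raw scores, so the two scoring scales stay comparable.
type Ranked = { id: string }[];

function reciprocalRankFusion(resultLists: Ranked[], k = 60): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const list of resultLists) {
    list.forEach((doc, rank) => {
      // rank is 0-based here, hence the +1
      scores.set(doc.id, (scores.get(doc.id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Usage: const fused = reciprocalRankFusion([denseResults, sparseResults]);
```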
Pros and Cons
| Pros | Cons |
|---|---|
| Best of both semantic and keyword search | More complex infrastructure |
| Handles specific terms and jargon well | Two search indices to maintain |
| More robust to vocabulary mismatch | Slightly higher latency (parallel helps) |
| Better recall for diverse query types | Fusion algorithm tuning required |
| Industry-proven approach | Higher resource usage |
When to Use Hybrid RAG
- Technical documentation: Exact API names, version numbers, function names matter
- Legal/medical content: Precise terminology is critical
- E-commerce: Product codes, model numbers, brand names
- Enterprise search: Internal jargon, project names, acronyms
- When Vanilla RAG isn't cutting it: Upgrade path when semantic-only fails
Try Hybrid RAG Yourself
Hybrid RAG Chat
Hybrid RAG: Semantic + Keyword Search
Great for specific terms, versions, and technical queries
Combines dense (semantic) + sparse (BM25) search. Click the settings icon to adjust the balance between keyword precision and semantic understanding.
Re-ranking RAG
Now let's level up with Re-ranking RAG—a technique that adds a precision layer on top of your retrieval. This is often the single biggest improvement you can make to RAG quality without changing your data or embeddings.
The Problem with Single-Stage Retrieval
Both Vanilla and Hybrid RAG use bi-encoders for retrieval—models that encode queries and documents independently:
The fundamental limitation: There's no direct interaction between query and document tokens. The model can't understand nuanced relationships like:
- "What Arnab did NOT do" vs "What Arnab did"
- "AWS certification" vs "uses AWS for deployment"
- "internship at AI company" vs "worked on AI projects"
This leads to false positives—documents that are semantically similar but not actually relevant to the specific question.
What is Re-ranking RAG?
Re-ranking RAG adds a second-stage precision filter using a cross-encoder:
In simple words, it's like Googling something and then carefully reading the top 50 results to pick the absolute best 5 answers, instead of just trusting the search engine's top 5 blindly. First, you retrieve a lot of results fast (Retrieval), then you re-score them carefully (Re-ranking).
| Stage | Model Type | Speed | Precision | Purpose |
|---|---|---|---|---|
| Stage 1 | Bi-Encoder | ⚡ Fast | Medium | Cast wide net (high recall) |
| Stage 2 | Cross-Encoder | 🐢 Slower | High | Precision filtering |
The key insight: Cross-encoders process query and document together:
- [CLS] (Classification): Added at the very beginning. The model uses the final state of this token to represent the entire input pair (Query + Document) and predict the relevance score.
- [SEP] (Separator): Added between the Query and the Document (and again at the end) to tell the model where one ends and the other begins.
With full attention between query and document tokens, cross-encoders capture semantic relationships, negations, and context that bi-encoders miss.
How Re-ranking RAG Works
The Algorithm
view the full source code here
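Below is a simplified sketch of the two-stage flow. Both retrieveCandidates and crossEncoderRerank are hypothetical placeholders; in this demo the latter would be backed by a hosted re-ranker such as Cohere's rerank API.

```typescript
// Two-stage retrieval sketch: over-retrieve with a fast bi-encoder, then let
// a cross-encoder re-score the candidates for precision.
type Candidate = { id: string; text: string; retrievalScore: number };

declare function retrieveCandidates(query: string, topK: number): Promise<Candidate[]>;
declare function crossEncoderRerank(
  query: string,
  docs: string[]
): Promise<{ index: number; relevanceScore: number }[]>;

async function rerankingRetrieval(query: string, finalK = 5): Promise<Candidate[]> {
  // Stage 1: cast a wide net (high recall, medium precision)
  const candidates = await retrieveCandidates(query, 50);

  // Stage 2: precision filtering with the cross-encoder
  const scored = await crossEncoderRerank(query, candidates.map((c) => c.text));

  return scored
    .sort((a, b) => b.relevanceScore - a.relevanceScore)
    .slice(0, finalK)
    .map((s) => candidates[s.index]);
}
```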
Which Re-ranker Model?
This implementation uses Cohere's rerank-v4.0-fast, chosen for:
- Low latency (~100-200ms) suitable for real-time applications
- High quality comparable to larger models
- Multilingual support for diverse content
- 128K context handles long documents
- Free tier (1000 calls/month) for experimentation
Other options include Cohere's rerank-v4.0-pro (higher quality, higher latency), Jina AI Re-ranker, Voyage AI rerank-2, or self-hosted models like cross-encoder/ms-marco-MiniLM-L-6-v2.
Pros and Cons
| Pros | Cons |
|---|---|
| Significantly higher precision | Additional latency (100-300ms) |
| Understands negations and context | Extra API cost per query |
| Works on top of existing retrieval | Limited by candidate quality |
| No re-indexing required | More complex pipeline |
| Often the biggest single improvement | Diminishing returns past ~50 candidates |
Advanced Optimizations
The implementation in this demo includes several production-ready optimizations:
Score Fusion: Combines the original retrieval score with the re-rank score for nuanced ranking:
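A minimal sketch (the 0.7 / 0.3 split below is illustrative, not the exact weighting used in this demo):

```typescript
// Blend the original retrieval score with the cross-encoder score.
// Assumes both scores are normalized to roughly the 0..1 range.
function fuseScores(retrievalScore: number, rerankScore: number, alpha = 0.7): number {
  return alpha * rerankScore + (1 - alpha) * retrievalScore;
}
```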
Adaptive Re-ranking: Skips re-ranking when the initial retrieval confidence is already very high (e.g. top result score > 0.95 with clear separation).
Query-Aware Over-retrieval: Retrieves more candidates for complex/vague queries and fewer for specific ones to balance recall and latency.
Other Potential Optimizations
While not implemented in this specific demo, production systems often include:
Deduplication: Removing near-duplicate chunks before re-ranking to maximize candidate diversity.
Caching: Caching re-ranking results for repeated queries to reduce API costs and latency.
When to Use Re-ranking RAG
- Precision-critical applications: When wrong answers are costly
- Complex queries: Multi-faceted questions requiring nuanced understanding
- After optimizing retrieval: When you've maxed out embedding/chunking improvements
- Technical content: Where subtle differences in wording matter greatly
- Production RAG systems: The standard for quality-focused deployments
Try Re-ranking RAG Yourself
Re-ranking RAG Chat
Re-ranking RAG: Two-Stage Precision Retrieval
Cross-encoder re-ranking for higher precision answers
Two-stage retrieval: Click the settings icon to tune candidate count and final results. Uses Cohere cross-encoder for precision re-ranking.
Multi-Query RAG
Now let's explore Multi-Query RAG—a technique that dramatically improves recall by reformulating your question into multiple diverse queries. This addresses a fundamental limitation: a single query often misses relevant documents due to vocabulary mismatch.
The Problem with Single-Query Retrieval
Consider a user asking: "What machine learning projects has Arnab worked on?"
Even with the best embeddings, a single query might miss:
- Documents using "AI" instead of "machine learning"
- Content about "neural networks" or "deep learning"
- Projects mentioning "TensorFlow" or "PyTorch" without saying "ML"
- Experience with "NLP" or "computer vision"
The embedding model does its best, but semantic similarity has limits—especially with technical jargon, acronyms, and domain-specific terms.
What is Multi-Query RAG?
Multi-Query RAG uses an LLM to generate alternative phrasings of the user's question, then retrieves documents for all variations in parallel. The results are combined using Reciprocal Rank Fusion (RRF) to create a comprehensive, diverse result set.
Think of it like asking the same question in multiple ways to different search engines, then combining the best answers from each.
How Multi-Query RAG Works
The Algorithm
view the full source code here
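A rough sketch of the flow, not the exact demo code: generateQueryVariations (an LLM call) and retrieve (any base retriever) are hypothetical placeholders, and reciprocalRankFusion is the same fusion function shown in the Hybrid RAG section.

```typescript
// Multi-Query RAG sketch: expand the question, retrieve in parallel, fuse.
declare function generateQueryVariations(question: string, n: number): Promise<string[]>;
declare function retrieve(query: string): Promise<{ id: string; text: string }[]>;
declare function reciprocalRankFusion(
  lists: { id: string }[][]
): { id: string; score: number }[];

async function multiQueryRetrieve(question: string) {
  // 1. Ask an LLM for alternative phrasings of the question
  const variations = await generateQueryVariations(question, 4);
  const allQueries = [question, ...variations];

  // 2. Retrieve for every variation in parallel
  const resultLists = await Promise.all(allQueries.map((q) => retrieve(q)));

  // 3. Fuse with RRF: documents "voted for" by several queries rise to the top
  return reciprocalRankFusion(resultLists);
}
```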
Pros and Cons
| Pros | Cons |
|---|---|
| Significantly higher recall | LLM call adds latency (~100-300ms) |
| Handles vocabulary mismatch | Multiplies embedding calls (4-5x) |
| Documents "voted" by multiple queries rank higher | More API costs per query |
| Captures different perspectives of the question | Query generation quality matters |
| Works with any base retrieval method | Diminishing returns past 5-6 variations |
Advanced Optimizations
The implementation in this demo includes several production-ready optimizations:
Deduplication by Similarity: Removes near-duplicate chunks that appear across different query results. Adjacent chunks from the same document are merged to maximize diversity.
Weighted Score Fusion: An alternative to pure RRF that considers the original similarity scores. Useful when you trust the embedding quality:
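A sketch under the assumption that similarity scores are normalized to 0..1; the rank-versus-score split is illustrative:

```typescript
// Weighted fusion sketch: like RRF, but each occurrence also contributes its
// original similarity score, weighted by scoreWeight.
type Scored = { id: string; score: number };

function weightedScoreFusion(resultLists: Scored[][], k = 60, scoreWeight = 0.5): Scored[] {
  const fused = new Map<string, number>();
  for (const list of resultLists) {
    list.forEach((doc, rank) => {
      const rankPart = (1 - scoreWeight) * (1 / (k + rank + 1));
      const scorePart = scoreWeight * doc.score; // assumes 0..1 similarity scores
      fused.set(doc.id, (fused.get(doc.id) ?? 0) + rankPart + scorePart);
    });
  }
  return [...fused.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```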
Hybrid Base Retrieval: Can use either vanilla (dense-only) or hybrid (dense + sparse) retrieval as the base for each query variation, combining the benefits of both approaches.
Re-ranking After Fusion: Optionally apply cross-encoder re-ranking to the fused results. This combines the high recall of multi-query with the precision of re-ranking, and is especially powerful when fused results contain many "voted" documents that need precision ranking.
Other Potential Optimizations
While not implemented in this specific demo, production systems often include:
Query Caching: Cache generated query variations for repeated or similar queries to reduce LLM calls.
Adaptive Query Count: Generate fewer variations for specific queries and more for vague or exploratory ones.
Query Quality Filtering: Score generated queries and drop low-quality variations before retrieval.
When to Use Multi-Query RAG
- Exploratory queries: Broad questions that could match many document types
- Domain-specific content: Technical jargon where synonyms matter
- Comprehensive answers: When you need to cover all angles of a topic
- Recall-critical applications: When missing relevant documents is costly
- Before re-ranking: Generate diverse candidates, then use cross-encoder to precision-rank
Try Multi-Query RAG Yourself
Multi-Query RAG Chat
Multi-Query RAG: Diverse Query Expansion
LLM generates query variations for comprehensive coverage
Click the settings icon to configure query variations, retrieval method, and re-ranking. Results are fused with Reciprocal Rank Fusion (RRF).
Parent-Child RAG (Hierarchical Chunking)
Now let's explore Parent-Child RAG—a technique that solves the fundamental tension between retrieval precision and generation quality by using two levels of chunks: small chunks for searching, large chunks for answering.
The Problem with Fixed Chunk Sizes
Consider how we chunk documents in standard RAG:
| Chunk Size | Retrieval Quality | Generation Quality |
|---|---|---|
| Small (200-300 tokens) | High precision - focused embeddings | Poor - lost context, fragmented info |
| Large (1000+ tokens) | Low precision - diluted embeddings | Good - rich context for LLM |
The dilemma: Small chunks give precise retrieval but poor context. Large chunks give rich context but imprecise retrieval. You can't win with a single chunk size.
What is Parent-Child RAG?
Parent-Child RAG uses two chunk sizes to get the best of both worlds:
In simple words, you search using small detailed snippets to find specific matches, but you give the AI a big, complete chunk of text (the parent) so it understands the full context and can answer correctly.
| Level | Size | Purpose | Storage |
|---|---|---|---|
| Child | 200-400 tokens | Retrieval (precise vector matching) | Vector index (searchable) |
| Parent | 800-1500 tokens | Generation (context for LLM) | Metadata (returned, not searched) |
The key insight: search on small, focused chunks but return large, context-rich chunks to the LLM.
How Parent-Child RAG Works
The architecture has two distinct phases:
Ingestion Time (Steps 1-3):
- Split document into large parent chunks (~1000 tokens)
- Split each parent into small child chunks (~200 tokens)
- Embed only the child chunks, storing parentId and parentContent in metadata
Query Time (Steps 4-5):
4. Search child vectors, find matches, and look up the parent via parentId
5. Return deduplicated parent chunks to the LLM for generation
The Algorithm
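A condensed sketch of both phases. Here embed, splitIntoChunks, and the vectorIndex object are hypothetical placeholders standing in for the real embedding model and Pinecone client.

```typescript
// Parent-Child RAG sketch: embed small children, keep the parent text in
// each child's metadata, and deduplicate parents at query time.
declare function embed(text: string): Promise<number[]>;
declare function splitIntoChunks(text: string, maxTokens: number): string[];
declare const vectorIndex: {
  upsert(records: { id: string; values: number[]; metadata: Record<string, string> }[]): Promise<void>;
  query(args: { vector: number[]; topK: number }): Promise<
    { score: number; metadata: { parentId: string; parentContent: string } }[]
  >;
};

async function ingest(docId: string, text: string) {
  const parents = splitIntoChunks(text, 1000); // ~1000-token parents
  for (const [p, parent] of parents.entries()) {
    const children = splitIntoChunks(parent, 200); // ~200-token children
    for (const [c, child] of children.entries()) {
      await vectorIndex.upsert([{
        id: `${docId}-p${p}-c${c}`,
        values: await embed(child), // only children are embedded
        metadata: { parentId: `${docId}-p${p}`, parentContent: parent },
      }]);
    }
  }
}

async function retrieveParents(query: string, topK = 10) {
  const matches = await vectorIndex.query({ vector: await embed(query), topK });
  // Deduplicate: keep only the best-scoring child per parent
  const parents = new Map<string, { content: string; score: number }>();
  for (const m of matches) {
    const existing = parents.get(m.metadata.parentId);
    if (!existing || m.score > existing.score) {
      parents.set(m.metadata.parentId, { content: m.metadata.parentContent, score: m.score });
    }
  }
  return [...parents.values()].sort((a, b) => b.score - a.score);
}
```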
Example: Why Parent Context Matters
Query: "Einstein 1905 theory"
Child Match (200 tokens):
Parent Returned (1000 tokens):
The LLM now has full biographical context—not just the 1905 mention, but Einstein's background, education, and the significance of his discoveries.
Pros and Cons
| Pros | Cons |
|---|---|
| Precise retrieval with focused embeddings | Increased storage (parent content in metadata) |
| Rich context for coherent LLM responses | Slightly more complex ingestion pipeline |
| No mid-sentence cuts - parents are semantically complete | Metadata size limits (Pinecone: 40KB) |
| Reduced hallucination - more context = less guessing | Fixed parent boundaries may split related content |
| Deduplication handles multiple child matches naturally | More embedding calls during ingestion |
Advanced Optimizations
The implementation includes several production-ready optimizations:
Metadata-Based Storage: Parent content is stored directly in child chunk metadata, eliminating the need for a separate storage system or secondary lookups.
Parent Deduplication: When multiple children from the same parent match the query, the parent is returned only once with the best child's score. This prevents context duplication and maximizes diversity.
Optional Re-ranking: After parent deduplication, a cross-encoder can re-rank the parent chunks for even higher precision—combining the high recall of hierarchical retrieval with the accuracy of re-ranking.
Other Potential Optimizations
While not implemented in the current version, these optimizations could further improve performance:
Sliding Window Parents: Create parents with overlap to ensure related content at boundaries isn't split awkwardly.
Dynamic Parent Sizing: Adjust parent boundaries based on content structure (e.g., respect section headers).
Multi-Level Hierarchy: For very long documents, add a grandparent level (Grandparent → Parent → Child).
When to Use Parent-Child RAG
- Long-form content: Blog posts, documentation, technical articles
- Context-dependent answers: When surrounding information is critical
- Reduced hallucination: When accuracy matters more than speed
- After vanilla RAG struggles: Natural upgrade path for better context
- Code documentation: Where function context needs surrounding module info
Try Parent-Child RAG Yourself
Parent-Child RAG Chat
Parent-Child RAG: Hierarchical Chunking
Search small chunks, return large context-rich parents
Click the settings icon to configure child search and parent re-ranking. Best for documentation and long-form content.
Corrective RAG (CRAG)
Now let's explore Corrective RAG—a self-correcting technique that evaluates retrieved documents for relevance and takes corrective actions when retrieval quality is insufficient. This addresses a critical limitation: standard RAG blindly trusts whatever documents are retrieved.
The Problem with Blind Trust
Consider what happens in standard RAG when someone asks: "What is Arnab's experience with Kubernetes?"
If your vector database doesn't contain K8s-specific content, it will still return the "closest" documents—perhaps posts about Docker, cloud infrastructure, or general DevOps. Standard RAG will then confidently generate an answer based on irrelevant context, potentially hallucinating K8s experience that doesn't exist.
The fundamental issue: RAG systems have no way to know when retrieval fails.
What is Corrective RAG?
Corrective RAG (CRAG) adds a relevance evaluation layer between retrieval and generation. Based on the paper Corrective Retrieval Augmented Generation (Yan et al., 2024), it introduces:
- Relevance Evaluation: Grade each document as CORRECT, AMBIGUOUS, or INCORRECT
- Confidence Assessment: Calculate overall retrieval confidence (HIGH/MEDIUM/LOW)
- Corrective Actions: Take different actions based on confidence level
- Confidence-Aware Generation: Adapt the LLM prompt based on retrieval quality
In simple words, imagine a strict teacher grading the search results. If the results are good (Correct), the teacher hands them to the student (LLM). If they are mixed (Ambiguous), the teacher filters out the bad parts. If they are completely wrong (Incorrect), the teacher tells the student to look elsewhere (web search) or admit they don't know, instead of making things up.
How Corrective RAG Works
The architecture has four distinct stages:
Stage 1 (Retrieval): Standard vector search retrieves top-K documents—same as Vanilla RAG.
Stage 2 (Evaluation): An LLM-as-judge evaluates each document's relevance to the query:
- CORRECT: Document directly answers the query
- AMBIGUOUS: Related but insufficient
- INCORRECT: Irrelevant to the query
Stage 3 (Corrective Action): Based on the overall confidence score:
- HIGH → Use CORRECT documents directly
- MEDIUM → Refine knowledge by extracting relevant strips
- LOW → Communicate uncertainty honestly
Stage 4 (Generation): LLM generates with a confidence-aware prompt that adapts behavior based on retrieval quality.
The Algorithm
view the full source code here
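A simplified sketch of stages 1 through 3. Here gradeRelevance is a hypothetical LLM-as-judge call, and the confidence thresholds are illustrative rather than the exact rules used in this demo.

```typescript
// Corrective RAG sketch: grade each document, compute overall confidence,
// then take the corresponding corrective action.
type Grade = "CORRECT" | "AMBIGUOUS" | "INCORRECT";
type Confidence = "HIGH" | "MEDIUM" | "LOW";

declare function retrieve(query: string): Promise<{ id: string; text: string }[]>;
declare function gradeRelevance(query: string, doc: string): Promise<Grade>; // LLM-as-judge

function overallConfidence(grades: Grade[]): Confidence {
  if (grades.length === 0) return "LOW";
  const correct = grades.filter((g) => g === "CORRECT").length;
  if (correct >= grades.length / 2) return "HIGH";       // illustrative threshold
  if (grades.some((g) => g !== "INCORRECT")) return "MEDIUM";
  return "LOW";
}

async function correctiveRetrieve(query: string) {
  const docs = await retrieve(query);
  const grades = await Promise.all(docs.map((d) => gradeRelevance(query, d.text)));
  const confidence = overallConfidence(grades);

  if (confidence === "HIGH") {
    // Use only documents graded CORRECT
    return { confidence, docs: docs.filter((_, i) => grades[i] === "CORRECT") };
  }
  if (confidence === "MEDIUM") {
    // Keep anything not clearly irrelevant; refine into knowledge strips downstream
    return { confidence, docs: docs.filter((_, i) => grades[i] !== "INCORRECT") };
  }
  // LOW: signal the generator to be honest about missing knowledge
  return { confidence, docs: [] };
}
```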
Pros and Cons
| Pros | Cons |
|---|---|
| Detects and filters irrelevant documents | Additional LLM calls for evaluation (latency) |
| Reduces hallucinations from bad retrieval | Higher token usage per query |
| Provides honest uncertainty when knowledge is lacking | Evaluation quality depends on prompt engineering |
| Confidence-aware prompts improve response quality | May be overly conservative (false negatives) |
| Works on top of any base retrieval method | More complex pipeline to maintain |
Advanced Optimizations
The implementation includes several production-ready optimizations:
Knowledge Refinement: For MEDIUM confidence, documents are decomposed into atomic knowledge strips, each classified by type (fact, context, example, definition) and relevance score. Only relevant strips are recomposed into the final context:
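A rough sketch, where scoreStrip is a hypothetical LLM call and the 0.6 relevance cutoff is illustrative:

```typescript
// Knowledge refinement for MEDIUM confidence: split documents into strips,
// score each strip against the query, and recompose only the relevant ones.
type Strip = { text: string; type: "fact" | "context" | "example" | "definition"; relevance: number };

declare function scoreStrip(query: string, strip: string): Promise<Strip>;

async function refineKnowledge(query: string, docs: string[]): Promise<string> {
  // Naive strip split on sentence and paragraph boundaries
  const rawStrips = docs.flatMap((d) => d.split(/\n{2,}|(?<=[.!?])\s+/));
  const scored = await Promise.all(rawStrips.map((s) => scoreStrip(query, s)));
  return scored
    .filter((s) => s.relevance >= 0.6)
    .map((s) => s.text)
    .join("\n");
}
```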
Confidence-Aware Prompting: The system prompt adapts to retrieval quality:
- HIGH: "Answer confidently, cite sources"
- MEDIUM: "Acknowledge limitations, focus on relevant parts"
- LOW: "Be honest about limitations, avoid speculation"
Quick Refine Mode: For faster processing, a simplified refinement path extracts only the relevant excerpts identified during evaluation, skipping full decomposition.
Other Potential Optimizations
While not implemented in this specific version, production systems often include:
Hybrid Evaluation: Use reranker scores for fast pre-filtering, then LLM evaluation only for ambiguous cases (saves ~50% of evaluation calls).
Evaluation Caching: Cache evaluation results for repeated queries to reduce LLM calls.
Adaptive Confidence Thresholds: Adjust HIGH/MEDIUM/LOW thresholds based on query complexity or domain.
When to Use Corrective RAG
- High-stakes applications: When wrong answers are costly (medical, legal, financial)
- Knowledge-gap detection: When you need to know if information exists
- User trust: When users need to know the system's confidence level
- Heterogeneous content: When document quality varies significantly
- After other techniques fail: When you've optimized retrieval but still see hallucinations
Try Corrective RAG
Corrective RAG Chat
Corrective RAG: Self-Correcting Retrieval
Evaluates document relevance and adapts based on confidence
Click settings to configure retrieval and refinement. Uses LLM-as-judge to evaluate document relevance and take corrective action.
Graph RAG
Now let's explore Graph RAG—a technique that extends traditional RAG with knowledge graphs to capture entities, relationships, and multi-hop connections that flat document retrieval misses. This approach excels at understanding how concepts relate to each other.
The Problem with Flat Retrieval
Standard RAG treats documents as isolated chunks. When you ask: "What technologies did Arnab use in projects that won hackathons?"
A chunk-based approach struggles because:
- The hackathon win is mentioned in one chunk
- The project name might be in that chunk or another
- The technologies used are described elsewhere
There's no explicit connection between these pieces. The embedding model hopes semantic similarity bridges the gap, but complex multi-hop queries often fail.
What is Graph RAG?
Graph RAG combines knowledge graph traversal with vector similarity search for comprehensive retrieval:
In simple words, standard vector search finds things that "sound similar" (like finding a book by its cover description), while Graph RAG finds things that are "connected" (like finding a book because the author also wrote your favorite novel). It connects the dots between different pieces of information that might not use the same words but are related.
| Component | What It Captures | Query Strength | Storage |
|---|---|---|---|
| Knowledge Graph | Entities, relationships, structure | Multi-hop, relational queries | Neo4j (graph database) |
| Vector Store | Semantic similarity, content | Concept matching, fuzzy search | Pinecone (vector database) |
The key insight: traverse relationships when entities are explicitly connected, fall back to vectors when semantic similarity is needed, and fuse both for comprehensive coverage.
Relationship to Existing RAG
Graph RAG in this implementation builds on and reuses components from the standard RAG pipeline:
Shared Components (from normal RAG):
- Embedding generation for vector search
- Vector similarity search (Pinecone)
- Cross-encoder re-ranking for precision
- Reciprocal Rank Fusion for combining results
Graph-Specific Additions:
- LLM-based entity extraction from queries
- Graph node matching and relationship traversal
- Subgraph context formatting for the LLM
This modular design means Graph RAG enhances rather than replaces your existing RAG infrastructure.
How Graph RAG Works
The architecture has four main stages:
Stage 1 (Entity Extraction): An LLM extracts entities from the user's query:
- "What AI projects did Arnab build?" →
[Arnab Mondal (Person), AI (Concept)]
Stage 2a (Graph Lookup & Traversal): Query Neo4j to find matching nodes, then traverse relationships up to N hops:
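A sketch using the neo4j-driver package; the node property names and traversal pattern are assumptions about the schema, not the exact queries used in this implementation.

```typescript
// Graph traversal sketch: match entity nodes by name, then expand up to
// maxHops relationships around them.
import neo4j from "neo4j-driver";

const driver = neo4j.driver(
  process.env.NEO4J_URI!,
  neo4j.auth.basic(process.env.NEO4J_USER!, process.env.NEO4J_PASSWORD!)
);

async function traverse(entityNames: string[], maxHops = 2) {
  const session = driver.session();
  try {
    // Cypher does not allow a parameter inside the variable-length bound,
    // so the hop limit is interpolated into the pattern string.
    const result = await session.run(
      `MATCH (e) WHERE e.name IN $names
       MATCH path = (e)-[*1..${maxHops}]-(related)
       RETURN path LIMIT 50`,
      { names: entityNames }
    );
    return result.records.map((r) => r.get("path"));
  } finally {
    await session.close();
  }
}
```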
Stage 2b (Vector Search): In parallel, run standard vector similarity search for semantic matches.
Stage 3 (Fusion): Combine graph results (converted to pseudo-chunks) with vector results using Reciprocal Rank Fusion, then optionally re-rank.
Stage 4 (Generation): LLM generates with both graph context (entities, relationships) and document content.
The Algorithm
view the full source code here
Pros and Cons
| Pros | Cons |
|---|---|
| Captures explicit relationships between entities | Requires graph database (Neo4j) infrastructure |
| Handles multi-hop queries naturally | Entity extraction adds latency (~200ms) |
| Structured entity context for LLM | Graph ingestion pipeline needed |
| Parallel graph + vector for coverage | Schema design requires domain knowledge |
| Reuses existing RAG components | Cold-start: graph must be populated |
Advanced Optimizations
The implementation includes several production-ready optimizations:
Configurable Traversal Depth: Limit graph expansion with the maxHops parameter (default: 2) to balance coverage and performance.
Parallel Retrieval: Graph and vector searches run concurrently to minimize latency:
Graceful Degradation: If Neo4j is unavailable, falls back to vector-only search. If Pinecone fails, uses graph-only results.
Entity Type Filtering: The graph schema supports 17 entity types (Person, Project, Technology, Company, Skill, etc.) and 50+ relationship types, enabling precise queries.
Other Potential Optimizations
While not implemented in the current version, these could further improve performance:
Entity Caching: Cache entity extraction results for repeated query patterns.
Precomputed Subgraphs: For common query types, precompute relevant subgraphs during ingestion.
Hybrid Entity Matching: Combine exact name matching with fuzzy/embedding-based entity resolution.
When to Use Graph RAG
- Knowledge-intensive domains: Where entities and relationships are well-defined
- Multi-hop queries: Questions requiring traversal between connected concepts
- Portfolio/resume applications: Projects, skills, companies, and their connections
- Technical documentation: Dependencies, APIs, and their relationships
- After building a knowledge graph: When you have structured entity data to leverage
Try Graph RAG
Graph RAG Chat
Graph RAG: Knowledge Graph + Vector Search
Great for relationship queries, entity connections, and multi-hop reasoning
Combines knowledge graph traversal (Neo4j) with vector search (Pinecone). Click settings to adjust graph traversal depth.
Agentic RAG
Now let's explore the most advanced RAG technique in this guide: Agentic RAG. This approach transforms RAG from a static pipeline into an autonomous agent that reasons about retrieval strategy, iterates when needed, and self-corrects its answers.
The Problem with Static RAG Pipelines
All the techniques we've covered so far share a common limitation: they follow a fixed execution path. Whether you're using Vanilla, Hybrid, or even Corrective RAG, the system executes the same pipeline for every query.
Consider these different questions:
- "What's Arnab's name?" → Doesn't need retrieval at all
- "What certifications does Arnab have?" → Needs precise keyword matching
- "Compare frontend and backend experience" → Needs decomposition into sub-questions
- "Tell me about AI/ML projects" → Needs broad, exploratory search
A static pipeline treats all these the same way. Agentic RAG solves this by giving the system agency—the ability to reason, plan, and adapt.
What is Agentic RAG?
Agentic RAG combines the retrieval capabilities of traditional RAG with the reasoning and planning abilities of LLM agents. Instead of a fixed pipeline, an autonomous agent:
- Analyzes the query to determine the best approach
- Selects from multiple retrieval tools based on query characteristics
- Iterates when initial results are insufficient
- Decomposes complex questions into manageable sub-questions
- Verifies that answers are grounded in retrieved context
- Remembers conversation context across turns
This is one of my favorite RAG approaches, and I previously implemented it during my internship at Codemate AI for their web search and codebase context feature: Read about my Agentic RAG implementation at Codemate AI
Industry Adoption
Agentic RAG-style techniques are now widely used by modern coding agents and developer tools:
- Cursor and other AI-powered code editors use agent loops to select retrieval strategies
- VS Code coding agents employ tool selection for codebase search
- Autonomous coding assistants use iterative retrieval to find relevant context
The pattern has proven effective for complex, multi-step reasoning tasks where a single retrieval pass often isn't enough.
How Agentic RAG Works
The architecture has five distinct stages:
Stage 1 (Query Analysis): The agent analyzes the incoming query to determine:
- Query type (simple, factual, comparative, exploratory)
- Whether decomposition into sub-questions would help
- Recommended retrieval strategy
- Relevant conversation history
Stage 2 (Tool Selection): Based on the analysis, the agent autonomously selects from available retrieval tools:
| Query Pattern | Selected Tool |
|---|---|
| Conceptual questions | vanilla_search |
| Specific terms/names | hybrid_search |
| Precision-critical | rerank_search |
| Broad/exploratory | multi_query_search |
| Relationship queries | graph_traversal_search |
| Entity lookups | entity_lookup |
Stage 3 (Iterative Retrieval): The agent executes a retrieval loop:
- Execute the selected tool
- Evaluate results: Are they sufficient? Relevant?
- If insufficient, try a different tool or reformulate the query
- Repeat up to 3 iterations
Stage 4 (Verification): Before generating, the agent verifies:
- Is the answer grounded in retrieved context?
- Are there knowledge gaps to acknowledge?
- What confidence level is appropriate?
Stage 5 (Generation): Finally, the LLM generates a response using the accumulated context and verification results.
The Algorithm
view the full source code here
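A hedged sketch of the control flow. Every helper here is a hypothetical LLM-backed function; the real demo delegates the loop to the AI SDK's tool-calling machinery, but the shape of the loop is the same.

```typescript
// Agentic RAG loop sketch: analyze the query, pick a tool, retrieve,
// evaluate, and iterate up to a fixed budget before generating.
type ToolName =
  | "vanilla_search" | "hybrid_search" | "rerank_search"
  | "multi_query_search" | "graph_traversal_search";
type Chunk = { id: string; text: string };

declare function analyzeQuery(query: string): Promise<{ tool: ToolName; reformulated?: string }>;
declare const tools: Record<ToolName, (query: string) => Promise<Chunk[]>>;
declare function evaluateResults(query: string, chunks: Chunk[]): Promise<{ sufficient: boolean; nextTool?: ToolName }>;
declare function generateAnswer(query: string, context: Chunk[]): Promise<string>;

async function agenticRag(query: string, maxIterations = 3): Promise<string> {
  // Stage 1: query analysis and initial tool selection
  let { tool, reformulated } = await analyzeQuery(query);
  let context: Chunk[] = [];

  for (let i = 0; i < maxIterations; i++) {
    // Stages 2-3: execute the selected tool and accumulate context
    context = context.concat(await tools[tool](reformulated ?? query));

    const verdict = await evaluateResults(query, context);
    if (verdict.sufficient) break; // enough grounded context, stop iterating

    // Insufficient: switch tools (or fall back to a broader search) and retry
    tool = verdict.nextTool ?? "multi_query_search";
  }

  // Stages 4-5: verification folded into the generation prompt for brevity
  return generateAnswer(query, context);
}
```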
Pros and Cons
| Pros | Cons |
|---|---|
| Autonomous strategy selection per query | Higher latency (multiple LLM calls) |
| Handles diverse query types optimally | More complex to implement and debug |
| Self-corrects when retrieval fails | Higher token usage |
| Decomposes complex questions | Agent reasoning can be unpredictable |
| Reduces hallucination via verification | Requires careful prompt engineering |
| Multi-turn conversation awareness | Harder to optimize for specific use cases |
Advanced Optimizations
The implementation includes several production-ready optimizations:
Tool Budgeting: Limits maximum iterations (default: 3) to prevent runaway agent loops and control costs.
Query-Aware Tool Selection: The agent system prompt includes a decision matrix mapping query patterns to optimal tools, reducing trial-and-error.
Conversation Memory: Uses the AI SDK's built-in message handling for multi-turn context, enabling queries like "What technologies did he use for it?" after discussing a specific project.
Streaming Response: Streams the final answer while tool calls execute in the background, improving perceived latency.
Other Potential Optimizations
While not implemented in the current version, these could further improve performance:
Parallel Tool Execution: For decomposed queries, execute sub-question retrievals in parallel.
Tool Result Caching: Cache results for repeated or similar queries within a session.
Confidence-Based Early Termination: Skip additional iterations when first result has very high relevance scores.
Query Classification Caching: Cache query analysis for similar query patterns.
When to Use Agentic RAG
- Complex applications: Where query types vary significantly
- Coding assistants: Multi-hop reasoning about codebases
- Multi-turn conversations: When context matters across turns
- High-stakes answers: Where self-verification is valuable
- Exploratory interfaces: When users ask diverse question types
- After other techniques plateau: When you've optimized retrieval but need smarter orchestration
Try Agentic RAG
Agentic RAG Chat
Agentic RAG: Autonomous Retrieval
The agent selects tools, iterates, and self-corrects
Click settings to enable/disable retrieval tools. The agent autonomously selects, iterates, and self-corrects. The demo may time out due to long retrieval times and Vercel free-tier function limits.
Conclusion
RAG has evolved from simple vector search to sophisticated autonomous systems. Each technique we explored builds on the previous:
- Vanilla RAG establishes the foundation with semantic similarity
- Hybrid RAG combines dense and sparse retrieval for better coverage
- Re-ranking RAG adds precision through cross-encoder scoring
- Multi-Query RAG improves recall through query expansion
- Parent-Child RAG balances retrieval precision with context richness
- Corrective RAG introduces self-evaluation and confidence awareness
- Graph RAG captures entity relationships through knowledge graph traversal
- Agentic RAG brings autonomous reasoning and adaptive tool selection
The best RAG system for your use case depends on your specific requirements: latency constraints, accuracy needs, query diversity, and available infrastructure. Start with Vanilla RAG to establish a baseline, then layer in techniques based on where you see gaps.
Once you have a RAG system in production, keeping it fresh becomes the next challenge. Check out my guide on keeping RAG up-to-date with Change Data Capture (CDC) to learn how to stream changes instead of batch re-indexing.
Thank you for reading! I hope these interactive demos helped you understand the nuances of each approach. If you have questions, want to discuss RAG techniques further, or are interested in hiring or collaboration, feel free to reach out at hire@codewarnab.in.