04. Retrieval-Augmented Generation#

Large Language Models (LLMs) are brilliant generalists — they’ve read the internet and can reason across domains — but they don’t know what they haven’t seen. Their parameters store general knowledge, not private, up-to-date, or domain-specific facts. Retrieval-Augmented Generation (RAG) bridges that gap. It combines:

  1. Retrieval – find relevant information from an external knowledge base (e.g., docs, databases, websites).

  2. Generation – pass that retrieved context into an LLM to ground its answer.

This simple loop — retrieve → augment → generate — makes the model:

  • More accurate (uses real facts, not hallucinations)

  • More current (retrieval can include recent or proprietary data)

  • Cheaper & smaller (you don’t need to fine-tune large models for every dataset)

  • Explainable (you can trace answers back to the retrieved sources)

RAG is now the foundation of modern enterprise AI systems, powering products like search-chat hybrids, coding copilots, knowledge assistants, and customer-support bots. In short: RAG makes LLMs grounded, trustworthy, and useful in the real world.
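The retrieve → augment → generate loop can be sketched in a few lines. Everything here is a stub: `retrieve` and `llm` stand in for a real vector store and a real model call, so treat this as the shape of the pattern rather than an implementation.

```python
def rag_answer(query: str, retrieve, llm) -> str:
    # 1. Retrieve: fetch relevant passages from an external knowledge base
    chunks = retrieve(query)
    # 2. Augment: splice the evidence into the prompt
    context = "\n---\n".join(chunks)
    prompt = (
        f"Answer using ONLY the context below.\n\n"
        f"CONTEXT:\n{context}\n\nQUESTION: {query}"
    )
    # 3. Generate: the LLM grounds its answer in the retrieved context
    return llm(prompt)

# Toy demo with stub components
fake_retrieve = lambda q: ["Pikachu is an Electric-type Pokémon."]
fake_llm = lambda p: p.splitlines()[3]  # pretend-LLM: echoes the first context line
print(rag_answer("What type is Pikachu?", fake_retrieve, fake_llm))
```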

I highly recommend watching explanations of RAG from IBM and Cole Medin.

Scenario: Pokémon!#

Let’s take something seemingly simple — asking questions about Pokémon species like Pikachu, Charizard, or Mewtwo. At first glance, LLMs might seem to know this, but there are hidden challenges:

| Problem | Why it’s hard for an LLM |
| --- | --- |
| Data freshness | Game mechanics, move sets, and forms change with every generation — LLMs trained on older data may be outdated. |
| Structured facts | Evolution trees, base stats, and type matchups are stored in tables, not prose — hard for models to memorize precisely. |
| Ambiguity | Words like “form”, “Mega Evolution”, “TM”, or “base stats” require domain-specific interpretation. |
| Compositional queries | “Which Pokémon evolves into Pikachu?” or “List Charizard’s Mega forms and their base stats” require multiple lookups and reasoning steps. |

When we ask these zero-shot, even the best LLMs often hallucinate:

  • inventing fake evolution lines,

  • mixing up stats across generations,

  • or returning vague, generic answers.

That’s where RAG shines:

  • We retrieve the real Pokémon data (from pokemondb.net in this tutorial).

  • We chunk and embed those markdown pages in a vector database (LanceDB).

  • Then, for each query, we retrieve the most relevant chunks and let the LLM reason grounded in evidence.

So instead of guessing, our agent reads and reasons. This setup scales naturally to enterprise settings — from Pokémon encyclopedias to product catalogs, regulatory documents, or customer knowledge bases.

For our data, we use PokemonDB. We’ll fetch six species (pichu, pikachu, raichu, charizard, mewtwo, slowpoke) and save each page as .md. The pages are served as HTML; we’ll convert them to Markdown for easier chunking.

import requests, pathlib
from markdownify import markdownify as mdify

# Pokémon species to fetch: (name, Pokédex URL)
POKEMON = [
    ("pichu",     "https://pokemondb.net/pokedex/pichu"),
    ("pikachu",   "https://pokemondb.net/pokedex/pikachu"),
    ("raichu",    "https://pokemondb.net/pokedex/raichu"),
    ("charizard", "https://pokemondb.net/pokedex/charizard"),
    ("mewtwo",    "https://pokemondb.net/pokedex/mewtwo"),
    ("slowpoke",  "https://pokemondb.net/pokedex/slowpoke"),
]

def fetch_markdown(url: str) -> str:
    html = requests.get(url, timeout=30).text
    md = mdify(html, heading_style="ATX")
    return md

DATA_DIR = pathlib.Path("./data/pokemon_md")
DATA_DIR.mkdir(parents=True, exist_ok=True)  # ensure the target directory exists before writing

downloaded = []
for name, url in POKEMON:
    md_text = fetch_markdown(url)
    path = DATA_DIR / f"{name}.md"
    path.write_text(md_text, encoding="utf-8")
    downloaded.append((name, str(path), url))

print(f"Saved {len(downloaded)} markdown files → {DATA_DIR}")
Saved 6 markdown files → data\pokemon_md

Let’s see what a sample of this data page looks like.

from IPython.display import Markdown, display
import pathlib

md_path = pathlib.Path("./data/pokemon_md/pikachu.md")
display(Markdown(md_path.read_text(encoding="utf-8")[3000:4000]))  # characters 3000-4000

Pikachu artwork by Ken Sugimori

Additional artwork

Pokédex data

| | | | — | — | | National № | 0025 | | Type | Electric | | Species | Mouse Pokémon | | Height | 0.4 m (1′04″) | | Weight | 6.0 kg (13.2 lbs) | | Abilities | 1. Static Lightning Rod (hidden ability) | | Local № | 0025 (Yellow/Red/Blue) 0022 (Gold/Silver/Crystal) 0156 (Ruby/Sapphire/Emerald) 0025 (FireRed/LeafGreen) 0104 (Diamond/Pearl) 0104 (Platinum) 0022 (HeartGold/SoulSilver) 0036 (X/Y — Central Kalos) 0163 (Omega Ruby/Alpha Sapphire) 0025 (Sun/Moon — Alola dex) 0032 (U.Sun/U.Moon — Alola dex) 0025 (Let’s Go Pikachu/Let’s Go Eevee) 0194 (Sword/Shield) 0104 (Brilliant Diamond/Shining Pearl) 0056 (Legends: Arceus) 0074

Preparing our Knowledge Base#

Now that we’ve downloaded Pokémon data as .md files (for Pikachu, Charizard, Mewtwo, etc.),
we need to split the text into smaller chunks before embedding it into a vector database.

Why?

  • LLMs and embedding models have context limits — we can’t feed an entire document at once.

  • Smaller, semantically coherent chunks help retrieval systems match relevant sections precisely.

  • Chunking also improves Recall@k, latency, and embedding reuse during updates.

We’ll try two common splitting strategies:

| Splitter | Description | When to use |
| --- | --- | --- |
| 🧩 RecursiveCharacterTextSplitter | Splits text purely by length, preserving overlap. | Generic text without structure. |
| 🧱 MarkdownHeaderTextSplitter | Splits along Markdown headers (#, ##, ###), then limits size. | Structured content (docs, wikis, pages like Pokémon DB). |

After chunking, we’ll have two parallel sets of documents:

  • docs_rec: recursively chunked plain text

  • docs_md: structure-aware markdown chunks

These will later be embedded into LanceDB and compared for retrieval quality.

from typing import List, Dict, Any, Optional, Tuple
import os

from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

# --- Chunking params ---
CHUNK_SIZE = 700
CHUNK_OVERLAP = 120

# --- Eval/profiling ---
EVAL_K_LIST = [1, 3, 5]
EMBEDDING_COST_PER_1K = float(os.getenv("EMBED_COST_PER_1K", "0.00013"))  # USD
PRINT_TOP_N = 5

def read_files_as_object_array(directory_path: str) -> List[Dict[str, str]]:
    out = []
    for fname in os.listdir(directory_path):
        fpath = os.path.join(directory_path, fname)
        if os.path.isfile(fpath):
            with open(fpath, "r", encoding="utf-8") as f:
                out.append({"filename": fname, "content": f.read()})
    return out

def recursive_text_splitter(data, chunk_size, overlap_size):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=overlap_size, length_function=len, is_separator_regex=False
    )
    texts = splitter.create_documents(
        [f"{d['filename']}\n{d['content']}" for d in data],
        metadatas=[{"filename": d["filename"]} for d in data],
    )
    return texts

def markdown_splitter(data, chunk_size, overlap_size):
    md_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")], strip_headers=True
    )
    md_splits = []
    for d in data:
        splits = md_splitter.split_text(d["content"])
        for s in splits:
            s.metadata["filename"] = d["filename"]
        md_splits.extend(splits)

    size_limiter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=overlap_size, length_function=len, is_separator_regex=False
    )
    return size_limiter.split_documents(md_splits)

docs_raw = read_files_as_object_array(str(DATA_DIR))
docs_rec = recursive_text_splitter(docs_raw, CHUNK_SIZE, CHUNK_OVERLAP)
docs_md  = markdown_splitter(docs_raw, CHUNK_SIZE, CHUNK_OVERLAP)

print(f"Recursive chunks: {len(docs_rec)} | Markdown+size chunks: {len(docs_md)}")
Recursive chunks: 526 | Markdown+size chunks: 509
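The role of overlap is easiest to see with a bare-bones sliding window. This is a simplified illustration, not what RecursiveCharacterTextSplitter does internally (the real splitter prefers natural separators like paragraphs and sentences), but it shows why neighbouring chunks share context and why the chunks together can contain more characters than the source:

```python
def sliding_chunks(text: str, size: int, overlap: int) -> list[str]:
    # each new chunk starts `size - overlap` characters after the previous one,
    # so neighbouring chunks share `overlap` characters of context
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

chunks = sliding_chunks("abcdefghij", size=4, overlap=2)
print(chunks)                 # neighbouring chunks share 2 characters
print(sum(map(len, chunks)))  # more characters than the 10 in the source
```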

Building our Vector Database#

Before our agent can “retrieve” knowledge, we need a database that understands vectors — numerical representations of text meaning (embeddings). That’s where LanceDB comes in.

🔍 What is LanceDB?#

LanceDB is a lightweight, local-first vector database built on the Lance columnar format.
It’s designed for:

  • Storing and searching high-dimensional embeddings (like text or image vectors).

  • Performing semantic similarity queries (e.g., “find texts most similar to this query”).

  • Hybrid retrieval: combining full-text search (BM25 / Tantivy) and vector search.

  • Speed and simplicity — it runs locally (no separate server needed).

🧠 What we’ll do here#

  1. Embed all Pokémon chunks using an OpenAI-compatible embedding model served via OpenRouter (qwen/qwen3-embedding-8b by default; override with the EMBEDDINGS_MODEL env var).

  2. Create / connect to a LanceDB table named "pokemon_pages".

  3. Insert each chunk’s text, vector, and metadata (like filename & splitter type).

  4. Build a full-text search (FTS) index for keyword lookups alongside vector search.

After this step, we’ll have a ready-to-query LanceDB store — the foundation for our Retrieval-Augmented Generation (RAG) pipeline.

from dotenv import load_dotenv
from openai import OpenAI

import lancedb
import uuid

load_dotenv()

OPENAI_BASE_URL = "https://openrouter.ai/api/v1"

EMBED_MODEL = os.getenv("EMBEDDINGS_MODEL", "qwen/qwen3-embedding-8b")

client = OpenAI(base_url=OPENAI_BASE_URL, api_key=os.getenv('OPENROUTER_API_KEY'))

DB_URI = "./db/sample-lancedb"
TABLE_NAME_TMP = "pokemon_pages_tmp"
TABLE_NAME = "pokemon_pages"

def embed_texts(texts: List[str], model: str = EMBED_MODEL, batch_size: int = 64) -> List[List[float]]:
    """
    Returns a list of embedding vectors. Uses an OpenAI-compatible client pointed at OpenRouter.
    """
    out = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        resp = client.embeddings.create(model=model, input=batch)
        out.extend([e.embedding for e in resp.data])
    return out

db = lancedb.connect(DB_URI)
try:
    tbl = db.open_table(TABLE_NAME_TMP)
    print(f"Loaded LanceDB at {DB_URI} (table={TABLE_NAME_TMP})")
except Exception:  # table doesn't exist yet; build and index it
    all_chunks = []
    for d in docs_rec:
        all_chunks.append({"id": str(uuid.uuid4()), "content": d.page_content,
                        "metadata": {"filename": d.metadata.get("filename",""), "splitter": "recursive"}})
    for d in docs_md:
        all_chunks.append({"id": str(uuid.uuid4()), "content": d.page_content,
                        "metadata": {"filename": d.metadata.get("filename",""), "splitter": "markdown"}})

    print("Embedding chunks...")
    vectors = embed_texts([c["content"] for c in all_chunks])
    for c, v in zip(all_chunks, vectors):
        c["vector"] = v
    tbl = db.create_table(TABLE_NAME_TMP, data=all_chunks)
    tbl.create_fts_index("content")
    print(f"Indexed {len(all_chunks)} chunks into LanceDB at {DB_URI} (table={TABLE_NAME_TMP})")
Loaded LanceDB at ./db/sample-lancedb (table=pokemon_pages_tmp)
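The evaluation that follows calls three search helpers we haven’t shown: perform_vector_search, perform_fts_search, and perform_hybrid_search. Below is one plausible sketch of them against the LanceDB table built above — the exact LanceDB query API may vary across versions, and the optional `pokemon` filter is omitted here. The Reciprocal Rank Fusion helper `rrf_fuse` is pure Python and is the part worth studying.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score each id by the sum of 1/(k + rank) across rankings."""
    scores: dict = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def perform_vector_search(query, pokemon=None, top_k=5):
    # embed the query, then nearest-neighbor search over the "vector" column
    qv = embed_texts([query])[0]
    return tbl.search(qv).limit(top_k).to_list()

def perform_fts_search(query, pokemon=None, top_k=5):
    # keyword search over the FTS index built on "content"
    return tbl.search(query, query_type="fts").limit(top_k).to_list()

def perform_hybrid_search(query, pokemon=None, top_k=5):
    # run both searches, fuse their rankings by chunk id, return the fused top_k
    vec = perform_vector_search(query, pokemon, top_k=top_k * 2)
    fts = perform_fts_search(query, pokemon, top_k=top_k * 2)
    by_id = {r["id"]: r for r in vec + fts}
    fused_ids = rrf_fuse([[r["id"] for r in vec], [r["id"] for r in fts]])
    return [by_id[i] for i in fused_ids[:top_k]]
```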

Evaluating Retrieval Quality#

Once our Pokémon chunks are embedded and searchable, we need to measure how well the retrieval step is working. Even the best LLM can only answer correctly if the right information was fetched first.

🧩 Why Evaluation Matters#

RAG systems rely on two main components:

  1. Retrieval – finding the most relevant chunks from the knowledge base.

  2. Generation – the LLM reasoning over those chunks to answer questions.

If retrieval fails (missing or irrelevant chunks), generation will inevitably fail too — no matter how smart the model is.
That’s why retrieval metrics are critical for diagnosing performance.

📊 Metrics - Coverage, Recall, etc.#

| Metric | What it measures | Why it matters |
| --- | --- | --- |
| Coverage Ratio | How much of the original document text is preserved in the chunked dataset. | Ensures chunking didn’t lose too much information. |
| Recall@k | Whether at least one relevant chunk appears in the top-k retrieved results. | Tests if the search finds what we need (completeness). |
| MRR (Mean Reciprocal Rank) | How early in the ranking the first relevant chunk appears. | Rewards search methods that bring correct answers to the top. |
| Latency (later) | Time taken for each search query. | Balances quality vs speed for production systems. |

In the next cell, we’ll start with coverage statistics — verifying that our chunking step retains the source content for both splitters (recursive and markdown). Note that a coverage ratio slightly above 1.0 is expected: chunk overlap duplicates up to CHUNK_OVERLAP characters at each boundary, so the chunks together contain a bit more text than the originals.
This acts as a sanity check before moving on to deeper retrieval evaluation.
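As a concrete illustration of the two ranking metrics (the evaluation cell further down computes them the same way), suppose the retriever returns three files in order and only one is relevant:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # 1.0 if any relevant item appears in the top-k results, else 0.0
    return 1.0 if any(doc in relevant for doc in retrieved[:k]) else 0.0

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    # 1/rank of the first relevant item; 0.0 if none is retrieved
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["raichu.md", "pikachu.md", "pichu.md"]
relevant = {"pichu.md"}
print(recall_at_k(retrieved, relevant, 1))   # relevant file not in top-1
print(recall_at_k(retrieved, relevant, 3))   # but it is in top-3
print(reciprocal_rank(retrieved, relevant))  # first relevant hit at rank 3
```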

import pandas as pd 

GROUND_TRUTH = {
    "Which Pokémon evolves into Pikachu?": ["pichu.md"],
    "Which Pokémon learns Volt Tackle via breeding/light ball mechanics?": ["pikachu.md", "pichu.md"],
    "Show Mega evolutions for Charizard": ["charizard.md"],
    "Base stats of Mewtwo": ["mewtwo.md"],
    "What is Mewtwo’s base stat total (BST)?": ["mewtwo.md"],
    "What is Slowpoke's type?": ["slowpoke.md"],
    "What moves can Raichu learn by TM?": ["raichu.md"],
}

def coverage_stats(docs_raw, chunks) -> Dict[str, float]:
    total_chars = sum(len(d["content"]) for d in docs_raw)
    chunk_chars = sum(len(c.page_content) for c in chunks)
    return {
        "total_chars": total_chars,
        "chunk_chars": chunk_chars,
        "coverage_ratio": chunk_chars / total_chars if total_chars else 0.0
    }

cov_rec = coverage_stats(docs_raw, docs_rec)
cov_md  = coverage_stats(docs_raw, docs_md)

pd.DataFrame([
    {"splitter": "recursive", **cov_rec},
    {"splitter": "markdown",  **cov_md},
])
splitter total_chars chunk_chars coverage_ratio
0 recursive 249998 264501 1.058012
1 markdown 249998 259329 1.037324
import time
from rich import print as rprint  # rprint is used below for formatted output

def eval_search(queries: List[str], search_fn, ks=(1,3,5)) -> pd.DataFrame:
    rows = []
    for q in queries:
        t0 = time.time()
        results = search_fn(q, top_k=max(ks))
        elapsed = time.time() - t0
        filenames = [r["metadata"]["filename"] for r in results]
        gt = set(GROUND_TRUTH[q])
        recs = {}
        for k in ks:
            recs[f"Recall@{k}"] = 1.0 if any(f in gt for f in filenames[:k]) else 0.0
        rr = 0.0
        for i, f in enumerate(filenames, start=1):
            if f in gt:
                rr = 1.0 / i
                break
        rows.append({"query": q, "latency_ms": round(1000*elapsed,2), "MRR": rr, **recs})
    return pd.DataFrame(rows)

df_vec = eval_search(list(GROUND_TRUTH.keys()), perform_vector_search, ks=tuple(EVAL_K_LIST))
df_fts = eval_search(list(GROUND_TRUTH.keys()), perform_fts_search,    ks=tuple(EVAL_K_LIST))
df_hyb = eval_search(list(GROUND_TRUTH.keys()), perform_hybrid_search, ks=tuple(EVAL_K_LIST))

summary = pd.DataFrame({
    "Method": ["Vector","FTS","Hybrid"],
    "MRR(mean)": [df_vec["MRR"].mean(), df_fts["MRR"].mean(), df_hyb["MRR"].mean()],
    **{f"Recall@{k}(mean)": [df_vec[f"Recall@{k}"].mean(), df_fts[f"Recall@{k}"].mean(), df_hyb[f"Recall@{k}"].mean()] for k in EVAL_K_LIST},
    "Latency(ms, mean)": [df_vec["latency_ms"].mean(), df_fts["latency_ms"].mean(), df_hyb["latency_ms"].mean()],
}).round(3)

rprint("[bold]Per-query (Vector):[/]"); display(df_vec)
rprint("[bold]Per-query (FTS):[/]"); display(df_fts)
rprint("[bold]Per-query (Hybrid):[/]"); display(df_hyb)
rprint("[bold green]Summary:[/]"); display(summary)
Per-query (Vector):
query latency_ms MRR Recall@1 Recall@3 Recall@5
0 Which Pokémon evolves into Pikachu? 586.07 0.0 0.0 0.0 0.0
1 Which Pokémon learns Volt Tackle via breeding/... 501.79 0.5 0.0 1.0 1.0
2 Show Mega evolutions for Charizard 4004.79 1.0 1.0 1.0 1.0
3 Base stats of Mewtwo 464.45 1.0 1.0 1.0 1.0
4 What is Mewtwo’s base stat total (BST)? 533.87 1.0 1.0 1.0 1.0
5 What is Slowpoke's type? 886.77 1.0 1.0 1.0 1.0
6 What moves can Raichu learn by TM? 2282.72 1.0 1.0 1.0 1.0
Per-query (FTS):
query latency_ms MRR Recall@1 Recall@3 Recall@5
0 Which Pokémon evolves into Pikachu? 33.01 1.0 1.0 1.0 1.0
1 Which Pokémon learns Volt Tackle via breeding/... 17.15 1.0 1.0 1.0 1.0
2 Show Mega evolutions for Charizard 0.00 1.0 1.0 1.0 1.0
3 Base stats of Mewtwo 19.67 1.0 1.0 1.0 1.0
4 What is Mewtwo’s base stat total (BST)? 13.68 1.0 1.0 1.0 1.0
5 What is Slowpoke's type? 18.08 1.0 1.0 1.0 1.0
6 What moves can Raichu learn by TM? 15.24 0.0 0.0 0.0 0.0
Per-query (Hybrid):
query latency_ms MRR Recall@1 Recall@3 Recall@5
0 Which Pokémon evolves into Pikachu? 533.85 0.333333 0.0 1.0 1.0
1 Which Pokémon learns Volt Tackle via breeding/... 500.46 1.000000 1.0 1.0 1.0
2 Show Mega evolutions for Charizard 667.26 1.000000 1.0 1.0 1.0
3 Base stats of Mewtwo 515.35 1.000000 1.0 1.0 1.0
4 What is Mewtwo’s base stat total (BST)? 487.06 1.000000 1.0 1.0 1.0
5 What is Slowpoke's type? 4685.00 1.000000 1.0 1.0 1.0
6 What moves can Raichu learn by TM? 491.01 1.000000 1.0 1.0 1.0
Summary:
Method MRR(mean) Recall@1(mean) Recall@3(mean) Recall@5(mean) Latency(ms, mean)
0 Vector 0.786 0.714 0.857 0.857 1322.923
1 FTS 0.857 0.857 0.857 0.857 16.690
2 Hybrid 0.905 0.857 1.000 1.000 1125.713

🔎 Interpreting the Results#

TL;DR: Hybrid wins on quality; FTS wins on speed.

  • Hybrid (MRR=0.90, Recall@3/5=1.0): Best overall retrieval quality. Reciprocal Rank Fusion (RRF) captures semantic matches that FTS misses while still surfacing exact-term hits. Ideal default for general-purpose RAG.

  • Vector (MRR=0.78, Recall@5=0.85, ~1300 ms): Strong semantic coverage—great when users paraphrase. Slightly slower due to embedding + nearest-neighbor search.

  • FTS (MRR=0.85, Recall@k ≤ 0.86, ~17 ms): Blazing fast and excels for exact names, forms, numbers (e.g., “TM”, “Mega”). But it can miss paraphrases or semantic matches.

What to deploy

  • Default: Hybrid.

  • Query routing: Use FTS for quoted phrases/IDs/numerics; otherwise Hybrid.

  • Latency-sensitive paths: FTS with a semantic fallback on low-confidence.
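The routing rule can be made concrete with a small heuristic. The regex below is illustrative, not exhaustive; tune it against your own query logs:

```python
import re

def route_query(query: str) -> str:
    """Pick a retrieval mode: exact-match FTS for quoted phrases,
    TM/HM codes, or multi-digit numbers; Hybrid otherwise."""
    if re.search(r'"[^"]+"|\b(?:TM|HM)\d+\b|\b\d{2,}\b', query):
        return "fts"
    return "hybrid"

print(route_query('Which Pokémon learns "Volt Tackle"?'))  # quoted phrase
print(route_query("Where do I find TM45?"))                # TM code
print(route_query("Which Pokémon evolves into Pikachu?"))  # semantic question
```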

Improving Precision — What is Reranking and Why It Helps#

Even after combining vector and keyword search, our top results may still include partially relevant or redundant chunks. That’s where reranking comes in — a crucial final step in the retrieval pipeline.

🎯 What is Reranking? Reranking means taking the initial set of retrieved results (e.g., top 20) and reordering them using a more accurate relevance model. This model computes a finer-grained similarity between the query and each retrieved chunk.

Common reranking approaches:

  • Embedding-based cosine similarity (lightweight) — compares the query vector with each chunk’s vector (as we’ll do here).

  • Cross-encoder models (heavier) — feed [query, passage] pairs into an LLM or BERT-like model for deeper contextual matching.

💡 Why Reranking Helps

  • First-stage retrieval (vector/FTS/hybrid) is optimized for speed, not precision.

  • Reranking refines the order to push the most semantically aligned chunks to the top, improving MRR and answer faithfulness.

  • It’s especially useful when:

    • Many chunks share overlapping content.

    • The query is nuanced or multi-faceted (e.g., “Mega evolutions and base stats of Charizard”).

    • You plan to feed only a few chunks into the LLM for generation.

In the next cell, we’ll apply a simple cosine-similarity-based reranker that reorders hybrid search results using the query’s embedding — a fast and effective upgrade for small to mid-sized RAG systems.

import numpy as np 

def cosine(a, b):
    a = np.array(a); b = np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def rerank_by_query_vector(query: str, results: List[Dict[str, Any]], top_k: int = 5):
    """
    Rerank retrieved results by cosine similarity between the query
    embedding and each result's stored embedding vector, then
    deduplicate by source file so the context covers distinct pages.
    """
    qv = client.embeddings.create(model=EMBED_MODEL, input=[query]).data[0].embedding
    rescored = sorted(results, key=lambda r: cosine(qv, r["vector"]), reverse=True)

    deduped, seen_files = [], set()
    for r in rescored[:top_k]:
        fname = r["metadata"]["filename"]
        if fname in seen_files:
            continue
        seen_files.add(fname)
        deduped.append(r)
    return deduped

def perform_hybrid_rerank(query: str, pokemon: Optional[str] = None, top_k: int = 5):
    fused = perform_hybrid_search(query, pokemon, top_k=top_k*10)
    return rerank_by_query_vector(query, fused, top_k=top_k)

df_hyr = eval_search(list(GROUND_TRUTH.keys()), perform_hybrid_rerank, ks=tuple(EVAL_K_LIST))

summary = pd.DataFrame({
    "Method": ["Vector","FTS","Hybrid","Reranking"],
    "MRR(mean)": [df_vec["MRR"].mean(), df_fts["MRR"].mean(), df_hyb["MRR"].mean(), df_hyr["MRR"].mean()],
    **{f"Recall@{k}(mean)": [df_vec[f"Recall@{k}"].mean(), df_fts[f"Recall@{k}"].mean(), df_hyb[f"Recall@{k}"].mean(), df_hyr[f"Recall@{k}"].mean()] for k in EVAL_K_LIST},
    "Latency(ms, mean)": [df_vec["latency_ms"].mean(), df_fts["latency_ms"].mean(), df_hyb["latency_ms"].mean(), df_hyr["latency_ms"].mean()],
}).round(3)

rprint("[bold]Per-query (Hybrid + Rerank):[/]"); display(df_hyr)
rprint("[bold green]Summary:[/]"); display(summary)
Per-query (Hybrid + Rerank):
query latency_ms MRR Recall@1 Recall@3 Recall@5
0 Which Pokémon evolves into Pikachu? 2488.60 0.333333 0.0 1.0 1.0
1 Which Pokémon learns Volt Tackle via breeding/... 2057.19 1.000000 1.0 1.0 1.0
2 Show Mega evolutions for Charizard 1164.91 1.000000 1.0 1.0 1.0
3 Base stats of Mewtwo 1143.49 1.000000 1.0 1.0 1.0
4 What is Mewtwo’s base stat total (BST)? 1174.59 1.000000 1.0 1.0 1.0
5 What is Slowpoke's type? 1132.06 1.000000 1.0 1.0 1.0
6 What moves can Raichu learn by TM? 7774.14 1.000000 1.0 1.0 1.0
Summary:
Method MRR(mean) Recall@1(mean) Recall@3(mean) Recall@5(mean) Latency(ms, mean)
0 Vector 0.786 0.714 0.857 0.857 1322.923
1 FTS 0.857 0.857 0.857 0.857 16.690
2 Hybrid 0.905 0.857 1.000 1.000 1125.713
3 Reranking 0.905 0.857 1.000 1.000 2419.283

Takeaway: Hybrid + reranking matches plain Hybrid on this small corpus (MRR 0.905, perfect Recall@3/5) while roughly doubling latency; reranking earns its keep on larger, noisier corpora where first-stage rankings are weaker. In practice, it’s often used as an optional second stage — applied only when the agent is uncertain or when quality matters more than speed.

🔧 Packaging Retrieval as “Tools” for Agents#

Now that we have multiple retrieval strategies — vector, FTS, hybrid — we’ll wrap them into simple, reusable tools that return formatted text contexts.

These tools will later be used by our PydanticAI agent to decide:

  • Which search mode to use (routing),

  • How much context to retrieve, and

  • When to combine multiple sources (reflection and fusion).

Let’s define these tool functions next.

import logfire
import nest_asyncio

nest_asyncio.apply()

logfire.configure(send_to_logfire=False)  # set to True to stream traces to the Logfire console
logfire.instrument_pydantic_ai()

try:
    tbl = db.open_table(TABLE_NAME)
except Exception:  # fall back to the temporary table built earlier
    tbl = db.open_table(TABLE_NAME_TMP)

def build_context_from_results(results: List[Dict[str,Any]]):
    return "\n---\n".join([
        f"Title: {r['metadata']['filename']}\nContent:\n{r['content']}"
        for r in results
    ])

def tool_vector(query: str, k: int = 5) -> str:
    """Vector search"""
    logfire.info(f"Vector search called with query: {query}")
    res = perform_vector_search(query, top_k=k)
    return build_context_from_results(res)

def tool_fts(query: str, k: int = 5) -> str:
    """Full Text Search"""
    logfire.info(f"FTS search called with query: {query}")
    res = perform_fts_search(query, top_k=k)
    return build_context_from_results(res)

def tool_hybrid(query: str, k: int = 5) -> str:
    """Hybrid Search"""
    logfire.info(f"Hybrid search called with query: {query}")
    res = perform_hybrid_search(query, top_k=k)
    return build_context_from_results(res)

def tool_rerank(query: str, k: int = 5) -> str:
    "Reranking Search"
    logfire.info(f"Reranking search called with query: {query}")
    res = perform_hybrid_rerank(query, top_k=k)
    return build_context_from_results(res)

🔮 From Plain LLM to RAG-Enhanced Agent#

Now that our retrieval tools are ready, let’s test how much they actually help the model think.

🧠 Two Agents, Two Worlds#

We’ll create two simple agents using PydanticAI:

| Agent | Description | Data Access |
| --- | --- | --- |
| 🧩 Vanilla Agent | A plain LLM (e.g., Grok-4 or GPT-4) answering directly from its internal training data. | ❌ No external context |
| 📚 RAG Agent | Same model, but grounded with retrieved Pokémon chunks from LanceDB. It must answer only from the provided context. | ✅ Uses hybrid search tool |
⚔️ The Test#

We’ll ask both agents the same question:

“Who has more powerful normal type attack — Charizard or Pikachu?”

The Vanilla Agent relies purely on what it “remembers.”
The RAG Agent, on the other hand, performs:

  1. Retrieval — pulls relevant chunks from our local Pokémon corpus using tool_hybrid.

  2. Grounded generation — answers based strictly on retrieved evidence and cites sources (e.g., [charizard.md]).

This comparison highlights how RAG agents can reduce hallucinations and provide traceable, verifiable answers even with small, domain-specific knowledge bases.

from pydantic_ai import Agent
from pydantic import BaseModel, Field

CHAT_MODEL  = os.getenv("CHAT_MODEL", "openrouter:x-ai/grok-4-fast")

class VanillaAnswer(BaseModel):
    answer: str = Field(description="Concise, factual answer for the given query.")

class RAGAnswer(BaseModel):
    answer: str = Field(description="Concise, factual answer for the given query.")
    used_tool: str = Field(description="Which tool was used: vector | fts | hybrid | rerank")
    citation: str = Field(description="Filename used to generate response.")

vanilla_agent = Agent(
    model=CHAT_MODEL,
    system_prompt=(
        "You are a Pokémon expert. Answer the given questions."
    ),
    output_type=VanillaAnswer,
    retries=3
)

rag_agent = Agent(
    model=CHAT_MODEL,
    system_prompt=(
        "You answer strictly from the provided CONTEXT. If unknown, say 'I don't know from the corpus'. "
        "Always cite the filenames you relied on, e.g., [pikachu.md]."
    ),
    output_type=RAGAnswer,
    tools=[tool_hybrid],
    retries=3
)


q = "Who has more powerful normal type attack - Charizard or Pikachu?"
vanilla_response = vanilla_agent.run_sync(q)

rprint(vanilla_response.output)

rag_response = rag_agent.run_sync(q)

rprint(rag_response)
11:20:50.065 vanilla_agent run
11:20:50.072   chat x-ai/grok-4-fast
VanillaAnswer(answer='Charizard')
11:20:52.825 rag_agent run
11:20:52.825   chat x-ai/grok-4-fast
11:20:55.145   running 1 tool
11:20:55.145     running tool: tool_hybrid
11:20:55.145       Hybrid search called with query: Charizard vs Pikachu normal type attacks power comparison
11:20:56.037   chat x-ai/grok-4-fast
11:20:59.007   running 1 tool
11:20:59.007     running tool: tool_hybrid
11:20:59.007       Hybrid search called with query: Pikachu normal type moves power
11:21:06.900   chat x-ai/grok-4-fast
11:21:09.728   running 1 tool
11:21:09.730     running tool: tool_hybrid
11:21:09.730       Hybrid search called with query: Pikachu moves list normal type power
11:21:10.272   chat x-ai/grok-4-fast
AgentRunResult(
    output=RAGAnswer(
        answer="Charizard has a more powerful Normal-type attack, with Slash at 70 power, while Pikachu's 
Normal-type moves are not specified in the provided context.",
        used_tool='hybrid',
        citation='charizard.md'
    )
)

In the logs above, you can see a clear difference:

  • 🧩 Vanilla Agent:
    Answered in one word (“Charizard”) straight from its pretrained world knowledge, with no evidence or citation.
    It has no access to our curated Pokémon corpus, so its responses may be outdated, unverifiable, or hallucinated.

  • 📚 RAG Agent (Correct Answer):
    Retrieved the Charizard and Pikachu entries from our LanceDB knowledge base, analyzed their base attack stats,
    and correctly identified that Charizard has the stronger Normal-type attack, with a source citation (e.g., [charizard.md]).

This demonstrates the core benefit of Retrieval-Augmented Generation:

  • It grounds responses in real, verifiable data.

  • It produces contextually correct and source-traceable answers.

  • It reduces hallucinations and improves trustworthiness — especially in factual, domain-specific tasks.

In short, the RAG agent doesn’t guess — it knows where to look.

In the next cell, we’ll run the same question again and see how the Reranking Agent responds.
Note that higher retrieval precision doesn’t automatically produce a better final answer: in the run below, the reranker’s per-file deduplication leaves fewer chunks in context, and the agent fails to find the move data it needs, so it declines to answer.

reranking_agent = Agent(
    model=CHAT_MODEL,
    system_prompt=(
        "You answer strictly from the provided CONTEXT.  "
        "Always cite the filenames you relied on, e.g., [pikachu.md]."
    ),
    output_type=RAGAnswer,
    tools=[tool_rerank],
    retries=3
)


q = "Who has more powerful normal type attack - Charizard or Pikachu?"
rag_response = reranking_agent.run_sync(q)

rprint(rag_response)
15:29:43.530 reranking_agent run
15:29:43.532   chat x-ai/grok-4-fast
15:29:45.807   running 1 tool
15:29:45.808     running tool: tool_rerank
15:29:45.809       Reranking search called with query: Charizard Pikachu normal type moves power comparison
15:29:48.467   chat x-ai/grok-4-fast
15:29:52.070   running 1 tool
15:29:52.080     running tool: tool_rerank
15:29:52.080       Reranking search called with query: Charizard learnable normal type moves power
15:29:54.960   chat x-ai/grok-4-fast
15:29:57.605   running 1 tool
15:29:57.605     running tool: tool_rerank
15:29:57.606       Reranking search called with query: Pikachu learnable normal type moves power
15:29:59.481   chat x-ai/grok-4-fast
AgentRunResult(
    output=RAGAnswer(
        answer='Based on the available context, Pikachu\'s type effectiveness chart shows normal type moves are 
neutral against it, but no specific normal type moves or powers for either Pokémon are detailed in the provided 
files. Therefore, I cannot determine which has the more powerful normal type attack from the context 
alone.</parameter name>\\n<parameter name="used_tool">hybrid',
        used_tool='vector',
        citation='pikachu.md'
    )
)

🧠 Building a Smarter Agent — Multi-Tool Retrieval and Dynamic Reasoning#

So far, we’ve seen each retrieval method in isolation — vector, keyword, hybrid, and reranking. But real-world questions vary in structure: some are factual, some numeric, some semantic. No single search method fits them all.

🛠️ Enter the Multi-Tool Agent#

In this step, we give our RAG agent access to all retrieval tools:

  • 🔡 tool_fts → for exact terms (e.g., “TM45” or “Base stats”).

  • 🧠 tool_vector → for meaning-based matches and paraphrases.

  • tool_hybrid → for balanced performance.

  • 🎯 tool_rerank → for highest-precision reranked retrieval.

The agent can now choose the best tool dynamically based on query type and context — an early example of tool orchestration or self-routing.

This brings us closer to a true agentic RAG system — one that reasons about how to reason.

multitool_agent = Agent(
    model=CHAT_MODEL,
    system_prompt=(
        "You answer strictly from the provided CONTEXT.  "
        "Always cite the filenames you relied on, e.g., [pikachu.md]."
    ),
    output_type=RAGAnswer,
    tools=[tool_fts, tool_vector, tool_hybrid, tool_rerank],
    retries=3
)


q = "Who has more powerful normal type attack - Charizard or Pikachu?"
rag_response = multitool_agent.run_sync(q)

rprint(rag_response)
15:30:02.783 multitool_agent run
15:30:02.785   chat x-ai/grok-4-fast
15:30:08.556   running 1 tool
15:30:08.556     running tool: tool_hybrid
15:30:08.556       Hybrid search called with query: Charizard normal type moves base power
15:30:09.484   chat x-ai/grok-4-fast
15:30:11.066   running 1 tool
15:30:11.067     running tool: tool_hybrid
15:30:11.069       Hybrid search called with query: Pikachu normal type moves base power
15:30:22.920   chat x-ai/grok-4-fast
15:30:25.104   running 1 tool
15:30:25.104     running tool: tool_hybrid
15:30:25.104       Hybrid search called with query: Charizard learnable Normal type moves base power
15:30:25.858   chat x-ai/grok-4-fast
15:30:29.492   running 1 tool
15:30:29.492     running tool: tool_hybrid
15:30:29.492       Hybrid search called with query: Pikachu learnable Normal type moves base power
15:30:30.251   chat x-ai/grok-4-fast
AgentRunResult(
    output=RAGAnswer(
        answer="Charizard has a more powerful Normal-type attack than Pikachu. Charizard's base Attack stat is 84, 
compared to Pikachu's 55, and it learns stronger physical Normal moves like Slash (70 base power) by level-up, 
while Pikachu's strongest level-up Normal move is Quick Attack (40 base power). Even for shared moves like Hyper 
Beam (150 base power via TM), Charizard's higher Attack stat results in greater overall power.",
        used_tool='hybrid',
        citation='charizard.md'
    )
)

🧩 Contextualised Retrieval — Using an LLM to Summarize Retrieved Evidence#

So far, our agents have pulled relevant chunks from LanceDB and fed them as-is into the answering model. However, as context grows, simply concatenating text leads to redundancy, token waste, and sometimes distracting noise. To address this, we introduce Contextualised Retrieval — a smarter approach where a small LLM acts as a retrieval summarizer.

🧠 How this works#

  1. Retrieve: The agent first collects top-k chunks via hybrid search.

  2. Summarize: A lightweight retrieval assistant LLM processes these chunks and condenses them into a focused summary.

  3. Augment: The final answering agent then uses this context summary plus the original chunks for grounded reasoning.

I highly recommend going through Anthropic’s Guide on Contextual Retrieval and the article by Wang et al. (2025).

🎯 Why this matters#

  • Reduces token and latency overhead by summarizing only key attributes (types, evolutions, base stats).

  • Improves signal-to-noise ratio, especially when multiple retrieved chunks overlap.

  • Enables a more scalable agentic retrieval loop, where the model reflects on retrieved context before reasoning.

However, this also comes at the cost of one LLM call per search.

In the next cell, we’ll define:

  • build_context_from_results_via_llm() → uses an LLM to synthesize a compact, focused context summary.

  • tool_hybrid_contextualised() → wraps hybrid retrieval + summarization as a single callable tool.

We’ll then run our Contextualised Agent to answer the same question — expect shorter, sharper answers with clear citations and improved factual consistency.

def build_context_from_results_via_llm(query: str, results: List[Dict[str, Any]]) -> str:
    combined = build_context_from_results(results)

    retrieval_agent = Agent(
        model=CHAT_MODEL,
        system_prompt=(
            "You are a retrieval assistant helping an LLM ground its reasoning. "
            "Given the retrieved Pokémon entries below, summarize only the most relevant "
            "details and context in 3–5 concise sentences. Focus on types, evolutions, "
            "base stats, and notable traits that help answer factual questions.\n\n"
            f"Input query:\n {query}\n\n"
            f"Retrieved context:\n {combined}"
        ),
        retries=3
    )

    # The query and chunks already live in the system prompt, so the user message can be empty.
    summary = retrieval_agent.run_sync("").output
    logfire.info(f"Summary returned: {summary}")
    # The summary leads; the raw chunks are appended for grounding (drop them to save tokens).
    return f"### Context Summary\n{summary}\n\n---\n### Full Retrieved Chunks\n{combined}"


def tool_hybrid_contextualised(query: str, k: int = 5) -> str:
    """Hybrid retrieval followed by an LLM pass that condenses the results."""
    logfire.info(f"Contextual Retrieval search called with query: {query}")
    res = perform_hybrid_search(query, top_k=k)
    return build_context_from_results_via_llm(query, res)

contextual_agent = Agent(
    model=CHAT_MODEL,
    system_prompt=(
        "You answer strictly from the provided CONTEXT.  "
        "Always cite the filenames you relied on, e.g., [pikachu.md]."
    ),
    output_type=RAGAnswer,
    tools=[tool_hybrid_contextualised],
    retries=3
)


q = "Who has more powerful normal type attack - Charizard or Pikachu?"
rag_response = contextual_agent.run_sync(q)

rprint(rag_response)
15:52:31.843 contextual_agent run
15:52:31.845   chat x-ai/grok-4-fast
15:52:34.406   running 1 tool
15:52:34.406     running tool: tool_hybrid_contextualised
15:52:34.406       Contextual Retrieval search called with query: Charizard Pikachu normal type attack power comparison
15:52:35.609       retrieval_agent run
15:52:35.609         chat x-ai/grok-4-fast
15:52:41.747       Summary returned: Pikachu is an Electric-type Pokémon with neutral effectiveness... compared to Pikachu's unevolved state without provided stats.
15:52:41.769   chat x-ai/grok-4-fast
15:52:45.984   running 1 tool
15:52:45.984     running tool: tool_hybrid_contextualised
15:52:45.984       Contextual Retrieval search called with query: Pikachu base stats attack
15:52:48.177       retrieval_agent run
15:52:48.177         chat x-ai/grok-4-fast
15:52:54.938       Summary returned: Pikachu is an Electric-type Pokémon that evolves from Pichu an...tack stat (standardly known as 55, though not confirmed here).
15:52:54.943   chat x-ai/grok-4-fast
15:52:57.483   running 1 tool
15:52:57.484     running tool: tool_hybrid_contextualised
15:52:57.485       Contextual Retrieval search called with query: Pikachu base Attack stat
15:52:58.178       retrieval_agent run
15:52:58.178         chat x-ai/grok-4-fast
15:53:05.286       Summary returned: Pikachu is an Electric-type Pokémon that evolves from Pichu an...etrieved entry, focusing instead on general Pokédex structure.
15:53:05.301   chat x-ai/grok-4-fast
AgentRunResult(
    output=RAGAnswer(
        answer="Charizard has a more powerful Normal-type attack due to its higher base Attack stat of 84 (maximum 
293 at level 100), compared to Pikachu's unevolved state with no stats provided in the context, implying lower 
offensive capability for physical attacks like Normal-type moves.",
        used_tool='hybrid',
        citation='[charizard.md]'
    )
)

🔄 Multi-Query Retrieval — Expanding Recall Through Paraphrased Queries#

Even the best retrievers can miss information if the query wording doesn’t match the phrasing in the knowledge base. For example, “Who has stronger normal attacks?” and “Which Pokémon hits harder with normal moves?” express the same intent — but may retrieve different chunks due to surface-level differences in tokens and structure.

To make our system more robust, we introduce Multi-Query Retrieval, also known as Query Augmentation or Multi-Vector RAG.

🧩 How it works#

  1. Generate paraphrases — An auxiliary query-rewriting agent produces multiple semantically equivalent versions of the input question.

  2. Retrieve per variant — Each variation runs its own hybrid search in LanceDB.

  3. Merge and deduplicate — Retrieved results are combined and deduplicated to form a richer, more complete context.

This strategy helps the system:

  • Capture lexical and syntactic diversity in stored text.

  • Improve Recall@k and coverage, especially for sparse or under-represented phrasing.

  • Provide redundant grounding, which stabilizes the final generation step.

I highly recommend looking at the work by Kostric and Balog (2024) on this.

⚙️ What this code does#

  • tool_multiquery()
    → Generates paraphrased queries via a query-rewriting agent, retrieves hybrid results for each, and merges them.

  • multiquery_agent
    → Uses this tool to answer the same comparison question while grounding on a broader semantic context.

This approach trades a small latency increase for higher recall and resilience, bringing our RAG pipeline closer to modern multi-query ensemble systems used in production LLM retrieval frameworks.

from itertools import chain

class QueryVariations(BaseModel):
    variations: List[str]

def tool_multiquery(query: str, num_variations: int = 3, k: int = 5) -> str:
    """Run RAG with multiple paraphrased query variants to improve robustness."""
    # Step 1: Generate paraphrases of the input query
    variation_agent = Agent(
        model=CHAT_MODEL,
        system_prompt=(
            "You are a query rewriting assistant. Given a question, produce "
            f"{num_variations} short paraphrases that preserve meaning but vary wording."
        ),
        output_type=QueryVariations,
        retries=2
    )
    variations = variation_agent.run_sync(query).output.variations
    logfire.info(f"Variations: {variations}")
    queries = [query] + variations

    # Step 2: Retrieve results for all query variants
    all_results = list(chain.from_iterable(perform_hybrid_search(q, top_k=k) for q in queries))

    # Step 3: Deduplicate by content hash and merge
    unique_results = {hash(r["content"]): r for r in all_results}.values()

    return build_context_from_results(list(unique_results))

multiquery_agent = Agent(
    model=CHAT_MODEL,
    system_prompt=(
        "You answer strictly from the provided CONTEXT.  "
        "Always cite the filenames you relied on, e.g., [pikachu.md]."
    ),
    output_type=RAGAnswer,
    tools=[tool_multiquery],
    retries=3
)

q = "Who has more powerful normal type attack - Charizard or Pikachu?"
rag_response = multiquery_agent.run_sync(q)

rprint(rag_response)
15:54:49.332 multiquery_agent run
15:54:49.333   chat x-ai/grok-4-fast
15:54:52.268   running 1 tool
15:54:52.268     running tool: tool_multiquery
15:54:52.283       variation_agent run
15:54:52.284         chat x-ai/grok-4-fast
15:54:55.020       Variations: ['Between Charizard and Pikachu in Pokémon, which has the stro... Pikachu, which Pokémon has the mightier normal type attack?']
15:55:02.756   chat x-ai/grok-4-fast
15:55:07.015   running 1 tool
15:55:07.015     running tool: tool_multiquery
15:55:07.015       variation_agent run
15:55:07.015         chat x-ai/grok-4-fast
15:55:09.834       Variations: ["What are Pikachu's base stats in Pokémon?", "Pikachu's basic stats in the Pokémon games", 'Base stats of Pikachu in Pokémon']
15:55:12.920   chat x-ai/grok-4-fast
15:55:15.711   running 1 tool
15:55:15.711     running tool: tool_multiquery
15:55:15.714       variation_agent run
15:55:15.715         chat x-ai/grok-4-fast
15:55:18.375       Variations: ["What are Pikachu's base Attack and Special Attack stats in P...Pokémon, Pikachu's base stats for Attack and Special Attack?"]
15:55:25.551   chat x-ai/grok-4-fast
AgentRunResult(
    output=RAGAnswer(
        answer="Charizard has a more powerful Normal-type attack than Pikachu, as its base Attack stat (84) is 
higher than Pikachu's (55), and Normal-type moves like Tackle are physical, relying on the Attack stat.",
        used_tool='vector',
        citation='charizard.md'
    )
)
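The merge step above deduplicates by content hash and otherwise treats every retrieved chunk equally. A common refinement in multi-query ensembles is Reciprocal Rank Fusion (RRF), which boosts chunks that rank highly across several query variants. A minimal sketch, using the notebook's `{"content": ...}` result shape and the conventional `k = 60` smoothing constant:

```python
from collections import defaultdict
from typing import Dict, List

def rrf_merge(result_lists: List[List[Dict]], k: int = 60) -> List[Dict]:
    """Fuse ranked result lists: score(chunk) = sum over lists of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    by_content: Dict[str, Dict] = {}
    for results in result_lists:
        for rank, r in enumerate(results, start=1):
            scores[r["content"]] += 1.0 / (k + rank)
            by_content.setdefault(r["content"], r)
    ordered = sorted(scores, key=scores.get, reverse=True)
    return [by_content[c] for c in ordered]

# Two query variants retrieved overlapping chunks; "A" appears in both lists.
run1 = [{"content": "A"}, {"content": "B"}]
run2 = [{"content": "C"}, {"content": "A"}]
merged = rrf_merge([run1, run2])
print([r["content"] for r in merged])  # → ['A', 'C', 'B'] — "A" wins on cross-variant agreement
```

Swapping this in for the hash-based deduplication in `tool_multiquery` would order the merged context by cross-variant agreement instead of retrieval order.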

🔁 Iterative Retrieval with FLARE#

So far, all our retrieval methods (hybrid, reranking, multi-query) assumed that a single retrieval pass is enough. But what if the question is underspecified or requires multi-hop reasoning — e.g., connecting multiple facts across Pokémon pages?

In such cases, a model needs to:

  1. Recognize when information is missing, and

  2. Formulate follow-up retrievals to fill those gaps.

This idea leads us to FLARE — Forward-Looking Active Retrieval augmented generation.

⚙️ How FLARE works#

Instead of doing one retrieval step, the agent operates in a loop:

  1. The model analyzes the current context and identifies information needs (needs).

  2. If evidence is missing, it generates new sub-queries (like “Charizard base attack stat” or “Pikachu move power”).

  3. The system performs hybrid retrieval for each need and expands the context.

  4. Once enough evidence is gathered, it produces a final grounded answer (final_answer).

This structured, multi-step reasoning loop makes retrieval adaptive and self-aware, reducing hallucination risk. See the paper by Jiang et al. (2023).

🧠 Key concepts demonstrated#

| Concept | Description |
| --- | --- |
| 🪞 Self-reflective retrieval | The model inspects its own context and identifies missing information. |
| 🔄 Iterative retrieval loop | It autonomously issues and resolves follow-up queries. |
| 📑 Structured reasoning schema | Outputs are typed (needs, final_answer), ensuring interpretability. |

🔬 What this code does#

  • Defines FLAREAnswer → a structured schema with two fields: needs (follow-up queries) and final_answer (final grounded output).

  • Implements flare_agent → a PydanticAI agent that follows the FLARE reasoning pattern.

  • Defines flare_answer() → runs the adaptive retrieval loop up to max_steps, adding new context at each iteration.

By the end, you’ll see how the agent autonomously plans, retrieves, and finalizes answers — a crucial building block toward fully agentic RAG systems that can think before they answer.

from typing import List, Optional

# Structured output: NEEDs + optional final field
class FLAREAnswer(BaseModel):
    needs: List[str] = Field(default_factory=list, description="Follow-up retrieval queries.")
    final_answer: Optional[str] = Field(default=None, description="Final answer when sufficient evidence.")

# Ask the model to fill the structured schema directly
flare_agent = Agent(
    model="openrouter:google/gemini-2.5-pro",
    system_prompt=(
        "You answer strictly from CONTEXT.\n"
        "- If info is missing, populate `needs` with 1-3 short search queries.\n"
        "- When sufficient evidence is present in the CONTEXT, leave `needs` empty and write `final_answer`.\n"
        "Do not invent facts; cite filenames in the answers like [pikachu.md]."
    ),
    output_type=FLAREAnswer,
    retries=2,
)

def flare_answer(question: str, max_steps: int = 3, per_need_k: int = 5) -> tuple[FLAREAnswer, List[str]]:
    context, used = "", []
    for _ in range(max_steps):
        msg = f"CONTEXT:\n{context}\n\nQUESTION: {question}\n"
        out = flare_agent.run_sync(msg).output

        # Done when no needs remain (guard against a literal 'null' string from the model)
        if not out.needs or (out.final_answer and out.final_answer != 'null'):
            return out, used

        # Retrieve for each needed query and expand the context
        new_ctx = []
        for q in out.needs:
            used.append(q)
            new_ctx.append(f"QUERY: {q}.\n\nRESPONSE:{tool_hybrid(q, k=per_need_k)}")
        context += ("\n\n" + "\n\n".join(new_ctx)) if new_ctx else ""

    # Last attempt: ask for a final answer with the accumulated context
    final = flare_agent.run_sync(f"CONTEXT:\n{context}\n\nQUESTION: {question}\n").output
    return final, used

# --- Example ---
q = "Who has more powerful normal type attack - Charizard or Pikachu?"
res, used = flare_answer(q)

rprint("Final:", res.final_answer)
rprint("Needs issued:", res.needs)
rprint("Used queries:", used)
16:47:57.417 flare_agent run
16:47:57.417   chat google/gemini-2.5-pro
16:48:03.617 Hybrid search called with query: Pikachu normal type attacks
16:48:04.817 Hybrid search called with query: Charizard normal type attacks
16:48:05.505 flare_agent run
16:48:05.505   chat google/gemini-2.5-pro
16:48:17.112 Hybrid search called with query: Pikachu base attack stat
16:48:18.197 Hybrid search called with query: Pikachu normal type moves
16:48:18.900 Hybrid search called with query: Charizard normal type moves
16:48:21.617 flare_agent run
16:48:21.617   chat google/gemini-2.5-pro
Final: Charizard has a more powerful normal type attack than Pikachu. Charizard can learn Slash, which has a power 
of 70, while Pikachu's most powerful normal type attack is Tackle, with a power of 40.  Additionally, Charizard's 
base attack stat is 84, compared to Pikachu's 55 .
Needs issued:
[]
Used queries:
[
    'Pikachu normal type attacks',
    'Charizard normal type attacks',
    'Pikachu base attack stat',
    'Pikachu normal type moves',
    'Charizard normal type moves'
]

🧭 Self-RAG — Self-Reflective Retrieval and Generation#

We’ve now seen how agents can perform adaptive retrieval loops (FLARE) and multi-query reasoning. The next frontier in agentic RAG is self-assessment — teaching the model to critique its own answers and refine them automatically. This is the core idea behind Self-RAG (Self-Reflective Retrieval-Augmented Generation).

🧠 What is Self-RAG?#

Self-RAG (Asai et al., 2023) introduces a closed-loop system where the LLM not only retrieves and answers, but also evaluates the quality of its own reasoning using structured feedback signals.

In this setup:

  1. The Generator agent (gen) produces an answer grounded in retrieved context.

  2. The Critic agent (crit) reviews that answer for:

    • Correctness score (0–1): how well the evidence backs the answer.

    • Hallucination risk: likelihood of unsupported or fabricated information.

    • Citation sufficiency: whether the cited documents justify the claim.

    • Missing evidence queries: follow-up retrievals needed to strengthen the answer.

  3. If the critic identifies gaps, the system issues additional retrievals, expands the context, and retries — iterating until confidence crosses a threshold or the loop limit is reached.

⚙️ What this code does#

  • Defines two structured outputs:

    • Ans → stores the answer, citations, and retrieval tool used.

    • Crit → stores evaluation metrics and follow-up needs.

  • Creates two agents:

    • gen (generator) — answers based on context.

    • crit (critic) — evaluates the generator’s response.

  • Implements selfrag() — a multi-turn self-reflective retrieval loop combining both:

    • The generator writes → the critic reviews → retrieval expands → iteration continues.

  • The process stops when:

    • The correctness score ≥ threshold (th),

    • Citations are adequate, and

    • Hallucination risk is low.

To learn more, go through the paper by Asai et al. (2023).

🎯 Why Self-RAG matters#

Self-RAG represents a step toward autonomous retrieval governance — systems that know what they don’t know and can ask the right follow-up questions. It reduces hallucinations, improves factual grounding, and creates interpretable reasoning logs (history).

In the next cell, we’ll run selfrag() on the Charizard vs Pikachu question and observe how the model iteratively critiques, retrieves, and converges to a reliable, cited answer.

import json 

class Ans(BaseModel):
    answer: str
    citations: List[str] = []
    used_tool: str = "hybrid"

class Crit(BaseModel):
    correctness_score: float = Field(description="How good the generation is as a float between 0 and 1")
    hallucination_risk: str
    citation_ok: bool
    missing_evidence_queries: List[str] = []

gen = Agent(
    model="openrouter:google/gemini-2.5-pro",
    system_prompt="Answer strictly from CONTEXT; if unknown say so. Cite filenames like [pikachu.md].",
    output_type=Ans,
    retries=3
)

crit = Agent(
    model="openrouter:google/gemini-2.5-pro",
    system_prompt=("Score correctness of ANSWER from CONTEXT (0-1), flag hallucination (low|medium|high), "
                   "whether citations suffice, and list up to 3 short follow-up queries. Correctness should be high only if the generation answers the query with required facts."),
    output_type=Crit, 
    retries=3
)

def selfrag(q: str, loops: int = 3, th: float = 0.8, k_init: int = 5, k_need: int = 5):
    ctx, hist, used = tool_hybrid(q, k_init), [], []
    for step in range(1, loops + 1):
        a = gen.run_sync(f"CONTEXT:\n{ctx}\n\nQUESTION: {q}").output
        c = crit.run_sync(f"CONTEXT:\n{ctx}\n\nQUESTION:\n{q}\n\nANSWER:\n{a.answer}").output

        state = {"step": step, "correctness": c.correctness_score, "risk": c.hallucination_risk,
                 "citation_ok": c.citation_ok, "missing": c.missing_evidence_queries, "generation": a.answer}

        logfire.info(json.dumps(state, indent=4).replace("{", "").replace("}", ""))

        hist.append(state)

        # Stop once the critic is satisfied: high score, adequate citations, low risk, no open needs.
        # Emptiness must be tested with `not`; `is []` always fails (identity against a fresh list).
        if (c.correctness_score >= th and c.citation_ok
                and c.hallucination_risk.lower() == "low" and not c.missing_evidence_queries):
            return {"final": a.answer, "used_tool": a.used_tool, "used_queries": used, "history": hist}

        for need in c.missing_evidence_queries:
            used.append(need)
            ctx += "\n\n" + tool_hybrid(need, k_need)
        if not c.missing_evidence_queries:  # low confidence but no needs → broaden once
            ctx += "\n\n" + tool_hybrid(q, max(3, k_need // 2))

    return {"final": a.answer, "used_tool": "hybrid",
            "used_queries": used, "history": hist, "note": "Stopped at max loops."}

q = "Who has more powerful normal type attack - Charizard or Pikachu?"
res = selfrag(q)

rprint("Final:\n", res["final"])
rprint("Used queries:\n", res["used_queries"])
rprint("History:\n", [{k: h[k] for k in ("step","correctness","risk","citation_ok")} for h in res["history"]])
17:10:33.784 Hybrid search called with query: Who has more powerful normal type attack - Charizard or Pikachu?
17:10:34.851 gen run
17:10:34.851   chat google/gemini-2.5-pro
17:10:40.378 crit run
17:10:40.378   chat google/gemini-2.5-pro
17:10:50.437 
    "step": 1,
    "correctness": 1.0,
    "risk": "low",
    "citation_ok": true,
    "missing": [
        "Pikachu attack stats",
        "Charizard attack stats",
        "Pikachu vs Charizard attack power"
    ],
    "generation": "I am sorry, but this document does not contain the answer to this question. \n"

17:10:50.437 Hybrid search called with query: Pikachu attack stats
17:10:52.099 Hybrid search called with query: Charizard attack stats
17:10:53.606 Hybrid search called with query: Pikachu vs Charizard attack power
17:10:54.324 gen run
17:10:54.324   chat google/gemini-2.5-pro
17:11:01.248 crit run
17:11:01.248   chat google/gemini-2.5-pro
17:11:12.324 
    "step": 2,
    "correctness": 1.0,
    "risk": "low",
    "citation_ok": true,
    "missing": [
        "Who has a higher Special Attack - Charizard or Pikachu?",
        "What are Charizard's types?",
        "What is Pikachu's base speed stat?"
    ],
    "generation": "Charizard has a base Attack stat of 84, while Pikachu has a base Attack stat of 55. Therefore, Charizard has a more powerful normal type attack."

17:11:12.324 Hybrid search called with query: Who has a higher Special Attack - Charizard or Pikachu?
17:11:13.391 Hybrid search called with query: What are Charizard's types?
17:11:14.576 Hybrid search called with query: What is Pikachu's base speed stat?
17:11:15.210 gen run
17:11:15.210   chat google/gemini-2.5-pro
17:11:28.238 crit run
17:11:28.238   chat google/gemini-2.5-pro
17:11:40.624 
    "step": 3,
    "correctness": 1.0,
    "risk": "low",
    "citation_ok": true,
    "missing": [
        "What is the difference between Pikachu and Partner Pikachu?",
        "What is Charizard's Special Attack stat?",
        "Which Pok\u00e9mon is faster, Charizard or Pikachu?"
    ],
    "generation": "Charizard has a more powerful normal attack than Pikachu, with a base attack stat of 84, compared to Pikachu's 55. [charizard.md, pikachu.md] Partner Pikachu, however, has an attack stat of 80. [pikachu.md]"

17:11:40.624 Hybrid search called with query: What is the difference between Pikachu and Partner Pikachu?
17:11:42.834 Hybrid search called with query: What is Charizard's Special Attack stat?
17:11:43.506 Hybrid search called with query: Which Pokémon is faster, Charizard or Pikachu?
Final:
 Charizard has a more powerful normal attack than Pikachu, with a base attack stat of 84, compared to Pikachu's 55.
Partner Pikachu, however, has an attack stat of 80. 
Used queries:

[
    'Pikachu attack stats',
    'Charizard attack stats',
    'Pikachu vs Charizard attack power',
    'Who has a higher Special Attack - Charizard or Pikachu?',
    "What are Charizard's types?",
    "What is Pikachu's base speed stat?",
    'What is the difference between Pikachu and Partner Pikachu?',
    "What is Charizard's Special Attack stat?",
    'Which Pokémon is faster, Charizard or Pikachu?'
]
History:

[
    {'step': 1, 'correctness': 1.0, 'risk': 'low', 'citation_ok': True},
    {'step': 2, 'correctness': 1.0, 'risk': 'low', 'citation_ok': True},
    {'step': 3, 'correctness': 1.0, 'risk': 'low', 'citation_ok': True}
]

🧩 Late Chunking — Adaptive Retrieval Without Preprocessing Overhead#

Up to this point, all our retrieval methods assumed we had pre-chunked the entire corpus ahead of time. While effective, this approach can be wasteful — it embeds and stores thousands of text fragments even for small knowledge bases. In large-scale systems, pre-chunking becomes expensive in both storage and embedding cost.

To overcome this, we now explore Late Chunking — also called Dynamic or On-Demand Chunking.

⚙️ What is Late Chunking?#

Instead of embedding every document in advance, we:

  1. Embed entire documents at a coarse level (1 vector per document).

  2. When a query arrives:

    • Rank documents by similarity to the query embedding.

    • Select top-N documents likely to contain relevant information.

    • Chunk and embed only those documents, then rerank their chunks by semantic similarity.

  3. Return the top-k most relevant chunks as context for the answering agent.

This approach shifts the chunking process after initial retrieval — hence the name Late Chunking.

💡 Why it helps#

| Benefit | Explanation |
| --- | --- |
| 💰 Efficiency | Only a few documents are chunked and embedded per query → major cost savings. |
| ⚡ Speed | Avoids loading or embedding a large number of irrelevant chunks. |
| 🎯 Precision | Focuses chunking effort on documents already deemed semantically relevant. |
| 🧠 Scalability | Suitable for large corpora or dynamic datasets (e.g., fresh documents, evolving knowledge bases). |

More details in the paper by Günther et al. (2024).

🧠 What this code does#

  1. Builds lightweight document-level embeddings (1 vector per .md file).

  2. Defines late_chunk_search():

    • Retrieves top documents based on query–doc similarity.

    • Chunks only those documents and embeds them on the fly.

    • Reranks chunks to surface the most semantically relevant passages.

  3. Wraps it as a tool late_chunk_context() used by latechunking_agent.

It typically achieves similar accuracy with a fraction of the compute and memory footprint.

def _embed(texts: List[str]) -> List[List[float]]:
    return [e.embedding for e in client.embeddings.create(model=EMBED_MODEL, input=texts).data]

def _cos(a, b) -> float:
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def load_docs(dir_path: str) -> List[Dict]:
    return [{"filename": f, "content": open(os.path.join(dir_path, f), encoding="utf-8").read()}
            for f in os.listdir(dir_path) if f.endswith(".md")]

# 1) Build a lightweight doc-level index in memory (no pre-chunking)
docs = load_docs(str(DATA_DIR))
doc_vecs = _embed([d["content"] for d in docs])

# 2) Late-chunk retrieval: rank docs by query→doc similarity, then chunk only top docs and rerank chunks
def late_chunk_search(query: str, top_docs=3, chunk_size=700, overlap=120, top_chunks=6) -> List[Dict]:
    logfire.info(f'Late chunking search with query: {query}')
    qv = _embed([query])[0]

    # rank full documents
    doc_scores = [(_cos(qv, v), i) for i, v in enumerate(doc_vecs)]
    top_doc_idxs = [i for _, i in sorted(doc_scores, reverse=True)[:top_docs]]

    # chunk only selected docs
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=overlap)
    chunks, metas = [], []
    for i in top_doc_idxs:
        for c in splitter.split_text(docs[i]["content"]):
            chunks.append(c)
            metas.append({"filename": docs[i]["filename"]})

    # rerank chunks by query similarity (embed chunks once)
    chunk_vecs = _embed(chunks) if chunks else []
    ranked = sorted(
        [{"content": c, "metadata": m, "_score": _cos(qv, v)} for c, m, v in zip(chunks, metas, chunk_vecs)],
        key=lambda x: x["_score"], reverse=True
    )
    
    return ranked[:top_chunks]

# 3) Build a context string for your RAG agent
def late_chunk_context(query: str, **kwargs) -> str:
    hits = late_chunk_search(query, **kwargs)
    return build_context_from_results(hits)


latechunking_agent = Agent(
    model=CHAT_MODEL,
    system_prompt=(
        "You answer strictly from the provided CONTEXT.  "
        "Always cite the filenames you relied on, e.g., [pikachu.md]."
    ),
    output_type=RAGAnswer,
    tools=[late_chunk_context],
    retries=3
)

q = "Who has more powerful normal type attack - Charizard or Pikachu?"
rag_response = latechunking_agent.run_sync(q)

rprint(rag_response)
17:13:47.398 latechunking_agent run
17:13:47.414   chat x-ai/grok-4-fast
17:13:51.037   running 1 tool
17:13:51.037     running tool: late_chunk_context
17:13:51.037       Late chunking search with query: Charizard vs Pikachu normal type moves power
17:13:57.576   chat x-ai/grok-4-fast
17:13:59.041   running 1 tool
17:13:59.042     running tool: late_chunk_context
17:13:59.042       Late chunking search with query: Charizard normal type moves
17:14:06.736   chat x-ai/grok-4-fast
17:14:08.616   running 1 tool
17:14:08.624     running tool: late_chunk_context
17:14:08.624       Late chunking search with query: Pikachu normal type moves power
17:14:17.083   chat x-ai/grok-4-fast
17:14:21.135   running 1 tool
17:14:21.135     running tool: late_chunk_context
17:14:21.135       Late chunking search with query: strongest normal type move Charizard Pikachu base power
17:14:35.013   chat x-ai/grok-4-fast
AgentRunResult(
    output=RAGAnswer(
        answer='Based on the provided context, specific details on the base power of normal-type moves for 
Charizard and Pikachu are not directly listed in the retrieved chunks. However, from general Pokémon knowledge 
referenced in the database structures, Charizard can learn higher-base-power moves like Slash (70) naturally, while
Pikachu typically learns lower-power moves like Quick Attack (40). Thus, Charizard has a more powerful normal-type 
attack.',
        used_tool='vector',
        citation='[charizard.md] [pikachu.md]'
    )
)
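To see where the savings come from, it helps to compare embedding-call volumes under some assumed corpus numbers (all figures below are illustrative, not measured):

```python
# Back-of-envelope comparison of pre-chunking vs late chunking (illustrative numbers).
n_docs = 1_000          # documents in the corpus
chunks_per_doc = 20     # average chunks per document
queries = 100           # queries served
top_docs = 3            # documents chunked per query in late chunking

# Pre-chunking: embed every chunk once, up front.
prechunk_embeds = n_docs * chunks_per_doc

# Late chunking: one doc-level vector each, plus on-demand chunk embeddings per query.
latechunk_embeds = n_docs + queries * top_docs * chunks_per_doc

print(prechunk_embeds, latechunk_embeds)  # → 20000 7000
```

The crossover depends on query volume — with enough repeated queries, re-embedding popular documents on the fly erodes the advantage, which is why real deployments typically cache chunk embeddings.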

🏁 Conclusion — From Simple RAG to Adaptive, Agentic Retrieval#

In this tutorial, we built a complete end-to-end Retrieval-Augmented Generation (RAG) pipeline — starting from basic semantic search to advanced agentic techniques that reason about how to retrieve.

In the next tutorial, we’ll go one level deeper into relational reasoning through GraphRAG — a paradigm that connects retrieved knowledge not just by similarity, but by semantic relationships and causal links.

🧩 Key takeaway#

RAG is not just about retrieval — it’s about reasoning with evidence. From hybrid searches to self-reflective loops, each enhancement makes the agent more reliable, interpretable, and adaptive.

Next stop: GraphRAG — where your agents will not just fetch information, but understand relationships, infer causality, and build knowledge networks.

🕸️ From chunks → to connections → to cognition.