04. Retrieval-Augmented Generation#

Large Language Models (LLMs) are brilliant generalists — they’ve read the internet and can reason across domains — but they don’t know what they haven’t seen. Their parameters store general knowledge, not private, up-to-date, or domain-specific facts. Retrieval-Augmented Generation (RAG) bridges that gap. It combines:

  1. Retrieval – find relevant information from an external knowledge base (e.g., docs, databases, websites).

  2. Generation – pass that retrieved context into an LLM to ground its answer.

This simple loop — retrieve → augment → generate — makes the model:

  • More accurate (uses real facts, not hallucinations)

  • More current (retrieval can include recent or proprietary data)

  • Cheaper & smaller (you don’t need to fine-tune large models for every dataset)

  • Explainable (you can trace answers back to the retrieved sources)

RAG is now the foundation of modern enterprise AI systems, powering products like search-chat hybrids, coding copilots, knowledge assistants, and customer-support bots. In short: RAG makes LLMs grounded, trustworthy, and useful in the real world.
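The retrieve → augment → generate loop can be sketched in a few lines. Everything here is a stub: `retrieve` and `llm` stand in for a real vector store and a real model call, so treat this as the shape of the pattern rather than an implementation.

```python
def rag_answer(query: str, retrieve, llm) -> str:
    # 1. Retrieve: fetch relevant passages from an external knowledge base
    chunks = retrieve(query)
    # 2. Augment: splice the evidence into the prompt
    context = "\n---\n".join(chunks)
    prompt = (
        f"Answer using ONLY the context below.\n\n"
        f"CONTEXT:\n{context}\n\nQUESTION: {query}"
    )
    # 3. Generate: the LLM grounds its answer in the retrieved context
    return llm(prompt)

# Toy demo with stub components
fake_retrieve = lambda q: ["Pikachu is an Electric-type Pokémon."]
fake_llm = lambda p: p.splitlines()[3]  # pretend-LLM: echoes the first context line
print(rag_answer("What type is Pikachu?", fake_retrieve, fake_llm))
```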

I highly recommend watching explanations of RAG from IBM and Cole Medin.

Scenario: Pokémon!#

Let’s take something seemingly simple — asking questions about Pokémon species like Pikachu, Charizard, or Mewtwo. At first glance, LLMs might seem to know this, but there are hidden challenges:

| Problem | Why it’s hard for an LLM |
| --- | --- |
| Data freshness | Game mechanics, move sets, and forms change with every generation — LLMs trained on older data may be outdated. |
| Structured facts | Evolution trees, base stats, and type matchups are stored in tables, not prose — hard for models to memorize precisely. |
| Ambiguity | Words like “form”, “Mega Evolution”, “TM”, or “base stats” require domain-specific interpretation. |
| Compositional queries | “Which Pokémon evolves into Pikachu?” or “List Charizard’s Mega forms and their base stats” require multiple lookups and reasoning steps. |

When we ask these zero-shot, even the best LLMs often hallucinate:

  • inventing fake evolution lines,

  • mixing up stats across generations,

  • or returning vague, generic answers.

That’s where RAG shines:

  • We retrieve the real Pokémon data (from pokemondb.net in this tutorial).

  • We chunk and embed those markdown pages in a vector database (LanceDB).

  • Then, for each query, we retrieve the most relevant chunks and let the LLM reason grounded in evidence.

So instead of guessing, our agent reads and reasons. This setup scales naturally to enterprise settings — from Pokémon encyclopedias to product catalogs, regulatory documents, or customer knowledge bases.

For our data, we use PokemonDB. We’ll fetch six species (pichu, pikachu, raichu, charizard, mewtwo, slowpoke) and save each page as .md. The pages are served as HTML; we’ll convert them to Markdown for easier chunking.

import requests, pathlib
from markdownify import markdownify as mdify

# Pokémon species to fetch: (name, Pokédex URL)
POKEMON = [
    ("pichu",     "https://pokemondb.net/pokedex/pichu"),
    ("pikachu",   "https://pokemondb.net/pokedex/pikachu"),
    ("raichu",    "https://pokemondb.net/pokedex/raichu"),
    ("charizard", "https://pokemondb.net/pokedex/charizard"),
    ("mewtwo",    "https://pokemondb.net/pokedex/mewtwo"),
    ("slowpoke",  "https://pokemondb.net/pokedex/slowpoke"),
]

def fetch_markdown(url: str) -> str:
    html = requests.get(url, timeout=30).text
    md = mdify(html, heading_style="ATX")
    return md

DATA_DIR = pathlib.Path("./data/pokemon_md")
DATA_DIR.mkdir(parents=True, exist_ok=True)  # ensure the target directory exists before writing

downloaded = []
for name, url in POKEMON:
    md_text = fetch_markdown(url)
    path = DATA_DIR / f"{name}.md"
    path.write_text(md_text, encoding="utf-8")
    downloaded.append((name, str(path), url))

print(f"Saved {len(downloaded)} markdown files → {DATA_DIR}")
Saved 6 markdown files → data\pokemon_md

Let’s see what a sample of this data page looks like.

from IPython.display import Markdown, display
import pathlib

md_path = pathlib.Path("./data/pokemon_md/pikachu.md")
display(Markdown(md_path.read_text(encoding="utf-8")[3000:4000]))  # characters 3000-4000

Pikachu artwork by Ken Sugimori

Additional artwork

Pokédex data

| | | | — | — | | National № | 0025 | | Type | Electric | | Species | Mouse Pokémon | | Height | 0.4 m (1′04″) | | Weight | 6.0 kg (13.2 lbs) | | Abilities | 1. Static Lightning Rod (hidden ability) | | Local № | 0025 (Yellow/Red/Blue) 0022 (Gold/Silver/Crystal) 0156 (Ruby/Sapphire/Emerald) 0025 (FireRed/LeafGreen) 0104 (Diamond/Pearl) 0104 (Platinum) 0022 (HeartGold/SoulSilver) 0036 (X/Y — Central Kalos) 0163 (Omega Ruby/Alpha Sapphire) 0025 (Sun/Moon — Alola dex) 0032 (U.Sun/U.Moon — Alola dex) 0025 (Let’s Go Pikachu/Let’s Go Eevee) 0194 (Sword/Shield) 0104 (Brilliant Diamond/Shining Pearl) 0056 (Legends: Arceus) 0074

Preparing our Knowledge Base#

Now that we’ve downloaded Pokémon data as .md files (for Pikachu, Charizard, Mewtwo, etc.),
we need to split the text into smaller chunks before embedding it into a vector database.

Why?

  • LLMs and embedding models have context limits — we can’t feed an entire document at once.

  • Smaller, semantically coherent chunks help retrieval systems match relevant sections precisely.

  • Chunking also improves Recall@k, latency, and embedding reuse during updates.

We’ll try two common splitting strategies:

| Splitter | Description | When to use |
| --- | --- | --- |
| 🧩 RecursiveCharacterTextSplitter | Splits text purely by length, preserving overlap. | Generic text without structure. |
| 🧱 MarkdownHeaderTextSplitter | Splits along Markdown headers (#, ##, ###), then limits size. | Structured content (docs, wikis, pages like Pokémon DB). |

After chunking, we’ll have two parallel sets of documents:

  • docs_rec: recursively chunked plain text

  • docs_md: structure-aware markdown chunks

These will later be embedded into LanceDB and compared for retrieval quality.

from typing import List, Dict, Any, Optional, Tuple
import os

from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

# --- Chunking params ---
CHUNK_SIZE = 700
CHUNK_OVERLAP = 120

# --- Eval/profiling ---
EVAL_K_LIST = [1, 3, 5]
EMBEDDING_COST_PER_1K = float(os.getenv("EMBED_COST_PER_1K", "0.00013"))  # USD
PRINT_TOP_N = 5

def read_files_as_object_array(directory_path: str) -> List[Dict[str, str]]:
    out = []
    for fname in os.listdir(directory_path):
        fpath = os.path.join(directory_path, fname)
        if os.path.isfile(fpath):
            with open(fpath, "r", encoding="utf-8") as f:
                out.append({"filename": fname, "content": f.read()})
    return out

def recursive_text_splitter(data, chunk_size, overlap_size):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=overlap_size, length_function=len, is_separator_regex=False
    )
    texts = splitter.create_documents(
        [f"{d['filename']}\n{d['content']}" for d in data],
        metadatas=[{"filename": d["filename"]} for d in data],
    )
    return texts

def markdown_splitter(data, chunk_size, overlap_size):
    md_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")], strip_headers=True
    )
    md_splits = []
    for d in data:
        splits = md_splitter.split_text(d["content"])
        for s in splits:
            s.metadata["filename"] = d["filename"]
        md_splits.extend(splits)

    size_limiter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=overlap_size, length_function=len, is_separator_regex=False
    )
    return size_limiter.split_documents(md_splits)

docs_raw = read_files_as_object_array(str(DATA_DIR))
docs_rec = recursive_text_splitter(docs_raw, CHUNK_SIZE, CHUNK_OVERLAP)
docs_md  = markdown_splitter(docs_raw, CHUNK_SIZE, CHUNK_OVERLAP)

print(f"Recursive chunks: {len(docs_rec)} | Markdown+size chunks: {len(docs_md)}")
Recursive chunks: 526 | Markdown+size chunks: 509
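The role of overlap is easiest to see with a bare-bones sliding window. This is a simplified illustration, not what RecursiveCharacterTextSplitter does internally (the real splitter prefers natural separators like paragraphs and sentences), but it shows why neighbouring chunks share context and why the chunks together can contain more characters than the source:

```python
def sliding_chunks(text: str, size: int, overlap: int) -> list[str]:
    # each new chunk starts `size - overlap` characters after the previous one,
    # so neighbouring chunks share `overlap` characters of context
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

chunks = sliding_chunks("abcdefghij", size=4, overlap=2)
print(chunks)                 # neighbouring chunks share 2 characters
print(sum(map(len, chunks)))  # more characters than the 10 in the source
```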

Building our Vector Database#

Before our agent can “retrieve” knowledge, we need a database that understands vectors — numerical representations of text meaning (embeddings). That’s where LanceDB comes in.

🔍 What is LanceDB?#

LanceDB is a lightweight, local-first vector database built on the Lance columnar format.
It’s designed for:

  • Storing and searching high-dimensional embeddings (like text or image vectors).

  • Performing semantic similarity queries (e.g., “find texts most similar to this query”).

  • Hybrid retrieval: combining full-text search (BM25 / Tantivy) and vector search.

  • Speed and simplicity — it runs locally (no separate server needed).

🧠 What we’ll do here#

  1. Embed all Pokémon chunks using an OpenAI-compatible embedding model served via OpenRouter (qwen/qwen3-embedding-8b by default; override with the EMBEDDINGS_MODEL env var).

  2. Create / connect to a LanceDB table named "pokemon_pages".

  3. Insert each chunk’s text, vector, and metadata (like filename & splitter type).

  4. Build a full-text search (FTS) index for keyword lookups alongside vector search.

After this step, we’ll have a ready-to-query LanceDB store — the foundation for our Retrieval-Augmented Generation (RAG) pipeline.

from dotenv import load_dotenv
from openai import OpenAI

import lancedb
import uuid

load_dotenv()

OPENAI_BASE_URL = "https://openrouter.ai/api/v1"

EMBED_MODEL = os.getenv("EMBEDDINGS_MODEL", "qwen/qwen3-embedding-8b")

client = OpenAI(base_url=OPENAI_BASE_URL, api_key=os.getenv('OPENROUTER_API_KEY'))

DB_URI = "./db/sample-lancedb"
TABLE_NAME_TMP = "pokemon_pages_tmp"
TABLE_NAME = "pokemon_pages"

def embed_texts(texts: List[str], model: str = EMBED_MODEL, batch_size: int = 64) -> List[List[float]]:
    """
    Returns a list of embedding vectors. Uses an OpenAI-compatible client pointed at OpenRouter.
    """
    out = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        resp = client.embeddings.create(model=model, input=batch)
        out.extend([e.embedding for e in resp.data])
    return out

db = lancedb.connect(DB_URI)
try:
    tbl = db.open_table(TABLE_NAME_TMP)
    print(f"Loaded LanceDB at {DB_URI} (table={TABLE_NAME_TMP})")
except Exception:  # table doesn't exist yet; build and index it
    all_chunks = []
    for d in docs_rec:
        all_chunks.append({"id": str(uuid.uuid4()), "content": d.page_content,
                        "metadata": {"filename": d.metadata.get("filename",""), "splitter": "recursive"}})
    for d in docs_md:
        all_chunks.append({"id": str(uuid.uuid4()), "content": d.page_content,
                        "metadata": {"filename": d.metadata.get("filename",""), "splitter": "markdown"}})

    print("Embedding chunks...")
    vectors = embed_texts([c["content"] for c in all_chunks])
    for c, v in zip(all_chunks, vectors):
        c["vector"] = v
    tbl = db.create_table(TABLE_NAME_TMP, data=all_chunks)
    tbl.create_fts_index("content")
    print(f"Indexed {len(all_chunks)} chunks into LanceDB at {DB_URI} (table={TABLE_NAME_TMP})")
Loaded LanceDB at ./db/sample-lancedb (table=pokemon_pages_tmp)
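The evaluation that follows calls three search helpers we haven’t shown: perform_vector_search, perform_fts_search, and perform_hybrid_search. Below is one plausible sketch of them against the LanceDB table built above — the exact LanceDB query API may vary across versions, and the optional `pokemon` filter is omitted here. The Reciprocal Rank Fusion helper `rrf_fuse` is pure Python and is the part worth studying.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score each id by the sum of 1/(k + rank) across rankings."""
    scores: dict = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def perform_vector_search(query, pokemon=None, top_k=5):
    # embed the query, then nearest-neighbor search over the "vector" column
    qv = embed_texts([query])[0]
    return tbl.search(qv).limit(top_k).to_list()

def perform_fts_search(query, pokemon=None, top_k=5):
    # keyword search over the FTS index built on "content"
    return tbl.search(query, query_type="fts").limit(top_k).to_list()

def perform_hybrid_search(query, pokemon=None, top_k=5):
    # run both searches, fuse their rankings by chunk id, return the fused top_k
    vec = perform_vector_search(query, pokemon, top_k=top_k * 2)
    fts = perform_fts_search(query, pokemon, top_k=top_k * 2)
    by_id = {r["id"]: r for r in vec + fts}
    fused_ids = rrf_fuse([[r["id"] for r in vec], [r["id"] for r in fts]])
    return [by_id[i] for i in fused_ids[:top_k]]
```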

Evaluating Retrieval Quality#

Once our Pokémon chunks are embedded and searchable, we need to measure how well the retrieval step is working. Even the best LLM can only answer correctly if the right information was fetched first.

🧩 Why Evaluation Matters#

RAG systems rely on two main components:

  1. Retrieval – finding the most relevant chunks from the knowledge base.

  2. Generation – the LLM reasoning over those chunks to answer questions.

If retrieval fails (missing or irrelevant chunks), generation will inevitably fail too — no matter how smart the model is.
That’s why retrieval metrics are critical for diagnosing performance.

📊 Metrics - Coverage, Recall, etc.#

| Metric | What it measures | Why it matters |
| --- | --- | --- |
| Coverage Ratio | How much of the original document text is preserved in the chunked dataset. | Ensures chunking didn’t lose too much information. |
| Recall@k | Whether at least one relevant chunk appears in the top-k retrieved results. | Tests if the search finds what we need (completeness). |
| MRR (Mean Reciprocal Rank) | How early in the ranking the first relevant chunk appears. | Rewards search methods that bring correct answers to the top. |
| Latency (later) | Time taken for each search query. | Balances quality vs speed for production systems. |

In the next cell, we’ll start with coverage statistics — verifying that our chunking step retains the source content for both splitters (recursive and markdown). Note that a coverage ratio slightly above 1.0 is expected: chunk overlap duplicates up to CHUNK_OVERLAP characters at each boundary, so the chunks together contain a bit more text than the originals.
This acts as a sanity check before moving on to deeper retrieval evaluation.
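As a concrete illustration of the two ranking metrics (the evaluation cell further down computes them the same way), suppose the retriever returns three files in order and only one is relevant:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # 1.0 if any relevant item appears in the top-k results, else 0.0
    return 1.0 if any(doc in relevant for doc in retrieved[:k]) else 0.0

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    # 1/rank of the first relevant item; 0.0 if none is retrieved
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["raichu.md", "pikachu.md", "pichu.md"]
relevant = {"pichu.md"}
print(recall_at_k(retrieved, relevant, 1))   # relevant file not in top-1
print(recall_at_k(retrieved, relevant, 3))   # but it is in top-3
print(reciprocal_rank(retrieved, relevant))  # first relevant hit at rank 3
```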

import pandas as pd 

GROUND_TRUTH = {
    "Which Pokémon evolves into Pikachu?": ["pichu.md"],
    "Which Pokémon learns Volt Tackle via breeding/light ball mechanics?": ["pikachu.md", "pichu.md"],
    "Show Mega evolutions for Charizard": ["charizard.md"],
    "Base stats of Mewtwo": ["mewtwo.md"],
    "What is Mewtwo’s base stat total (BST)?": ["mewtwo.md"],
    "What is Slowpoke's type?": ["slowpoke.md"],
    "What moves can Raichu learn by TM?": ["raichu.md"],
}

def coverage_stats(docs_raw, chunks) -> Dict[str, float]:
    total_chars = sum(len(d["content"]) for d in docs_raw)
    chunk_chars = sum(len(c.page_content) for c in chunks)
    return {
        "total_chars": total_chars,
        "chunk_chars": chunk_chars,
        "coverage_ratio": chunk_chars / total_chars if total_chars else 0.0
    }

cov_rec = coverage_stats(docs_raw, docs_rec)
cov_md  = coverage_stats(docs_raw, docs_md)

pd.DataFrame([
    {"splitter": "recursive", **cov_rec},
    {"splitter": "markdown",  **cov_md},
])
splitter total_chars chunk_chars coverage_ratio
0 recursive 249998 264501 1.058012
1 markdown 249998 259329 1.037324
import time
from rich import print as rprint  # rprint is used below for formatted output

def eval_search(queries: List[str], search_fn, ks=(1,3,5)) -> pd.DataFrame:
    rows = []
    for q in queries:
        t0 = time.time()
        results = search_fn(q, top_k=max(ks))
        elapsed = time.time() - t0
        filenames = [r["metadata"]["filename"] for r in results]
        gt = set(GROUND_TRUTH[q])
        recs = {}
        for k in ks:
            recs[f"Recall@{k}"] = 1.0 if any(f in gt for f in filenames[:k]) else 0.0
        rr = 0.0
        for i, f in enumerate(filenames, start=1):
            if f in gt:
                rr = 1.0 / i
                break
        rows.append({"query": q, "latency_ms": round(1000*elapsed,2), "MRR": rr, **recs})
    return pd.DataFrame(rows)

df_vec = eval_search(list(GROUND_TRUTH.keys()), perform_vector_search, ks=tuple(EVAL_K_LIST))
df_fts = eval_search(list(GROUND_TRUTH.keys()), perform_fts_search,    ks=tuple(EVAL_K_LIST))
df_hyb = eval_search(list(GROUND_TRUTH.keys()), perform_hybrid_search, ks=tuple(EVAL_K_LIST))

summary = pd.DataFrame({
    "Method": ["Vector","FTS","Hybrid"],
    "MRR(mean)": [df_vec["MRR"].mean(), df_fts["MRR"].mean(), df_hyb["MRR"].mean()],
    **{f"Recall@{k}(mean)": [df_vec[f"Recall@{k}"].mean(), df_fts[f"Recall@{k}"].mean(), df_hyb[f"Recall@{k}"].mean()] for k in EVAL_K_LIST},
    "Latency(ms, mean)": [df_vec["latency_ms"].mean(), df_fts["latency_ms"].mean(), df_hyb["latency_ms"].mean()],
}).round(3)

rprint("[bold]Per-query (Vector):[/]"); display(df_vec)
rprint("[bold]Per-query (FTS):[/]"); display(df_fts)
rprint("[bold]Per-query (Hybrid):[/]"); display(df_hyb)
rprint("[bold green]Summary:[/]"); display(summary)
Per-query (Vector):
query latency_ms MRR Recall@1 Recall@3 Recall@5
0 Which Pokémon evolves into Pikachu? 586.07 0.0 0.0 0.0 0.0
1 Which Pokémon learns Volt Tackle via breeding/... 501.79 0.5 0.0 1.0 1.0
2 Show Mega evolutions for Charizard 4004.79 1.0 1.0 1.0 1.0
3 Base stats of Mewtwo 464.45 1.0 1.0 1.0 1.0
4 What is Mewtwo’s base stat total (BST)? 533.87 1.0 1.0 1.0 1.0
5 What is Slowpoke's type? 886.77 1.0 1.0 1.0 1.0
6 What moves can Raichu learn by TM? 2282.72 1.0 1.0 1.0 1.0
Per-query (FTS):
query latency_ms MRR Recall@1 Recall@3 Recall@5
0 Which Pokémon evolves into Pikachu? 33.01 1.0 1.0 1.0 1.0
1 Which Pokémon learns Volt Tackle via breeding/... 17.15 1.0 1.0 1.0 1.0
2 Show Mega evolutions for Charizard 0.00 1.0 1.0 1.0 1.0
3 Base stats of Mewtwo 19.67 1.0 1.0 1.0 1.0
4 What is Mewtwo’s base stat total (BST)? 13.68 1.0 1.0 1.0 1.0
5 What is Slowpoke's type? 18.08 1.0 1.0 1.0 1.0
6 What moves can Raichu learn by TM? 15.24 0.0 0.0 0.0 0.0
Per-query (Hybrid):
query latency_ms MRR Recall@1 Recall@3 Recall@5
0 Which Pokémon evolves into Pikachu? 533.85 0.333333 0.0 1.0 1.0
1 Which Pokémon learns Volt Tackle via breeding/... 500.46 1.000000 1.0 1.0 1.0
2 Show Mega evolutions for Charizard 667.26 1.000000 1.0 1.0 1.0
3 Base stats of Mewtwo 515.35 1.000000 1.0 1.0 1.0
4 What is Mewtwo’s base stat total (BST)? 487.06 1.000000 1.0 1.0 1.0
5 What is Slowpoke's type? 4685.00 1.000000 1.0 1.0 1.0
6 What moves can Raichu learn by TM? 491.01 1.000000 1.0 1.0 1.0
Summary:
Method MRR(mean) Recall@1(mean) Recall@3(mean) Recall@5(mean) Latency(ms, mean)
0 Vector 0.786 0.714 0.857 0.857 1322.923
1 FTS 0.857 0.857 0.857 0.857 16.690
2 Hybrid 0.905 0.857 1.000 1.000 1125.713

🔎 Interpreting the Results#

TL;DR: Hybrid wins on quality; FTS wins on speed.

  • Hybrid (MRR=0.90, Recall@3/5=1.0): Best overall retrieval quality. Reciprocal Rank Fusion (RRF) captures semantic matches that FTS misses while still surfacing exact-term hits. Ideal default for general-purpose RAG.

  • Vector (MRR=0.78, Recall@5=0.85, ~1300 ms): Strong semantic coverage—great when users paraphrase. Slightly slower due to embedding + nearest-neighbor search.

  • FTS (MRR=0.85, Recall@k ≤ 0.86, ~17 ms): Blazing fast and excels for exact names, forms, numbers (e.g., “TM”, “Mega”). But it can miss paraphrases or semantic matches.

What to deploy

  • Default: Hybrid.

  • Query routing: Use FTS for quoted phrases/IDs/numerics; otherwise Hybrid.

  • Latency-sensitive paths: FTS with a semantic fallback on low-confidence.
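The routing rule can be made concrete with a small heuristic. The regex below is illustrative, not exhaustive; tune it against your own query logs:

```python
import re

def route_query(query: str) -> str:
    """Pick a retrieval mode: exact-match FTS for quoted phrases,
    TM/HM codes, or multi-digit numbers; Hybrid otherwise."""
    if re.search(r'"[^"]+"|\b(?:TM|HM)\d+\b|\b\d{2,}\b', query):
        return "fts"
    return "hybrid"

print(route_query('Which Pokémon learns "Volt Tackle"?'))  # quoted phrase
print(route_query("Where do I find TM45?"))                # TM code
print(route_query("Which Pokémon evolves into Pikachu?"))  # semantic question
```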

Improving Precision — What is Reranking and Why It Helps#

Even after combining vector and keyword search, our top results may still include partially relevant or redundant chunks. That’s where reranking comes in — a crucial final step in the retrieval pipeline.

🎯 What is Reranking? Reranking means taking the initial set of retrieved results (e.g., top 20) and reordering them using a more accurate relevance model. This model computes a finer-grained similarity between the query and each retrieved chunk.

Common reranking approaches:

  • Embedding-based cosine similarity (lightweight) — compares the query vector with each chunk’s vector (as we’ll do here).

  • Cross-encoder models (heavier) — feed [query, passage] pairs into an LLM or BERT-like model for deeper contextual matching.

💡 Why Reranking Helps

  • First-stage retrieval (vector/FTS/hybrid) is optimized for speed, not precision.

  • Reranking refines the order to push the most semantically aligned chunks to the top, improving MRR and answer faithfulness.

  • It’s especially useful when:

    • Many chunks share overlapping content.

    • The query is nuanced or multi-faceted (e.g., “Mega evolutions and base stats of Charizard”).

    • You plan to feed only a few chunks into the LLM for generation.

In the next cell, we’ll apply a simple cosine-similarity-based reranker that reorders hybrid search results using the query’s embedding — a fast and effective upgrade for small to mid-sized RAG systems.

import numpy as np 

def cosine(a, b):
    a = np.array(a); b = np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def rerank_by_query_vector(query: str, results: List[Dict[str, Any]], top_k: int = 5):
    """
    Rerank retrieved results by cosine similarity between the query
    embedding and each result's stored embedding vector, then
    deduplicate by source file so the context covers distinct pages.
    """
    qv = client.embeddings.create(model=EMBED_MODEL, input=[query]).data[0].embedding
    rescored = sorted(results, key=lambda r: cosine(qv, r["vector"]), reverse=True)

    deduped, seen_files = [], set()
    for r in rescored[:top_k]:
        fname = r["metadata"]["filename"]
        if fname in seen_files:
            continue
        seen_files.add(fname)
        deduped.append(r)
    return deduped

def perform_hybrid_rerank(query: str, pokemon: Optional[str] = None, top_k: int = 5):
    fused = perform_hybrid_search(query, pokemon, top_k=top_k*10)
    return rerank_by_query_vector(query, fused, top_k=top_k)

df_hyr = eval_search(list(GROUND_TRUTH.keys()), perform_hybrid_rerank, ks=tuple(EVAL_K_LIST))

summary = pd.DataFrame({
    "Method": ["Vector","FTS","Hybrid","Reranking"],
    "MRR(mean)": [df_vec["MRR"].mean(), df_fts["MRR"].mean(), df_hyb["MRR"].mean(), df_hyr["MRR"].mean()],
    **{f"Recall@{k}(mean)": [df_vec[f"Recall@{k}"].mean(), df_fts[f"Recall@{k}"].mean(), df_hyb[f"Recall@{k}"].mean(), df_hyr[f"Recall@{k}"].mean()] for k in EVAL_K_LIST},
    "Latency(ms, mean)": [df_vec["latency_ms"].mean(), df_fts["latency_ms"].mean(), df_hyb["latency_ms"].mean(), df_hyr["latency_ms"].mean()],
}).round(3)

rprint("[bold]Per-query (Hybrid + Rerank):[/]"); display(df_hyr)
rprint("[bold green]Summary:[/]"); display(summary)
Per-query (Hybrid + Rerank):
query latency_ms MRR Recall@1 Recall@3 Recall@5
0 Which Pokémon evolves into Pikachu? 2488.60 0.333333 0.0 1.0 1.0
1 Which Pokémon learns Volt Tackle via breeding/... 2057.19 1.000000 1.0 1.0 1.0
2 Show Mega evolutions for Charizard 1164.91 1.000000 1.0 1.0 1.0
3 Base stats of Mewtwo 1143.49 1.000000 1.0 1.0 1.0
4 What is Mewtwo’s base stat total (BST)? 1174.59 1.000000 1.0 1.0 1.0
5 What is Slowpoke's type? 1132.06 1.000000 1.0 1.0 1.0
6 What moves can Raichu learn by TM? 7774.14 1.000000 1.0 1.0 1.0
Summary:
Method MRR(mean) Recall@1(mean) Recall@3(mean) Recall@5(mean) Latency(ms, mean)
0 Vector 0.786 0.714 0.857 0.857 1322.923
1 FTS 0.857 0.857 0.857 0.857 16.690
2 Hybrid 0.905 0.857 1.000 1.000 1125.713
3 Reranking 0.905 0.857 1.000 1.000 2419.283

Takeaway: Hybrid + reranking matches plain Hybrid on this small corpus (MRR 0.905, perfect Recall@3/5) while roughly doubling latency; reranking earns its keep on larger, noisier corpora where first-stage rankings are weaker. In practice, it’s often used as an optional second stage — applied only when the agent is uncertain or when quality matters more than speed.

🔧 Packaging Retrieval as “Tools” for Agents#

Now that we have multiple retrieval strategies — vector, FTS, hybrid — we’ll wrap them into simple, reusable tools that return formatted text contexts.

These tools will later be used by our PydanticAI agent to decide:

  • Which search mode to use (routing),

  • How much context to retrieve, and

  • When to combine multiple sources (reflection and fusion).

Let’s define these tool functions next.

import logfire
import nest_asyncio

nest_asyncio.apply()

logfire.configure(send_to_logfire=False)  # set to True to stream traces to the Logfire console
logfire.instrument_pydantic_ai()

try:
    tbl = db.open_table(TABLE_NAME)
except Exception:  # fall back to the temporary table built earlier
    tbl = db.open_table(TABLE_NAME_TMP)

def build_context_from_results(results: List[Dict[str,Any]]):
    return "\n---\n".join([
        f"Title: {r['metadata']['filename']}\nContent:\n{r['content']}"
        for r in results
    ])

def tool_vector(query: str, k: int = 5) -> str:
    """Vector search"""
    logfire.info(f"Vector search called with query: {query}")
    res = perform_vector_search(query, top_k=k)
    return build_context_from_results(res)

def tool_fts(query: str, k: int = 5) -> str:
    """Full Text Search"""
    logfire.info(f"FTS search called with query: {query}")
    res = perform_fts_search(query, top_k=k)
    return build_context_from_results(res)

def tool_hybrid(query: str, k: int = 5) -> str:
    """Hybrid Search"""
    logfire.info(f"Hybrid search called with query: {query}")
    res = perform_hybrid_search(query, top_k=k)
    return build_context_from_results(res)

def tool_rerank(query: str, k: int = 5) -> str:
    "Reranking Search"
    logfire.info(f"Reranking search called with query: {query}")
    res = perform_hybrid_rerank(query, top_k=k)
    return build_context_from_results(res)

🔮 From Plain LLM to RAG-Enhanced Agent#

Now that our retrieval tools are ready, let’s test how much they actually help the model think.

🧠 Two Agents, Two Worlds#

We’ll create two simple agents using PydanticAI:

| Agent | Description | Data Access |
| --- | --- | --- |
| 🧩 Vanilla Agent | A plain LLM (e.g., Grok-4 or GPT-4) answering directly from its internal training data. | ❌ No external context |
| 📚 RAG Agent | Same model, but grounded with retrieved Pokémon chunks from LanceDB. It must answer only from the provided context. | ✅ Uses hybrid search tool |
⚔️ The Test#

We’ll ask both agents the same question:

“Who has more powerful normal type attack — Charizard or Pikachu?”

The Vanilla Agent relies purely on what it “remembers.”
The RAG Agent, on the other hand, performs:

  1. Retrieval — pulls relevant chunks from our local Pokémon corpus using tool_hybrid.

  2. Grounded generation — answers based strictly on retrieved evidence and cites sources (e.g., [charizard.md]).

This comparison highlights how RAG agents can reduce hallucinations and provide traceable, verifiable answers even with small, domain-specific knowledge bases.

from pydantic_ai import Agent
from pydantic import BaseModel, Field

CHAT_MODEL  = os.getenv("CHAT_MODEL", "openrouter:x-ai/grok-4-fast")

class VanillaAnswer(BaseModel):
    answer: str = Field(description="Concise, factual answer for the given query.")

class RAGAnswer(BaseModel):
    answer: str = Field(description="Concise, factual answer for the given query.")
    used_tool: str = Field(description="Which tool was used: vector | fts | hybrid | rerank")
    citation: str = Field(description="Filename used to generate response.")

vanilla_agent = Agent(
    model=CHAT_MODEL,
    system_prompt=(
        "You are a Pokémon expert. Answer the given questions."
    ),
    output_type=VanillaAnswer,
    retries=3
)

rag_agent = Agent(
    model=CHAT_MODEL,
    system_prompt=(
        "You answer strictly from the provided CONTEXT. If unknown, say 'I don't know from the corpus'. "
        "Always cite the filenames you relied on, e.g., [pikachu.md]."
    ),
    output_type=RAGAnswer,
    tools=[tool_hybrid],
    retries=3
)


q = "Who has more powerful normal type attack - Charizard or Pikachu?"
vanilla_response = vanilla_agent.run_sync(q)

rprint(vanilla_response.output)

rag_response = rag_agent.run_sync(q)

rprint(rag_response)
11:20:50.065 vanilla_agent run
11:20:50.072   chat x-ai/grok-4-fast
VanillaAnswer(answer='Charizard')
11:20:52.825 rag_agent run
11:20:52.825   chat x-ai/grok-4-fast
11:20:55.145   running 1 tool
11:20:55.145     running tool: tool_hybrid
11:20:55.145       Hybrid search called with query: Charizard vs Pikachu normal type attacks power comparison
11:20:56.037   chat x-ai/grok-4-fast
11:20:59.007   running 1 tool
11:20:59.007     running tool: tool_hybrid
11:20:59.007       Hybrid search called with query: Pikachu normal type moves power
11:21:06.900   chat x-ai/grok-4-fast
11:21:09.728   running 1 tool
11:21:09.730     running tool: tool_hybrid
11:21:09.730       Hybrid search called with query: Pikachu moves list normal type power
11:21:10.272   chat x-ai/grok-4-fast
AgentRunResult(
    output=RAGAnswer(
        answer="Charizard has a more powerful Normal-type attack, with Slash at 70 power, while Pikachu's 
Normal-type moves are not specified in the provided context.",
        used_tool='hybrid',
        citation='charizard.md'
    )
)

In the logs above, you can see a clear difference:

  • 🧩 Vanilla Agent:
    Answered in one word (“Charizard”) straight from its pretrained world knowledge, with no evidence or citation.
    It has no access to our curated Pokémon corpus, so its responses may be outdated, unverifiable, or hallucinated.

  • 📚 RAG Agent (Correct Answer):
    Retrieved the Charizard and Pikachu entries from our LanceDB knowledge base, analyzed their base attack stats,
    and correctly identified that Charizard has the stronger Normal-type attack, with a source citation (e.g., [charizard.md]).

This demonstrates the core benefit of Retrieval-Augmented Generation:

  • It grounds responses in real, verifiable data.

  • It produces contextually correct and source-traceable answers.

  • It reduces hallucinations and improves trustworthiness — especially in factual, domain-specific tasks.

In short, the RAG agent doesn’t guess — it knows where to look.

In the next cell, we’ll run the same question again and see how the Reranking Agent responds.
Note that higher retrieval precision doesn’t automatically produce a better final answer: in the run below, the reranker’s per-file deduplication leaves fewer chunks in context, and the agent fails to find the move data it needs, so it declines to answer.

reranking_agent = Agent(
    model=CHAT_MODEL,
    system_prompt=(
        "You answer strictly from the provided CONTEXT.  "
        "Always cite the filenames you relied on, e.g., [pikachu.md]."
    ),
    output_type=RAGAnswer,
    tools=[tool_rerank],
    retries=3
)


q = "Who has more powerful normal type attack - Charizard or Pikachu?"
rag_response = reranking_agent.run_sync(q)

rprint(rag_response)
15:29:43.530 reranking_agent run
15:29:43.532   chat x-ai/grok-4-fast
15:29:45.807   running 1 tool
15:29:45.808     running tool: tool_rerank
15:29:45.809       Reranking search called with query: Charizard Pikachu normal type moves power comparison
15:29:48.467   chat x-ai/grok-4-fast
15:29:52.070   running 1 tool
15:29:52.080     running tool: tool_rerank
15:29:52.080       Reranking search called with query: Charizard learnable normal type moves power
15:29:54.960   chat x-ai/grok-4-fast
15:29:57.605   running 1 tool
15:29:57.605     running tool: tool_rerank
15:29:57.606       Reranking search called with query: Pikachu learnable normal type moves power
15:29:59.481   chat x-ai/grok-4-fast
AgentRunResult(
    output=RAGAnswer(
        answer='Based on the available context, Pikachu\'s type effectiveness chart shows normal type moves are 
neutral against it, but no specific normal type moves or powers for either Pokémon are detailed in the provided 
files. Therefore, I cannot determine which has the more powerful normal type attack from the context 
alone.</parameter name>\\n<parameter name="used_tool">hybrid',
        used_tool='vector',
        citation='pikachu.md'
    )
)

🧠 Building a Smarter Agent — Multi-Tool Retrieval and Dynamic Reasoning#

So far, we’ve seen each retrieval method in isolation — vector, keyword, hybrid, and reranking. But real-world questions vary in structure: some are factual, some numeric, some semantic. No single search method fits them all.

🛠️ Enter the Multi-Tool Agent#

In this step, we give our RAG agent access to all retrieval tools:

  • 🔡 tool_fts → for exact terms (e.g., “TM45” or “Base stats”).

  • 🧠 tool_vector → for meaning-based matches and paraphrases.

  • tool_hybrid → for balanced performance.

  • 🎯 tool_rerank → for highest-precision reranked retrieval.

The agent can now choose the best tool dynamically based on query type and context — an early example of tool orchestration or self-routing.

This brings us closer to a true agentic RAG system — one that reasons about how to reason.

multitool_agent = Agent(
    model=CHAT_MODEL,
    system_prompt=(
        "You answer strictly from the provided CONTEXT.  "
        "Always cite the filenames you relied on, e.g., [pikachu.md]."
    ),
    output_type=RAGAnswer,
    tools=[tool_fts, tool_vector, tool_hybrid, tool_rerank],
    retries=3
)


q = "Who has more powerful normal type attack - Charizard or Pikachu?"
rag_response = multitool_agent.run_sync(q)

rprint(rag_response)
15:30:02.783 multitool_agent run
15:30:02.785   chat x-ai/grok-4-fast
15:30:08.556   running 1 tool
15:30:08.556     running tool: tool_hybrid
15:30:08.556       Hybrid search called with query: Charizard normal type moves base power
15:30:09.484   chat x-ai/grok-4-fast
15:30:11.066   running 1 tool
15:30:11.067     running tool: tool_hybrid
15:30:11.069       Hybrid search called with query: Pikachu normal type moves base power
15:30:22.920   chat x-ai/grok-4-fast
15:30:25.104   running 1 tool
15:30:25.104     running tool: tool_hybrid
15:30:25.104       Hybrid search called with query: Charizard learnable Normal type moves base power
15:30:25.858   chat x-ai/grok-4-fast
15:30:29.492   running 1 tool
15:30:29.492     running tool: tool_hybrid
15:30:29.492       Hybrid search called with query: Pikachu learnable Normal type moves base power
15:30:30.251   chat x-ai/grok-4-fast
AgentRunResult(
    output=RAGAnswer(
        answer="Charizard has a more powerful Normal-type attack than Pikachu. Charizard's base Attack stat is 84, 
compared to Pikachu's 55, and it learns stronger physical Normal moves like Slash (70 base power) by level-up, 
while Pikachu's strongest level-up Normal move is Quick Attack (40 base power). Even for shared moves like Hyper 
Beam (150 base power via TM), Charizard's higher Attack stat results in greater overall power.",
        used_tool='hybrid',
        citation='charizard.md'
    )
)

🧩 Contextualised Retrieval — Using an LLM to Summarize Retrieved Evidence#

So far, our agents have pulled relevant chunks from LanceDB and fed them as-is into the answering model. However, as context grows, simply concatenating text leads to redundancy, token waste, and sometimes distracting noise. To address this, we introduce Contextualised Retrieval — a smarter approach where a small LLM acts as a retrieval summarizer.

🧠 How this works#

  1. Retrieve: The agent first collects top-k chunks via hybrid search.

  2. Summarize: A lightweight retrieval assistant LLM processes these chunks and condenses them into a focused summary.

  3. Augment: The final answering agent then uses this context summary plus the original chunks for grounded reasoning.

I highly recommend going through Anthropic’s Guide on Contextual Retrieval and the article by Wang et al. (2025).

🎯 Why this matters#

  • Reduces token and latency overhead by summarizing only key attributes (types, evolutions, base stats).

  • Improves signal-to-noise ratio, especially when multiple retrieved chunks overlap.

  • Enables a more scalable agentic retrieval loop, where the model reflects on retrieved context before reasoning.

However, this also comes at the cost of one LLM call per search.

In the next cell, we’ll define:

  • build_context_from_results_via_llm() → uses an LLM to synthesize a compact, focused context summary.

  • tool_hybrid_contextualised() → wraps hybrid retrieval + summarization as a single callable tool.

We’ll then run our Contextualised Agent to answer the same question — expect shorter, sharper answers with clear citations and improved factual consistency.

def build_context_from_results_via_llm(query: str, results: List[Dict[str, Any]]) -> str:
    combined = build_context_from_results(results)

    retrieval_agent = Agent(
        model=CHAT_MODEL,
        system_prompt=(
            "You are a retrieval assistant helping an LLM ground its reasoning. "
            "Given the retrieved Pokémon entries below, summarize only the most relevant "
            "details and context in 3–5 concise sentences. Focus on types, evolutions, "
            "base stats, and notable traits that help answer factual questions.\n\n"
            f"Input query:\n {query}\n\n"
            f"Retrieved context:\n {combined}"
        ),
        retries=3
    )

    # The query and chunks already live in the system prompt, so the user message can be empty.
    summary = retrieval_agent.run_sync("").output
    logfire.info(f"Summary returned: {summary}")
    # The summary leads; the raw chunks are appended for grounding (drop them to save tokens).
    return f"### Context Summary\n{summary}\n\n---\n### Full Retrieved Chunks\n{combined}"


def tool_hybrid_contextualised(query: str, k: int = 5) -> str:
    """Hybrid retrieval followed by an LLM pass that condenses the results."""
    logfire.info(f"Contextual Retrieval search called with query: {query}")
    res = perform_hybrid_search(query, top_k=k)
    return build_context_from_results_via_llm(query, res)

contextual_agent = Agent(
    model=CHAT_MODEL,
    system_prompt=(
        "You answer strictly from the provided CONTEXT.  "
        "Always cite the filenames you relied on, e.g., [pikachu.md]."
    ),
    output_type=RAGAnswer,
    tools=[tool_hybrid_contextualised],
    retries=3
)


q = "Who has more powerful normal type attack - Charizard or Pikachu?"
rag_response = contextual_agent.run_sync(q)

rprint(rag_response)
15:52:31.843 contextual_agent run
15:52:31.845   chat x-ai/grok-4-fast
15:52:34.406   running 1 tool
15:52:34.406     running tool: tool_hybrid_contextualised
15:52:34.406       Contextual Retrieval search called with query: Charizard Pikachu normal type attack power comparison
15:52:35.609       retrieval_agent run
15:52:35.609         chat x-ai/grok-4-fast
15:52:41.747       Summary returned: Pikachu is an Electric-type Pokémon with neutral effectiveness... compared to Pikachu's unevolved state without provided stats.
15:52:41.769   chat x-ai/grok-4-fast
15:52:45.984   running 1 tool
15:52:45.984     running tool: tool_hybrid_contextualised
15:52:45.984       Contextual Retrieval search called with query: Pikachu base stats attack
15:52:48.177       retrieval_agent run
15:52:48.177         chat x-ai/grok-4-fast
15:52:54.938       Summary returned: Pikachu is an Electric-type Pokémon that evolves from Pichu an...tack stat (standardly known as 55, though not confirmed here).
15:52:54.943   chat x-ai/grok-4-fast
15:52:57.483   running 1 tool
15:52:57.484     running tool: tool_hybrid_contextualised
15:52:57.485       Contextual Retrieval search called with query: Pikachu base Attack stat
15:52:58.178       retrieval_agent run
15:52:58.178         chat x-ai/grok-4-fast
15:53:05.286       Summary returned: Pikachu is an Electric-type Pokémon that evolves from Pichu an...etrieved entry, focusing instead on general Pokédex structure.
15:53:05.301   chat x-ai/grok-4-fast
AgentRunResult(
    output=RAGAnswer(
        answer="Charizard has a more powerful Normal-type attack due to its higher base Attack stat of 84 (maximum 
293 at level 100), compared to Pikachu's unevolved state with no stats provided in the context, implying lower 
offensive capability for physical attacks like Normal-type moves.",
        used_tool='hybrid',
        citation='[charizard.md]'
    )
)

🔄 Multi-Query Retrieval — Expanding Recall Through Paraphrased Queries#

Even the best retrievers can miss information if the query wording doesn’t match the phrasing in the knowledge base. For example, “Who has stronger normal attacks?” and “Which Pokémon hits harder with normal moves?” express the same intent — but may retrieve different chunks due to surface-level differences in tokens and structure.

To make our system more robust, we introduce Multi-Query Retrieval, also known as Query Augmentation or Multi-Vector RAG.

🧩 How it works#

  1. Generate paraphrases — An auxiliary query-rewriting agent produces multiple semantically equivalent versions of the input question.

  2. Retrieve per variant — Each variation runs its own hybrid search in LanceDB.

  3. Merge and deduplicate — Retrieved results are combined and deduplicated to form a richer, more complete context.

This strategy helps the system:

  • Capture lexical and syntactic diversity in stored text.

  • Improve Recall@k and coverage, especially for sparse or under-represented phrasing.

  • Provide redundant grounding, which stabilizes the final generation step.

I highly recommend looking at the work by Kostric and Balog (2024) on this.

⚙️ What this code does#

  • tool_multiquery()
    → Generates paraphrased queries via a query-rewriting agent, retrieves hybrid results for each, and merges them.

  • multiquery_agent
    → Uses this tool to answer the same comparison question while grounding on a broader semantic context.

This approach trades a small latency increase for higher recall and resilience, bringing our RAG pipeline closer to modern multi-query ensemble systems used in production LLM retrieval frameworks.

from itertools import chain

class QueryVariations(BaseModel):
    variations: List[str]

def tool_multiquery(query: str, num_variations: int = 3, k: int = 5) -> str:
    """Run RAG with multiple paraphrased query variants to improve robustness."""
    # Step 1: Generate paraphrases of the input query
    variation_agent = Agent(
        model=CHAT_MODEL,
        system_prompt=(
            "You are a query rewriting assistant. Given a question, produce "
            f"{num_variations} short paraphrases that preserve meaning but vary wording."
        ),
        output_type=QueryVariations,
        retries=2
    )
    variations = variation_agent.run_sync(query).output.variations
    logfire.info(f"Variations: {variations}")
    queries = [query] + variations

    # Step 2: Retrieve results for all query variants
    all_results = list(chain.from_iterable(perform_hybrid_search(q, top_k=k) for q in queries))

    # Step 3: Deduplicate by content hash and merge
    unique_results = {hash(r["content"]): r for r in all_results}.values()

    return build_context_from_results(list(unique_results))

multiquery_agent = Agent(
    model=CHAT_MODEL,
    system_prompt=(
        "You answer strictly from the provided CONTEXT.  "
        "Always cite the filenames you relied on, e.g., [pikachu.md]."
    ),
    output_type=RAGAnswer,
    tools=[tool_multiquery],
    retries=3
)

q = "Who has more powerful normal type attack - Charizard or Pikachu?"
rag_response = multiquery_agent.run_sync(q)

rprint(rag_response)
15:54:49.332 multiquery_agent run
15:54:49.333   chat x-ai/grok-4-fast
15:54:52.268   running 1 tool
15:54:52.268     running tool: tool_multiquery
15:54:52.283       variation_agent run
15:54:52.284         chat x-ai/grok-4-fast
15:54:55.020       Variations: ['Between Charizard and Pikachu in Pokémon, which has the stro... Pikachu, which Pokémon has the mightier normal type attack?']
15:55:02.756   chat x-ai/grok-4-fast
15:55:07.015   running 1 tool
15:55:07.015     running tool: tool_multiquery
15:55:07.015       variation_agent run
15:55:07.015         chat x-ai/grok-4-fast
15:55:09.834       Variations: ["What are Pikachu's base stats in Pokémon?", "Pikachu's basic stats in the Pokémon games", 'Base stats of Pikachu in Pokémon']
15:55:12.920   chat x-ai/grok-4-fast
15:55:15.711   running 1 tool
15:55:15.711     running tool: tool_multiquery
15:55:15.714       variation_agent run
15:55:15.715         chat x-ai/grok-4-fast
15:55:18.375       Variations: ["What are Pikachu's base Attack and Special Attack stats in P...Pokémon, Pikachu's base stats for Attack and Special Attack?"]
15:55:25.551   chat x-ai/grok-4-fast
AgentRunResult(
    output=RAGAnswer(
        answer="Charizard has a more powerful Normal-type attack than Pikachu, as its base Attack stat (84) is 
higher than Pikachu's (55), and Normal-type moves like Tackle are physical, relying on the Attack stat.",
        used_tool='vector',
        citation='charizard.md'
    )
)
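The merge step above deduplicates by content hash and otherwise treats every retrieved chunk equally. A common refinement in multi-query ensembles is Reciprocal Rank Fusion (RRF), which boosts chunks that rank highly across several query variants. A minimal sketch, using the notebook's `{"content": ...}` result shape and the conventional `k = 60` smoothing constant:

```python
from collections import defaultdict
from typing import Dict, List

def rrf_merge(result_lists: List[List[Dict]], k: int = 60) -> List[Dict]:
    """Fuse ranked result lists: score(chunk) = sum over lists of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    by_content: Dict[str, Dict] = {}
    for results in result_lists:
        for rank, r in enumerate(results, start=1):
            scores[r["content"]] += 1.0 / (k + rank)
            by_content.setdefault(r["content"], r)
    ordered = sorted(scores, key=scores.get, reverse=True)
    return [by_content[c] for c in ordered]

# Two query variants retrieved overlapping chunks; "A" appears in both lists.
run1 = [{"content": "A"}, {"content": "B"}]
run2 = [{"content": "C"}, {"content": "A"}]
merged = rrf_merge([run1, run2])
print([r["content"] for r in merged])  # → ['A', 'C', 'B'] — "A" wins on cross-variant agreement
```

Swapping this in for the hash-based deduplication in `tool_multiquery` would order the merged context by cross-variant agreement instead of retrieval order.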

🔁 Iterative Retrieval with FLARE#

So far, all our retrieval methods (hybrid, reranking, multi-query) assumed that a single retrieval pass is enough. But what if the question is underspecified or requires multi-hop reasoning — e.g., connecting multiple facts across Pokémon pages?

In such cases, a model needs to:

  1. Recognize when information is missing, and

  2. Formulate follow-up retrievals to fill those gaps.

This idea leads us to FLARE — Forward-Looking Active Retrieval augmented generation.

⚙️ How FLARE works#

Instead of doing one retrieval step, the agent operates in a loop:

  1. The model analyzes the current context and identifies information needs (needs).

  2. If evidence is missing, it generates new sub-queries (like “Charizard base attack stat” or “Pikachu move power”).

  3. The system performs hybrid retrieval for each need and expands the context.

  4. Once enough evidence is gathered, it produces a final grounded answer (final_answer).

This structured, multi-step reasoning loop makes retrieval adaptive and self-aware, reducing hallucination risk. See the paper by Jiang et al. (2023).

🧠 Key concepts demonstrated#

| Concept | Description |
| --- | --- |
| 🪞 Self-reflective retrieval | The model inspects its own context and identifies missing information. |
| 🔄 Iterative retrieval loop | It autonomously issues and resolves follow-up queries. |
| 📑 Structured reasoning schema | Outputs are typed (needs, final_answer), ensuring interpretability. |

🔬 What this code does#

  • Defines FLAREAnswer → a structured schema with two fields: needs (follow-up queries) and final_answer (final grounded output).

  • Implements flare_agent → a PydanticAI agent that follows the FLARE reasoning pattern.

  • Defines flare_answer() → runs the adaptive retrieval loop up to max_steps, adding new context at each iteration.

By the end, you’ll see how the agent autonomously plans, retrieves, and finalizes answers — a crucial building block toward fully agentic RAG systems that can think before they answer.

from typing import List, Optional

# Structured output: NEEDs + optional final field
class FLAREAnswer(BaseModel):
    needs: List[str] = Field(default_factory=list, description="Follow-up retrieval queries.")
    final_answer: Optional[str] = Field(default=None, description="Final answer when sufficient evidence.")

# Ask the model to fill the structured schema directly
flare_agent = Agent(
    model="openrouter:google/gemini-2.5-pro",
    system_prompt=(
        "You answer strictly from CONTEXT.\n"
        "- If info is missing, populate `needs` with 1-3 short search queries.\n"
        "- When sufficient evidence is present in the CONTEXT, leave `needs` empty and write `final_answer`.\n"
        "Do not invent facts; cite filenames in the answers like [pikachu.md]."
    ),
    output_type=FLAREAnswer,
    retries=2,
)

def flare_answer(question: str, max_steps: int = 3, per_need_k: int = 5) -> tuple[FLAREAnswer, List[str]]:
    context, used = "", []
    for _ in range(max_steps):
        msg = f"CONTEXT:\n{context}\n\nQUESTION: {question}\n"
        out = flare_agent.run_sync(msg).output

        # Done when no needs remain (guard against a literal 'null' string from the model)
        if not out.needs or (out.final_answer and out.final_answer != 'null'):
            return out, used

        # Retrieve for each needed query and expand the context
        new_ctx = []
        for q in out.needs:
            used.append(q)
            new_ctx.append(f"QUERY: {q}.\n\nRESPONSE:{tool_hybrid(q, k=per_need_k)}")
        context += ("\n\n" + "\n\n".join(new_ctx)) if new_ctx else ""

    # Last attempt: ask for a final answer with the accumulated context
    final = flare_agent.run_sync(f"CONTEXT:\n{context}\n\nQUESTION: {question}\n").output
    return final, used

# --- Example ---
q = "Who has more powerful normal type attack - Charizard or Pikachu?"
res, used = flare_answer(q)

rprint("Final:", res.final_answer)
rprint("Needs issued:", res.needs)
rprint("Used queries:", used)
16:47:57.417 flare_agent run
16:47:57.417   chat google/gemini-2.5-pro
16:48:03.617 Hybrid search called with query: Pikachu normal type attacks
16:48:04.817 Hybrid search called with query: Charizard normal type attacks
16:48:05.505 flare_agent run
16:48:05.505   chat google/gemini-2.5-pro
16:48:17.112 Hybrid search called with query: Pikachu base attack stat
16:48:18.197 Hybrid search called with query: Pikachu normal type moves
16:48:18.900 Hybrid search called with query: Charizard normal type moves
16:48:21.617 flare_agent run
16:48:21.617   chat google/gemini-2.5-pro
Final: Charizard has a more powerful normal type attack than Pikachu. Charizard can learn Slash, which has a power 
of 70, while Pikachu's most powerful normal type attack is Tackle, with a power of 40.  Additionally, Charizard's 
base attack stat is 84, compared to Pikachu's 55 .
Needs issued:
[]
Used queries:
[
    'Pikachu normal type attacks',
    'Charizard normal type attacks',
    'Pikachu base attack stat',
    'Pikachu normal type moves',
    'Charizard normal type moves'
]

🧭 Self-RAG — Self-Reflective Retrieval and Generation#

We’ve now seen how agents can perform adaptive retrieval loops (FLARE) and multi-query reasoning. The next frontier in agentic RAG is self-assessment — teaching the model to critique its own answers and refine them automatically. This is the core idea behind Self-RAG (Self-Reflective Retrieval-Augmented Generation).

🧠 What is Self-RAG?#

Self-RAG (Asai et al., 2023) introduces a closed-loop system where the LLM not only retrieves and answers, but also evaluates the quality of its own reasoning using structured feedback signals.

In this setup:

  1. The Generator agent (gen) produces an answer grounded in retrieved context.

  2. The Critic agent (crit) reviews that answer for:

    • Correctness score (0–1): how well the evidence backs the answer.

    • Hallucination risk: likelihood of unsupported or fabricated information.

    • Citation sufficiency: whether the cited documents justify the claim.

    • Missing evidence queries: follow-up retrievals needed to strengthen the answer.

  3. If the critic identifies gaps, the system issues additional retrievals, expands the context, and retries — iterating until confidence crosses a threshold or the loop limit is reached.

⚙️ What this code does#

  • Defines two structured outputs:

    • Ans → stores the answer, citations, and retrieval tool used.

    • Crit → stores evaluation metrics and follow-up needs.

  • Creates two agents:

    • gen (generator) — answers based on context.

    • crit (critic) — evaluates the generator’s response.

  • Implements selfrag() — a multi-turn self-reflective retrieval loop combining both:

    • The generator writes → the critic reviews → retrieval expands → iteration continues.

  • The process stops when:

    • The correctness score ≥ threshold (th),

    • Citations are adequate, and

    • Hallucination risk is low.

To learn more, go through the paper by Asai et al. (2023).

🎯 Why Self-RAG matters#

Self-RAG represents a step toward autonomous retrieval governance — systems that know what they don’t know and can ask the right follow-up questions. It reduces hallucinations, improves factual grounding, and creates interpretable reasoning logs (history).

In the next cell, we’ll run selfrag() on the Charizard vs Pikachu question and observe how the model iteratively critiques, retrieves, and converges to a reliable, cited answer.

import json 

class Ans(BaseModel):
    answer: str
    citations: List[str] = []
    used_tool: str = "hybrid"

class Crit(BaseModel):
    correctness_score: float = Field(description="How good the generation is as a float between 0 and 1")
    hallucination_risk: str
    citation_ok: bool
    missing_evidence_queries: List[str] = []

gen = Agent(
    model="openrouter:google/gemini-2.5-pro",
    system_prompt="Answer strictly from CONTEXT; if unknown say so. Cite filenames like [pikachu.md].",
    output_type=Ans,
    retries=3
)

crit = Agent(
    model="openrouter:google/gemini-2.5-pro",
    system_prompt=("Score correctness of ANSWER from CONTEXT (0-1), flag hallucination (low|medium|high), "
                   "whether citations suffice, and list up to 3 short follow-up queries. Correctness should be high only if the generation answers the query with required facts."),
    output_type=Crit, 
    retries=3
)

def selfrag(q: str, loops: int = 3, th: float = 0.8, k_init: int = 5, k_need: int = 5):
    ctx, hist, used = tool_hybrid(q, k_init), [], []
    for step in range(1, loops + 1):
        a = gen.run_sync(f"CONTEXT:\n{ctx}\n\nQUESTION: {q}").output
        c = crit.run_sync(f"CONTEXT:\n{ctx}\n\nQUESTION:\n{q}\n\nANSWER:\n{a.answer}").output

        state = {"step": step, "correctness": c.correctness_score, "risk": c.hallucination_risk,
                 "citation_ok": c.citation_ok, "missing": c.missing_evidence_queries, "generation": a.answer}

        logfire.info(json.dumps(state, indent=4).replace("{", "").replace("}", ""))

        hist.append(state)

        # Stop once the critic is satisfied: high score, adequate citations, low risk, no open needs.
        # Emptiness must be tested with `not`; `is []` always fails (identity against a fresh list).
        if (c.correctness_score >= th and c.citation_ok
                and c.hallucination_risk.lower() == "low" and not c.missing_evidence_queries):
            return {"final": a.answer, "used_tool": a.used_tool, "used_queries": used, "history": hist}

        for need in c.missing_evidence_queries:
            used.append(need)
            ctx += "\n\n" + tool_hybrid(need, k_need)
        if not c.missing_evidence_queries:  # low confidence but no needs → broaden once
            ctx += "\n\n" + tool_hybrid(q, max(3, k_need // 2))

    return {"final": a.answer, "used_tool": "hybrid",
            "used_queries": used, "history": hist, "note": "Stopped at max loops."}

q = "Who has more powerful normal type attack - Charizard or Pikachu?"
res = selfrag(q)

rprint("Final:\n", res["final"])
rprint("Used queries:\n", res["used_queries"])
rprint("History:\n", [{k: h[k] for k in ("step","correctness","risk","citation_ok")} for h in res["history"]])
17:10:33.784 Hybrid search called with query: Who has more powerful normal type attack - Charizard or Pikachu?
17:10:34.851 gen run
17:10:34.851   chat google/gemini-2.5-pro
17:10:40.378 crit run
17:10:40.378   chat google/gemini-2.5-pro
17:10:50.437 
    "step": 1,
    "correctness": 1.0,
    "risk": "low",
    "citation_ok": true,
    "missing": [
        "Pikachu attack stats",
        "Charizard attack stats",
        "Pikachu vs Charizard attack power"
    ],
    "generation": "I am sorry, but this document does not contain the answer to this question. \n"

17:10:50.437 Hybrid search called with query: Pikachu attack stats
17:10:52.099 Hybrid search called with query: Charizard attack stats
17:10:53.606 Hybrid search called with query: Pikachu vs Charizard attack power
17:10:54.324 gen run
17:10:54.324   chat google/gemini-2.5-pro
17:11:01.248 crit run
17:11:01.248   chat google/gemini-2.5-pro
17:11:12.324 
    "step": 2,
    "correctness": 1.0,
    "risk": "low",
    "citation_ok": true,
    "missing": [
        "Who has a higher Special Attack - Charizard or Pikachu?",
        "What are Charizard's types?",
        "What is Pikachu's base speed stat?"
    ],
    "generation": "Charizard has a base Attack stat of 84, while Pikachu has a base Attack stat of 55. Therefore, Charizard has a more powerful normal type attack."

17:11:12.324 Hybrid search called with query: Who has a higher Special Attack - Charizard or Pikachu?
17:11:13.391 Hybrid search called with query: What are Charizard's types?
17:11:14.576 Hybrid search called with query: What is Pikachu's base speed stat?
17:11:15.210 gen run
17:11:15.210   chat google/gemini-2.5-pro
17:11:28.238 crit run
17:11:28.238   chat google/gemini-2.5-pro
17:11:40.624 
    "step": 3,
    "correctness": 1.0,
    "risk": "low",
    "citation_ok": true,
    "missing": [
        "What is the difference between Pikachu and Partner Pikachu?",
        "What is Charizard's Special Attack stat?",
        "Which Pok\u00e9mon is faster, Charizard or Pikachu?"
    ],
    "generation": "Charizard has a more powerful normal attack than Pikachu, with a base attack stat of 84, compared to Pikachu's 55. [charizard.md, pikachu.md] Partner Pikachu, however, has an attack stat of 80. [pikachu.md]"

17:11:40.624 Hybrid search called with query: What is the difference between Pikachu and Partner Pikachu?
17:11:42.834 Hybrid search called with query: What is Charizard's Special Attack stat?
17:11:43.506 Hybrid search called with query: Which Pokémon is faster, Charizard or Pikachu?
Final:
 Charizard has a more powerful normal attack than Pikachu, with a base attack stat of 84, compared to Pikachu's 55.
Partner Pikachu, however, has an attack stat of 80. 
Used queries:

[
    'Pikachu attack stats',
    'Charizard attack stats',
    'Pikachu vs Charizard attack power',
    'Who has a higher Special Attack - Charizard or Pikachu?',
    "What are Charizard's types?",
    "What is Pikachu's base speed stat?",
    'What is the difference between Pikachu and Partner Pikachu?',
    "What is Charizard's Special Attack stat?",
    'Which Pokémon is faster, Charizard or Pikachu?'
]
History:

[
    {'step': 1, 'correctness': 1.0, 'risk': 'low', 'citation_ok': True},
    {'step': 2, 'correctness': 1.0, 'risk': 'low', 'citation_ok': True},
    {'step': 3, 'correctness': 1.0, 'risk': 'low', 'citation_ok': True}
]

🧩 Late Chunking — Adaptive Retrieval Without Preprocessing Overhead#

Up to this point, all our retrieval methods assumed we had pre-chunked the entire corpus ahead of time. While effective, this approach can be wasteful — it embeds and stores thousands of text fragments even for small knowledge bases. In large-scale systems, pre-chunking becomes expensive in both storage and embedding cost.

To overcome this, we now explore Late Chunking — also called Dynamic or On-Demand Chunking.

⚙️ What is Late Chunking?#

Instead of embedding every document in advance, we:

  1. Embed entire documents at a coarse level (1 vector per document).

  2. When a query arrives:

    • Rank documents by similarity to the query embedding.

    • Select top-N documents likely to contain relevant information.

    • Chunk and embed only those documents, then rerank their chunks by semantic similarity.

  3. Return the top-k most relevant chunks as context for the answering agent.

This approach shifts the chunking process after initial retrieval — hence the name Late Chunking.

💡 Why it helps#

| Benefit | Explanation |
| --- | --- |
| 💰 Efficiency | Only a few documents are chunked and embedded per query → major cost savings. |
| ⚡ Speed | Avoids loading or embedding a large number of irrelevant chunks. |
| 🎯 Precision | Focuses chunking effort on documents already deemed semantically relevant. |
| 🧠 Scalability | Suitable for large corpora or dynamic datasets (e.g., fresh documents, evolving knowledge bases). |

More details in the paper by Günther et al. (2024).

🧠 What this code does#

  1. Builds lightweight document-level embeddings (1 vector per .md file).

  2. Defines late_chunk_search():

    • Retrieves top documents based on query–doc similarity.

    • Chunks only those documents and embeds them on the fly.

    • Reranks chunks to surface the most semantically relevant passages.

  3. Wraps it as a tool late_chunk_context() used by latechunking_agent.

It typically achieves similar accuracy with a fraction of the compute and memory footprint.

def _embed(texts: List[str]) -> List[List[float]]:
    return [e.embedding for e in client.embeddings.create(model=EMBED_MODEL, input=texts).data]

def _cos(a, b) -> float:
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def load_docs(dir_path: str) -> List[Dict]:
    return [{"filename": f, "content": open(os.path.join(dir_path, f), encoding="utf-8").read()}
            for f in os.listdir(dir_path) if f.endswith(".md")]

# 1) Build a lightweight doc-level index in memory (no pre-chunking)
docs = load_docs(str(DATA_DIR))
doc_vecs = _embed([d["content"] for d in docs])

# 2) Late-chunk retrieval: rank docs by query→doc similarity, then chunk only top docs and rerank chunks
def late_chunk_search(query: str, top_docs=3, chunk_size=700, overlap=120, top_chunks=6) -> List[Dict]:
    logfire.info(f'Late chunking search with query: {query}')
    qv = _embed([query])[0]

    # rank full documents
    doc_scores = [(_cos(qv, v), i) for i, v in enumerate(doc_vecs)]
    top_doc_idxs = [i for _, i in sorted(doc_scores, reverse=True)[:top_docs]]

    # chunk only selected docs
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=overlap)
    chunks, metas = [], []
    for i in top_doc_idxs:
        for c in splitter.split_text(docs[i]["content"]):
            chunks.append(c)
            metas.append({"filename": docs[i]["filename"]})

    # rerank chunks by query similarity (embed chunks once)
    chunk_vecs = _embed(chunks) if chunks else []
    ranked = sorted(
        [{"content": c, "metadata": m, "_score": _cos(qv, v)} for c, m, v in zip(chunks, metas, chunk_vecs)],
        key=lambda x: x["_score"], reverse=True
    )
    
    return ranked[:top_chunks]

# 3) Build a context string for your RAG agent
def late_chunk_context(query: str, **kwargs) -> str:
    hits = late_chunk_search(query, **kwargs)
    return build_context_from_results(hits)


latechunking_agent = Agent(
    model=CHAT_MODEL,
    system_prompt=(
        "You answer strictly from the provided CONTEXT.  "
        "Always cite the filenames you relied on, e.g., [pikachu.md]."
    ),
    output_type=RAGAnswer,
    tools=[late_chunk_context],
    retries=3
)

q = "Who has more powerful normal type attack - Charizard or Pikachu?"
rag_response = latechunking_agent.run_sync(q)

rprint(rag_response)
17:13:47.398 latechunking_agent run
17:13:47.414   chat x-ai/grok-4-fast
17:13:51.037   running 1 tool
17:13:51.037     running tool: late_chunk_context
17:13:51.037       Late chunking search with query: Charizard vs Pikachu normal type moves power
17:13:57.576   chat x-ai/grok-4-fast
17:13:59.041   running 1 tool
17:13:59.042     running tool: late_chunk_context
17:13:59.042       Late chunking search with query: Charizard normal type moves
17:14:06.736   chat x-ai/grok-4-fast
17:14:08.616   running 1 tool
17:14:08.624     running tool: late_chunk_context
17:14:08.624       Late chunking search with query: Pikachu normal type moves power
17:14:17.083   chat x-ai/grok-4-fast
17:14:21.135   running 1 tool
17:14:21.135     running tool: late_chunk_context
17:14:21.135       Late chunking search with query: strongest normal type move Charizard Pikachu base power
17:14:35.013   chat x-ai/grok-4-fast
AgentRunResult(
    output=RAGAnswer(
        answer='Based on the provided context, specific details on the base power of normal-type moves for 
Charizard and Pikachu are not directly listed in the retrieved chunks. However, from general Pokémon knowledge 
referenced in the database structures, Charizard can learn higher-base-power moves like Slash (70) naturally, while
Pikachu typically learns lower-power moves like Quick Attack (40). Thus, Charizard has a more powerful normal-type 
attack.',
        used_tool='vector',
        citation='[charizard.md] [pikachu.md]'
    )
)
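To see where the savings come from, it helps to compare embedding-call volumes under some assumed corpus numbers (all figures below are illustrative, not measured):

```python
# Back-of-envelope comparison of pre-chunking vs late chunking (illustrative numbers).
n_docs = 1_000          # documents in the corpus
chunks_per_doc = 20     # average chunks per document
queries = 100           # queries served
top_docs = 3            # documents chunked per query in late chunking

# Pre-chunking: embed every chunk once, up front.
prechunk_embeds = n_docs * chunks_per_doc

# Late chunking: one doc-level vector each, plus on-demand chunk embeddings per query.
latechunk_embeds = n_docs + queries * top_docs * chunks_per_doc

print(prechunk_embeds, latechunk_embeds)  # → 20000 7000
```

The crossover depends on query volume — with enough repeated queries, re-embedding popular documents on the fly erodes the advantage, which is why real deployments typically cache chunk embeddings.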

🏁 Conclusion — From Simple RAG to Adaptive, Agentic Retrieval#

In this tutorial, we built a complete end-to-end Retrieval-Augmented Generation (RAG) pipeline — starting from basic semantic search to advanced agentic techniques that reason about how to retrieve.

In the next tutorial, we’ll go one level deeper into relational reasoning through GraphRAG — a paradigm that connects retrieved knowledge not just by similarity, but by semantic relationships and causal links.

🧩 Key takeaway#

RAG is not just about retrieval — it’s about reasoning with evidence. From hybrid searches to self-reflective loops, each enhancement makes the agent more reliable, interpretable, and adaptive.

Next stop: GraphRAG — where your agents will not just fetch information, but understand relationships, infer causality, and build knowledge networks.

🕸️ From chunks → to connections → to cognition.