04. Retrieval-Augmented Generation#
Large Language Models (LLMs) are brilliant generalists — they’ve read the internet and can reason across domains — but they don’t know what they haven’t seen. Their parameters store general knowledge, not private, up-to-date, or domain-specific facts. Retrieval-Augmented Generation (RAG) bridges that gap. It combines:
Retrieval – find relevant information from an external knowledge base (e.g., docs, databases, websites).
Generation – pass that retrieved context into an LLM to ground its answer.
This simple loop — retrieve → augment → generate — makes the model:
More accurate (uses real facts, not hallucinations)
More current (retrieval can include recent or proprietary data)
Cheaper & smaller (you don’t need to fine-tune large models for every dataset)
Explainable (you can trace answers back to the retrieved sources)
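Conceptually, the whole loop fits in a few lines. Here is a minimal sketch, assuming hypothetical retrieve() and llm() helpers (we build real versions of both later in this tutorial):
def rag_answer(question: str) -> str:
    chunks = retrieve(question, top_k=5)    # 1. Retrieve: fetch relevant chunks (hypothetical helper)
    context = "\n---\n".join(chunks)        # 2. Augment: pack the evidence into the prompt
    prompt = f"Answer only from this context:\n{context}\n\nQ: {question}"
    return llm(prompt)                      # 3. Generate: grounded answer (hypothetical helper)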
RAG is now the foundation of modern enterprise AI systems, powering products like search-chat hybrids, coding copilots, knowledge assistants, and customer-support bots. In short: RAG makes LLMs grounded, trustworthy, and useful in the real world.
I highly recommend watching explanations of RAG from IBM and Cole Medin.
Scenario: Pokémon!#
Let’s take something seemingly simple — asking questions about Pokémon species like Pikachu, Charizard, or Mewtwo. At first glance, LLMs might seem to know this, but there are hidden challenges:
| Problem | Why it’s hard for an LLM |
|---|---|
| Data freshness | Game mechanics, move sets, and forms change with every generation — LLMs trained on older data may be outdated. |
| Structured facts | Evolution trees, base stats, and type matchups are stored in tables, not prose — hard for models to memorize precisely. |
| Ambiguity | Words like “form”, “Mega Evolution”, “TM”, or “base stats” require domain-specific interpretation. |
| Compositional queries | “Which Pokémon evolves into Pikachu?” or “List Charizard’s Mega forms and their base stats” require multiple lookups and reasoning steps. |
When we ask these zero-shot, even the best LLMs often hallucinate:
inventing fake evolution lines,
mixing up stats across generations,
or returning vague, generic answers.
That’s where RAG shines:
We retrieve the real Pokémon data (from pokemondb.net in this tutorial).
We chunk and embed those markdown pages in a vector database (LanceDB).
Then, for each query, we retrieve the most relevant chunks and let the LLM reason grounded in evidence.
So instead of guessing, our agent reads and reasons. This setup scales naturally to enterprise settings — from Pokémon encyclopedias to product catalogs, regulatory documents, or customer knowledge bases.
For our data, we use PokémonDB. We’ll fetch the pages for pichu, pikachu, raichu, charizard, mewtwo, and slowpoke and save them as .md files. These pages are HTML; we’ll convert them to Markdown for easier chunking.
import requests, pathlib
from markdownify import markdownify as mdify

# Save data for common Pokémon
POKEMON = [
    ("pichu", "https://pokemondb.net/pokedex/pichu"),
    ("pikachu", "https://pokemondb.net/pokedex/pikachu"),
    ("raichu", "https://pokemondb.net/pokedex/raichu"),
    ("charizard", "https://pokemondb.net/pokedex/charizard"),
    ("mewtwo", "https://pokemondb.net/pokedex/mewtwo"),
    ("slowpoke", "https://pokemondb.net/pokedex/slowpoke"),
]

def fetch_markdown(url: str) -> str:
    html = requests.get(url, timeout=30).text
    md = mdify(html, heading_style="ATX")
    return md

DATA_DIR = pathlib.Path("./data/pokemon_md")
DATA_DIR.mkdir(parents=True, exist_ok=True)  # ensure the target directory exists before writing

downloaded = []
for name, url in POKEMON:
    md_text = fetch_markdown(url)
    path = DATA_DIR / f"{name}.md"
    path.write_text(md_text, encoding="utf-8")
    downloaded.append((name, str(path), url))
print(f"Saved {len(downloaded)} markdown files → {DATA_DIR}")
Saved 6 markdown files → data\pokemon_md
Let’s see what a sample of this data page looks like.
from IPython.display import Markdown, display
import pathlib
md_path = pathlib.Path("./data/pokemon_md/pikachu.md")
display(Markdown(md_path.read_text(encoding="utf-8")[3000:4000])) # 1,000 characters starting at offset 3000
Pokédex data
| | | | — | — | | National № | 0025 | | Type | Electric | | Species | Mouse Pokémon | | Height | 0.4 m (1′04″) | | Weight | 6.0 kg (13.2 lbs) | | Abilities | 1. Static Lightning Rod (hidden ability) | | Local № | 0025 (Yellow/Red/Blue) 0022 (Gold/Silver/Crystal) 0156 (Ruby/Sapphire/Emerald) 0025 (FireRed/LeafGreen) 0104 (Diamond/Pearl) 0104 (Platinum) 0022 (HeartGold/SoulSilver) 0036 (X/Y — Central Kalos) 0163 (Omega Ruby/Alpha Sapphire) 0025 (Sun/Moon — Alola dex) 0032 (U.Sun/U.Moon — Alola dex) 0025 (Let’s Go Pikachu/Let’s Go Eevee) 0194 (Sword/Shield) 0104 (Brilliant Diamond/Shining Pearl) 0056 (Legends: Arceus) 0074
Preparing our Knowledge Base#
Now that we’ve downloaded Pokémon data as .md files (for Pikachu, Charizard, Mewtwo, etc.),
we need to split the text into smaller chunks before embedding it into a vector database.
Why?
LLMs and embeddings have context limits — we can’t feed the entire document at once.
Smaller, semantically coherent chunks help retrieval systems match relevant sections precisely.
Chunking also improves Recall@k, reduces latency, and enables embedding reuse during updates.
We’ll try two common splitting strategies:
| Splitter | Description | When to use |
|---|---|---|
| 🧩 RecursiveCharacterTextSplitter | Splits text purely by length, preserving overlap. | Generic text without structure. |
| 🧱 MarkdownHeaderTextSplitter | Splits along Markdown headers (#, ##, ###). | Structured content (docs, wikis, pages like Pokémon DB). |
After chunking, we’ll have two parallel sets of documents:
docs_rec: recursively chunked plain text
docs_md: structure-aware markdown chunks
These will later be embedded into LanceDB and compared for retrieval quality.
from typing import List, Dict, Any, Optional, Tuple
import os
from langchain_text_splitters import (
MarkdownHeaderTextSplitter,
RecursiveCharacterTextSplitter,
)
# --- Chunking params ---
CHUNK_SIZE = 700
CHUNK_OVERLAP = 120
# --- Eval/profiling ---
EVAL_K_LIST = [1, 3, 5]
EMBEDDING_COST_PER_1K = float(os.getenv("EMBED_COST_PER_1K", "0.00013")) # USD
PRINT_TOP_N = 5
def read_files_as_object_array(directory_path: str) -> List[Dict[str, str]]:
    out = []
    for fname in os.listdir(directory_path):
        fpath = os.path.join(directory_path, fname)
        if os.path.isfile(fpath):
            with open(fpath, "r", encoding="utf-8") as f:
                out.append({"filename": fname, "content": f.read()})
    return out

def recursive_text_splitter(data, chunk_size, overlap_size):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=overlap_size, length_function=len, is_separator_regex=False
    )
    texts = splitter.create_documents(
        [f"{d['filename']}\n{d['content']}" for d in data],
        metadatas=[{"filename": d["filename"]} for d in data],
    )
    return texts

def markdown_splitter(data, chunk_size, overlap_size):
    md_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")], strip_headers=True
    )
    md_splits = []
    for d in data:
        splits = md_splitter.split_text(d["content"])
        for s in splits:
            s.metadata["filename"] = d["filename"]
        md_splits.extend(splits)
    # Enforce the size budget on header-based splits that are still too large
    size_limiter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=overlap_size, length_function=len, is_separator_regex=False
    )
    return size_limiter.split_documents(md_splits)
docs_raw = read_files_as_object_array(str(DATA_DIR))
docs_rec = recursive_text_splitter(docs_raw, CHUNK_SIZE, CHUNK_OVERLAP)
docs_md = markdown_splitter(docs_raw, CHUNK_SIZE, CHUNK_OVERLAP)
print(f"Recursive chunks: {len(docs_rec)} | Markdown+size chunks: {len(docs_md)}")
Recursive chunks: 526 | Markdown+size chunks: 509
Building our Vector Database#
Before our agent can “retrieve” knowledge, we need a database that understands vectors — numerical representations of text meaning (embeddings). That’s where LanceDB comes in.
🔍 What is LanceDB?#
LanceDB is a lightweight, local-first vector database built on the Lance columnar format.
It’s designed for:
Storing and searching high-dimensional embeddings (like text or image vectors).
Performing semantic similarity queries (e.g., “find texts most similar to this query”).
Hybrid retrieval: combining full-text search (BM25 / Tantivy) and vector search.
Speed and simplicity — it runs locally (no separate server needed).
🧠 What we’ll do here#
Embed all Pokémon chunks using an OpenAI-compatible embedding model served via OpenRouter (qwen/qwen3-embedding-8b by default).
Create / connect to a LanceDB table named "pokemon_pages".
Insert each chunk’s text, vector, and metadata (like filename & splitter type).
Build a full-text search (FTS) index for keyword lookups alongside vector search.
After this step, we’ll have a ready-to-query LanceDB store — the foundation for our Retrieval-Augmented Generation (RAG) pipeline.
from dotenv import load_dotenv
from openai import OpenAI
import lancedb
import uuid
load_dotenv()
OPENAI_BASE_URL = "https://openrouter.ai/api/v1"
EMBED_MODEL = os.getenv("EMBEDDINGS_MODEL", "qwen/qwen3-embedding-8b")
client = OpenAI(base_url=OPENAI_BASE_URL, api_key=os.getenv('OPENROUTER_API_KEY'))
DB_URI = "./db/sample-lancedb"
TABLE_NAME_TMP = "pokemon_pages_tmp"
TABLE_NAME = "pokemon_pages"
def embed_texts(texts: List[str], model: str = EMBED_MODEL, batch_size: int = 64) -> List[List[float]]:
    """
    Returns a list of embedding vectors. Uses an OpenAI-compatible client pointed at OpenRouter.
    """
    out = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        resp = client.embeddings.create(model=model, input=batch)
        out.extend([e.embedding for e in resp.data])
    return out

db = lancedb.connect(DB_URI)
try:
    tbl = db.open_table(TABLE_NAME_TMP)
    print(f"Loaded LanceDB at {DB_URI} (table={TABLE_NAME_TMP})")
except Exception:  # table doesn't exist yet → build it
    all_chunks = []
    for d in docs_rec:
        all_chunks.append({"id": str(uuid.uuid4()), "content": d.page_content,
                           "metadata": {"filename": d.metadata.get("filename", ""), "splitter": "recursive"}})
    for d in docs_md:
        all_chunks.append({"id": str(uuid.uuid4()), "content": d.page_content,
                           "metadata": {"filename": d.metadata.get("filename", ""), "splitter": "markdown"}})
    print("Embedding chunks...")
    vectors = embed_texts([c["content"] for c in all_chunks])
    for c, v in zip(all_chunks, vectors):
        c["vector"] = v
    tbl = db.create_table(TABLE_NAME_TMP, data=all_chunks)
    tbl.create_fts_index("content")
    print(f"Indexed {len(all_chunks)} chunks into LanceDB at {DB_URI} (table={TABLE_NAME_TMP})")
Loaded LanceDB at ./db/sample-lancedb (table=pokemon_pages_tmp)
Searching the Knowledge Base — Semantic vs Keyword Search#
Now that our Pokémon chunks are stored in LanceDB, let’s learn how to search through them.
🧭 What is Semantic Search?#
Traditional search engines (like keyword or BM25 search) match exact words or phrases in your query. But LLMs and embeddings represent meaning as vectors in high-dimensional space — a semantic space.
In semantic search, we:
Embed the query into a vector (using the same embedding model as our database).
Measure its closeness to all stored vectors (chunks) — using cosine similarity or dot product.
Retrieve the most semantically similar chunks, even if they don’t share exact words.
For example:
Query → “Who evolves into Pikachu?”
Closest text → “Pichu evolves into Pikachu when leveled up with high friendship.”
Even if the word “who” or “friendship” doesn’t appear in both, their embeddings are close in the semantic space, allowing meaning-based retrieval. I also recommend watching IBM’s video on vector search.
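To make “closeness” concrete, here is a minimal sketch of cosine similarity: the normalized dot product of two embedding vectors (toy 3-dimensional vectors for illustration; real embeddings have thousands of dimensions):
import numpy as np

def cosine_similarity(a, b):
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors: the evolution sentence lands closer to the query
# than an unrelated sentence, even without shared keywords.
query_vec = [0.9, 0.1, 0.2]   # "Who evolves into Pikachu?"
relevant  = [0.8, 0.2, 0.3]   # "Pichu evolves into Pikachu..."
unrelated = [0.1, 0.9, 0.7]   # "Charizard has large wings..."
print(cosine_similarity(query_vec, relevant))   # ≈ 0.98 (close in meaning)
print(cosine_similarity(query_vec, unrelated))  # ≈ 0.30 (far in meaning)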
🧩 Three search modes we’ll explore#
| Method | Description | Strength |
|---|---|---|
| 🔡 FTS (Full Text Search) | Matches literal terms using BM25 (like keyword search). | Great for rare names, exact filters, or numeric queries. |
| 🧠 Vector Search | Uses embedding similarity in high-dimensional space. | Captures meaning, paraphrases, and context. |
| ⚡ Hybrid Search | Fuses both (via Reciprocal Rank Fusion). | Balances precision (FTS) and recall (semantic). |
The next cell defines functions for each search type and prints their top results side by side —
so you can see how semantic closeness changes the quality of retrieval.
from rich import print as rprint
def perform_vector_search(query: str, pokemon: Optional[str] = None, top_k: int = 5):
    emb = client.embeddings.create(model=EMBED_MODEL, input=[query]).data[0].embedding
    qb = tbl.search(emb).metric('cosine').limit(top_k).select(["content", "metadata", "_distance", "vector"])
    if pokemon:
        qb = qb.where(f"metadata.filename = '{pokemon}.md'")
    return qb.to_list()

def perform_fts_search(query: str, pokemon: Optional[str] = None, top_k: int = 5):
    qb = tbl.search(query, query_type="fts").limit(top_k).select(["content", "metadata", "_score", "vector"])
    if pokemon:
        qb = qb.where(f"metadata.filename = '{pokemon}.md'", prefilter=True)
    return qb.to_list()

def reciprocal_rank_fusion(results_a, results_b, k: int = 60):
    # RRF: each result contributes 1/(k + rank) from every list it appears in
    def rid(x): return hash(x["content"])
    scores = {}
    for i, r in enumerate(results_a):
        scores[rid(r)] = scores.get(rid(r), 0) + 1.0 / (k + i + 1)
    for i, r in enumerate(results_b):
        scores[rid(r)] = scores.get(rid(r), 0) + 1.0 / (k + i + 1)
    uniq = {}
    for r in results_a + results_b:
        uniq[rid(r)] = r
    ranked = sorted(uniq.values(), key=lambda r: scores[rid(r)], reverse=True)
    return ranked

def perform_hybrid_search(query: str, pokemon: Optional[str] = None, top_k: int = 5):
    vres = perform_vector_search(query, pokemon, top_k=top_k)
    fres = perform_fts_search(query, pokemon, top_k=top_k)
    fused = reciprocal_rank_fusion(vres, fres)[:top_k]
    return fused

queries = [
    "Which Pokémon evolves into Pikachu?",
    "Show Mega evolutions for Charizard",
]
for q in queries:
    rprint(f"\n[bold green]Query:[/] {q}")
    v = perform_vector_search(q, top_k=3)
    f = perform_fts_search(q, top_k=3)
    h = perform_hybrid_search(q, top_k=3)
    rprint("[magenta]Vector top1:[/]", v[0]["metadata"]["filename"], "→", v[0]["content"][:120].replace("\n", " "))
    rprint("[magenta]FTS top1:[/]", f[0]["metadata"]["filename"], "→", f[0]["content"][:120].replace("\n", " "))
    rprint("[magenta]Hybrid top1:[/]", h[0]["metadata"]["filename"], "→", h[0]["content"][:120].replace("\n", " "))
Query: Which Pokémon evolves into Pikachu?
Vector top1: raichu.md → | | | | --- | --- | | Red | [Cerulean Cave](/location/kanto-cerulean-cave) | | Blue | [Cerulean Cave](/location/kanto-
FTS top1: pichu.md → * [What is the average level for pokemon to evolve by happiness?](/pokebase/483/what-is-the-average-level-for-pokemon-to
Hybrid top1: raichu.md → | | | | --- | --- | | Red | [Cerulean Cave](/location/kanto-cerulean-cave) | | Blue | [Cerulean Cave](/location/kanto-
Query: Show Mega evolutions for Charizard
Vector top1: charizard.md → ## Moves learned by Charizard
FTS top1: charizard.md → The design of *Charizard* appears to draw inspiration from dragons of European
folklore, with its large wings, horned he
Hybrid top1: charizard.md → ## Moves learned by Charizard
Evaluating Retrieval Quality#
Once our Pokémon chunks are embedded and searchable, we need to measure how well the retrieval step is working. Even the best LLM can only answer correctly if the right information was fetched first.
🧩 Why Evaluation Matters#
RAG systems rely on two main components:
Retrieval – finding the most relevant chunks from the knowledge base.
Generation – the LLM reasoning over those chunks to answer questions.
If retrieval fails (missing or irrelevant chunks), generation will inevitably fail too — no matter how smart the model is.
That’s why retrieval metrics are critical for diagnosing performance.
📊 Metrics - Coverage, Recall, etc.#
| Metric | What it measures | Why it matters |
|---|---|---|
| Coverage Ratio | How much of the original document text is preserved in the chunked dataset. | Ensures chunking didn’t lose too much information. |
| Recall@k | Whether at least one relevant chunk appears in the top-k retrieved results. | Tests if the search finds what we need (completeness). |
| MRR (Mean Reciprocal Rank) | How early in the ranking the first relevant chunk appears. | Rewards search methods that bring correct answers to the top. |
| Latency (later) | Time taken for each search query. | Balances quality vs speed for production systems. |
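Before running the real evaluation code, here is a tiny worked example of the two ranking metrics, assuming a toy query whose only relevant file is pichu.md:
retrieved = ["raichu.md", "pichu.md", "pikachu.md"]   # toy ranking from a search
relevant = {"pichu.md"}                               # toy ground truth

# Recall@k: is any relevant file in the top-k?
recall_at_1 = 1.0 if any(f in relevant for f in retrieved[:1]) else 0.0   # 0.0
recall_at_3 = 1.0 if any(f in relevant for f in retrieved[:3]) else 0.0   # 1.0
# Reciprocal rank: 1/position of the first relevant hit (here rank 2 → 0.5);
# MRR averages this value over all queries.
rr = next((1.0 / i for i, f in enumerate(retrieved, 1) if f in relevant), 0.0)  # 0.5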
In the next cell, we’ll start with coverage statistics — verifying that our chunking step retains most of the source content for both splitters (recursive and markdown). Ratios slightly above 1.0 are expected, since chunk overlap duplicates some characters at chunk boundaries.
This acts as a sanity check before moving on to deeper retrieval evaluation.
import pandas as pd

GROUND_TRUTH = {
    "Which Pokémon evolves into Pikachu?": ["pichu.md"],
    "Which Pokémon learns Volt Tackle via breeding/light ball mechanics?": ["pikachu.md", "pichu.md"],
    "Show Mega evolutions for Charizard": ["charizard.md"],
    "Base stats of Mewtwo": ["mewtwo.md"],
    "What is Mewtwo’s base stat total (BST)?": ["mewtwo.md"],
    "What is Slowpoke's type?": ["slowpoke.md"],
    "What moves can Raichu learn by TM?": ["raichu.md"],
}

def coverage_stats(docs_raw, chunks) -> Dict[str, float]:
    total_chars = sum(len(d["content"]) for d in docs_raw)
    chunk_chars = sum(len(c.page_content) for c in chunks)
    return {
        "total_chars": total_chars,
        "chunk_chars": chunk_chars,
        "coverage_ratio": chunk_chars / total_chars if total_chars else 0.0,
    }

cov_rec = coverage_stats(docs_raw, docs_rec)
cov_md = coverage_stats(docs_raw, docs_md)
pd.DataFrame([
    {"splitter": "recursive", **cov_rec},
    {"splitter": "markdown", **cov_md},
])
| | splitter | total_chars | chunk_chars | coverage_ratio |
|---|---|---|---|---|
| 0 | recursive | 249998 | 264501 | 1.058012 |
| 1 | markdown | 249998 | 259329 | 1.037324 |
import time

def eval_search(queries: List[str], search_fn, ks=(1, 3, 5)) -> pd.DataFrame:
    rows = []
    for q in queries:
        t0 = time.time()
        results = search_fn(q, top_k=max(ks))
        elapsed = time.time() - t0
        filenames = [r["metadata"]["filename"] for r in results]
        gt = set(GROUND_TRUTH[q])
        recs = {}
        for k in ks:
            recs[f"Recall@{k}"] = 1.0 if any(f in gt for f in filenames[:k]) else 0.0
        rr = 0.0
        for i, f in enumerate(filenames, start=1):
            if f in gt:
                rr = 1.0 / i
                break
        rows.append({"query": q, "latency_ms": round(1000 * elapsed, 2), "MRR": rr, **recs})
    return pd.DataFrame(rows)

df_vec = eval_search(list(GROUND_TRUTH.keys()), perform_vector_search, ks=tuple(EVAL_K_LIST))
df_fts = eval_search(list(GROUND_TRUTH.keys()), perform_fts_search, ks=tuple(EVAL_K_LIST))
df_hyb = eval_search(list(GROUND_TRUTH.keys()), perform_hybrid_search, ks=tuple(EVAL_K_LIST))

summary = pd.DataFrame({
    "Method": ["Vector", "FTS", "Hybrid"],
    "MRR(mean)": [df_vec["MRR"].mean(), df_fts["MRR"].mean(), df_hyb["MRR"].mean()],
    **{f"Recall@{k}(mean)": [df_vec[f"Recall@{k}"].mean(), df_fts[f"Recall@{k}"].mean(), df_hyb[f"Recall@{k}"].mean()] for k in EVAL_K_LIST},
    "Latency(ms, mean)": [df_vec["latency_ms"].mean(), df_fts["latency_ms"].mean(), df_hyb["latency_ms"].mean()],
}).round(3)

rprint("[bold]Per-query (Vector):[/]"); display(df_vec)
rprint("[bold]Per-query (FTS):[/]"); display(df_fts)
rprint("[bold]Per-query (Hybrid):[/]"); display(df_hyb)
rprint("[bold green]Summary:[/]"); display(summary)
Per-query (Vector):
| | query | latency_ms | MRR | Recall@1 | Recall@3 | Recall@5 |
|---|---|---|---|---|---|---|
| 0 | Which Pokémon evolves into Pikachu? | 586.07 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | Which Pokémon learns Volt Tackle via breeding/... | 501.79 | 0.5 | 0.0 | 1.0 | 1.0 |
| 2 | Show Mega evolutions for Charizard | 4004.79 | 1.0 | 1.0 | 1.0 | 1.0 |
| 3 | Base stats of Mewtwo | 464.45 | 1.0 | 1.0 | 1.0 | 1.0 |
| 4 | What is Mewtwo’s base stat total (BST)? | 533.87 | 1.0 | 1.0 | 1.0 | 1.0 |
| 5 | What is Slowpoke's type? | 886.77 | 1.0 | 1.0 | 1.0 | 1.0 |
| 6 | What moves can Raichu learn by TM? | 2282.72 | 1.0 | 1.0 | 1.0 | 1.0 |
Per-query (FTS):
| | query | latency_ms | MRR | Recall@1 | Recall@3 | Recall@5 |
|---|---|---|---|---|---|---|
| 0 | Which Pokémon evolves into Pikachu? | 33.01 | 1.0 | 1.0 | 1.0 | 1.0 |
| 1 | Which Pokémon learns Volt Tackle via breeding/... | 17.15 | 1.0 | 1.0 | 1.0 | 1.0 |
| 2 | Show Mega evolutions for Charizard | 0.00 | 1.0 | 1.0 | 1.0 | 1.0 |
| 3 | Base stats of Mewtwo | 19.67 | 1.0 | 1.0 | 1.0 | 1.0 |
| 4 | What is Mewtwo’s base stat total (BST)? | 13.68 | 1.0 | 1.0 | 1.0 | 1.0 |
| 5 | What is Slowpoke's type? | 18.08 | 1.0 | 1.0 | 1.0 | 1.0 |
| 6 | What moves can Raichu learn by TM? | 15.24 | 0.0 | 0.0 | 0.0 | 0.0 |
Per-query (Hybrid):
| | query | latency_ms | MRR | Recall@1 | Recall@3 | Recall@5 |
|---|---|---|---|---|---|---|
| 0 | Which Pokémon evolves into Pikachu? | 533.85 | 0.333333 | 0.0 | 1.0 | 1.0 |
| 1 | Which Pokémon learns Volt Tackle via breeding/... | 500.46 | 1.000000 | 1.0 | 1.0 | 1.0 |
| 2 | Show Mega evolutions for Charizard | 667.26 | 1.000000 | 1.0 | 1.0 | 1.0 |
| 3 | Base stats of Mewtwo | 515.35 | 1.000000 | 1.0 | 1.0 | 1.0 |
| 4 | What is Mewtwo’s base stat total (BST)? | 487.06 | 1.000000 | 1.0 | 1.0 | 1.0 |
| 5 | What is Slowpoke's type? | 4685.00 | 1.000000 | 1.0 | 1.0 | 1.0 |
| 6 | What moves can Raichu learn by TM? | 491.01 | 1.000000 | 1.0 | 1.0 | 1.0 |
Summary:
| | Method | MRR(mean) | Recall@1(mean) | Recall@3(mean) | Recall@5(mean) | Latency(ms, mean) |
|---|---|---|---|---|---|---|
| 0 | Vector | 0.786 | 0.714 | 0.857 | 0.857 | 1322.923 |
| 1 | FTS | 0.857 | 0.857 | 0.857 | 0.857 | 16.690 |
| 2 | Hybrid | 0.905 | 0.857 | 1.000 | 1.000 | 1125.713 |
🔎 Interpreting the Results#
TL;DR: Hybrid wins on quality; FTS wins on speed.
Hybrid (MRR=0.90, Recall@3/5=1.0): Best overall retrieval quality. Reciprocal Rank Fusion (RRF) captures semantic matches that FTS misses while still surfacing exact-term hits. Ideal default for general-purpose RAG.
Vector (MRR=0.79, Recall@5=0.86, ~1300 ms): Strong semantic coverage—great when users paraphrase. Slower than FTS because each query needs an embedding call plus nearest-neighbor search.
FTS (MRR=0.85, Recall@k ≤ 0.86, ~17 ms): Blazing fast and excels for exact names, forms, numbers (e.g., “TM”, “Mega”). But it can miss paraphrases or semantic matches.
What to deploy
Default: Hybrid.
Query routing: Use FTS for quoted phrases/IDs/numerics; otherwise Hybrid.
Latency-sensitive paths: FTS with a semantic fallback on low-confidence.
Improving Precision — What is Reranking and Why It Helps#
Even after combining vector and keyword search, our top results may still include partially relevant or redundant chunks. That’s where reranking comes in — a crucial final step in the retrieval pipeline.
🎯 What is Reranking? Reranking means taking the initial set of retrieved results (e.g., top 20) and reordering them using a more accurate relevance model. This model computes a finer-grained similarity between the query and each retrieved chunk.
Common reranking approaches:
Embedding-based cosine similarity (lightweight) — compares the query vector with each chunk’s vector (as we’ll do here).
Cross-encoder models (heavier) — feed [query, passage] pairs into an LLM or BERT-like model for deeper contextual matching (see the sketch after this list).
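We’ll use the lightweight embedding-based option in this tutorial. For reference, here is a minimal sketch of the cross-encoder variant using the sentence-transformers library (an extra dependency, not used elsewhere in this notebook; the checkpoint name is one common public reranker):
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def cross_encoder_rerank(query, results, top_k=5):
    # Score each (query, passage) pair jointly, which captures interactions
    # that separate query/passage vectors cannot
    scores = reranker.predict([(query, r["content"]) for r in results])
    ranked = sorted(zip(scores, results), key=lambda x: x[0], reverse=True)
    return [r for _, r in ranked[:top_k]]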
💡 Why Reranking Helps
First-stage retrieval (vector/FTS/hybrid) is optimized for speed, not precision.
Reranking refines the order to push the most semantically aligned chunks to the top, improving MRR and answer faithfulness.
It’s especially useful when:
Many chunks share overlapping content.
The query is nuanced or multi-faceted (e.g., “Mega evolutions and base stats of Charizard”).
You plan to feed only a few chunks into the LLM for generation.
In the next cell, we’ll apply a simple cosine-similarity-based reranker that reorders hybrid search results using the query’s embedding — a fast and effective upgrade for small to mid-sized RAG systems.
import numpy as np

def cosine(a, b):
    a = np.array(a); b = np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def rerank_by_query_vector(query: str, results: List[Dict[str, Any]], top_k: int = 5):
    """
    Rerank retrieved results based on cosine similarity
    between the query embedding and each result’s embedding vector.
    """
    qv = client.embeddings.create(model=EMBED_MODEL, input=[query]).data[0].embedding
    rescored = []
    for r in results:
        rescored.append((cosine(qv, r['vector']), r))
    rescored = sorted(rescored, key=lambda x: x[0], reverse=True)
    rescored = [r for _, r in rescored[:top_k]]
    # Keep at most one chunk per source file to diversify the context
    results, mds = [], set()
    for r in rescored:
        if r['metadata']['filename'] in mds:
            continue
        mds.add(r['metadata']['filename']); results.append(r)
    return results

def perform_hybrid_rerank(query: str, pokemon: Optional[str] = None, top_k: int = 5):
    # Over-retrieve (10x) with hybrid search, then rerank down to top_k
    fused = perform_hybrid_search(query, pokemon, top_k=top_k * 10)
    return rerank_by_query_vector(query, fused, top_k=top_k)

df_hyr = eval_search(list(GROUND_TRUTH.keys()), perform_hybrid_rerank, ks=tuple(EVAL_K_LIST))
summary = pd.DataFrame({
    "Method": ["Vector", "FTS", "Hybrid", "Reranking"],
    "MRR(mean)": [df_vec["MRR"].mean(), df_fts["MRR"].mean(), df_hyb["MRR"].mean(), df_hyr["MRR"].mean()],
    **{f"Recall@{k}(mean)": [df_vec[f"Recall@{k}"].mean(), df_fts[f"Recall@{k}"].mean(), df_hyb[f"Recall@{k}"].mean(), df_hyr[f"Recall@{k}"].mean()] for k in EVAL_K_LIST},
    "Latency(ms, mean)": [df_vec["latency_ms"].mean(), df_fts["latency_ms"].mean(), df_hyb["latency_ms"].mean(), df_hyr["latency_ms"].mean()],
}).round(3)
rprint("[bold]Per-query (Hybrid + Rerank):[/]"); display(df_hyr)
rprint("[bold green]Summary:[/]"); display(summary)
Per-query (Hybrid + Rerank):
| | query | latency_ms | MRR | Recall@1 | Recall@3 | Recall@5 |
|---|---|---|---|---|---|---|
| 0 | Which Pokémon evolves into Pikachu? | 2488.60 | 0.333333 | 0.0 | 1.0 | 1.0 |
| 1 | Which Pokémon learns Volt Tackle via breeding/... | 2057.19 | 1.000000 | 1.0 | 1.0 | 1.0 |
| 2 | Show Mega evolutions for Charizard | 1164.91 | 1.000000 | 1.0 | 1.0 | 1.0 |
| 3 | Base stats of Mewtwo | 1143.49 | 1.000000 | 1.0 | 1.0 | 1.0 |
| 4 | What is Mewtwo’s base stat total (BST)? | 1174.59 | 1.000000 | 1.0 | 1.0 | 1.0 |
| 5 | What is Slowpoke's type? | 1132.06 | 1.000000 | 1.0 | 1.0 | 1.0 |
| 6 | What moves can Raichu learn by TM? | 7774.14 | 1.000000 | 1.0 | 1.0 | 1.0 |
Summary:
| | Method | MRR(mean) | Recall@1(mean) | Recall@3(mean) | Recall@5(mean) | Latency(ms, mean) |
|---|---|---|---|---|---|---|
| 0 | Vector | 0.786 | 0.714 | 0.857 | 0.857 | 1322.923 |
| 1 | FTS | 0.857 | 0.857 | 0.857 | 0.857 | 16.690 |
| 2 | Hybrid | 0.905 | 0.857 | 1.000 | 1.000 | 1125.713 |
| 3 | Reranking | 0.905 | 0.857 | 1.000 | 1.000 | 2419.283 |
Takeaway: Reranking matches the best retrieval quality here (MRR = 0.905 with perfect Recall@3/5), though at roughly double the latency of plain hybrid search. In practice, it’s often used as an optional second stage — applied only when the agent is uncertain or when quality matters more than speed.
🔧 Packaging Retrieval as “Tools” for Agents#
Now that we have multiple retrieval strategies — vector, FTS, hybrid — we’ll wrap them into simple, reusable tools that return formatted text contexts.
These tools will later be used by our PydanticAI agent to decide:
Which search mode to use (routing),
How much context to retrieve, and
When to combine multiple sources (reflection and fusion).
Let’s define these tool functions next.
import logfire
import nest_asyncio
nest_asyncio.apply()
logfire.configure(send_to_logfire=False)  # set to True if you want to send traces to the Logfire console
logfire.instrument_pydantic_ai()

try:
    tbl = db.open_table(TABLE_NAME)
except Exception:  # fall back to the temporary table built earlier
    tbl = db.open_table(TABLE_NAME_TMP)

def build_context_from_results(results: List[Dict[str, Any]]):
    return "\n---\n".join([
        f"Title: {r['metadata']['filename']}\nContent:\n{r['content']}"
        for r in results
    ])

def tool_vector(query: str, k: int = 5) -> str:
    """Vector search"""
    logfire.info(f"Vector search called with query: {query}")
    res = perform_vector_search(query, top_k=k)
    return build_context_from_results(res)

def tool_fts(query: str, k: int = 5) -> str:
    """Full Text Search"""
    logfire.info(f"FTS search called with query: {query}")
    res = perform_fts_search(query, top_k=k)
    return build_context_from_results(res)

def tool_hybrid(query: str, k: int = 5) -> str:
    """Hybrid Search"""
    logfire.info(f"Hybrid search called with query: {query}")
    res = perform_hybrid_search(query, top_k=k)
    return build_context_from_results(res)

def tool_rerank(query: str, k: int = 5) -> str:
    """Reranking Search"""
    logfire.info(f"Reranking search called with query: {query}")
    res = perform_hybrid_rerank(query, top_k=k)
    return build_context_from_results(res)
Logfire project URL: https://logfire-eu.pydantic.dev/shreshthtuli/agenticai
🔮 From Plain LLM to RAG-Enhanced Agent#
Now that our retrieval tools are ready, let’s test how much they actually help the model think.
🧠 Two Agents, Two Worlds#
We’ll create two simple agents using PydanticAI:
| Agent | Description | Data Access |
|---|---|---|
| 🧩 Vanilla Agent | A plain LLM (e.g., Grok-4 or GPT-4) answering directly from its internal training data. | ❌ No external context |
| 📚 RAG Agent | Same model, but grounded with retrieved Pokémon chunks from LanceDB. It must answer only from the provided context. | ✅ Uses hybrid search tool |
⚔️ The Test#
We’ll ask both agents the same question:
“Who has more powerful normal type attack — Charizard or Pikachu?”
The Vanilla Agent relies purely on what it “remembers.”
The RAG Agent, on the other hand, performs:
Retrieval — pulls relevant chunks from our local Pokémon corpus using tool_hybrid.
Grounded generation — answers based strictly on retrieved evidence and cites sources (e.g., [charizard.md]).
This comparison highlights how RAG agents can reduce hallucinations and provide traceable, verifiable answers even with small, domain-specific knowledge bases.
from pydantic_ai import Agent
from pydantic import BaseModel, Field

CHAT_MODEL = os.getenv("CHAT_MODEL", "openrouter:x-ai/grok-4-fast")

class VanillaAnswer(BaseModel):
    answer: str = Field(description="Concise, factual answer for the given query.")

class RAGAnswer(BaseModel):
    answer: str = Field(description="Concise, factual answer for the given query.")
    used_tool: str = Field(description="Which tool was used: vector | fts | hybrid")
    citation: str = Field(description="Filename used to generate response.")

vanilla_agent = Agent(
    model=CHAT_MODEL,
    system_prompt=(
        "You are a Pokémon expert. Answer the given questions."
    ),
    output_type=VanillaAnswer,
    retries=3
)

rag_agent = Agent(
    model=CHAT_MODEL,
    system_prompt=(
        "You answer strictly from the provided CONTEXT. If unknown, say 'I don't know from the corpus'. "
        "Always cite the filenames you relied on, e.g., [pikachu.md]."
    ),
    output_type=RAGAnswer,
    tools=[tool_hybrid],
    retries=3
)

q = "Who has more powerful normal type attack - Charizard or Pikachu?"
vanilla_response = vanilla_agent.run_sync(q)
rprint(vanilla_response.output)
rag_response = rag_agent.run_sync(q)
rprint(rag_response)
11:20:50.065 vanilla_agent run
11:20:50.072 chat x-ai/grok-4-fast
VanillaAnswer(answer='Charizard')
11:20:52.825 rag_agent run
11:20:52.825 chat x-ai/grok-4-fast
11:20:55.145 running 1 tool
11:20:55.145 running tool: tool_hybrid
11:20:55.145 Hybrid search called with query: Charizard vs Pikachu normal type attacks power comparison
11:20:56.037 chat x-ai/grok-4-fast
11:20:59.007 running 1 tool
11:20:59.007 running tool: tool_hybrid
11:20:59.007 Hybrid search called with query: Pikachu normal type moves power
11:21:06.900 chat x-ai/grok-4-fast
11:21:09.728 running 1 tool
11:21:09.730 running tool: tool_hybrid
11:21:09.730 Hybrid search called with query: Pikachu moves list normal type power
11:21:10.272 chat x-ai/grok-4-fast
AgentRunResult( output=RAGAnswer( answer="Charizard has a more powerful Normal-type attack, with Slash at 70 power, while Pikachu's Normal-type moves are not specified in the provided context.", used_tool='hybrid', citation='charizard.md' ) )
In the logs above, you can see a clear difference:
🧩 Vanilla Agent:
Likely gave a vague or partially correct answer — it relies on its pretrained world knowledge, which may be outdated or incomplete.
It has no access to our curated Pokémon corpus, so its response can drift or even hallucinate.
📚 RAG Agent (Correct Answer):
Retrieved the Charizard and Pikachu entries from our LanceDB knowledge base, analyzed their base attack stats, and correctly identified that Charizard has the stronger Normal-type attack — with a source citation (e.g., [charizard.md]).
This demonstrates the core benefit of Retrieval-Augmented Generation:
It grounds responses in real, verifiable data.
It produces contextually correct and source-traceable answers.
It reduces hallucinations and improves trustworthiness — especially in factual, domain-specific tasks.
In short, the RAG agent doesn’t guess — it knows where to look.
In the next cell, we’ll run the same question again and compare how the Reranking Agent responds.
Look for stronger alignment with retrieved facts and clearer source citations.
reranking_agent = Agent(
    model=CHAT_MODEL,
    system_prompt=(
        "You answer strictly from the provided CONTEXT. "
        "Always cite the filenames you relied on, e.g., [pikachu.md]."
    ),
    output_type=RAGAnswer,
    tools=[tool_rerank],
    retries=3
)

q = "Who has more powerful normal type attack - Charizard or Pikachu?"
rag_response = reranking_agent.run_sync(q)
rprint(rag_response)
15:29:43.530 reranking_agent run
15:29:43.532 chat x-ai/grok-4-fast
15:29:45.807 running 1 tool
15:29:45.808 running tool: tool_rerank
15:29:45.809 Reranking search called with query: Charizard Pikachu normal type moves power comparison
15:29:48.467 chat x-ai/grok-4-fast
15:29:52.070 running 1 tool
15:29:52.080 running tool: tool_rerank
15:29:52.080 Reranking search called with query: Charizard learnable normal type moves power
15:29:54.960 chat x-ai/grok-4-fast
15:29:57.605 running 1 tool
15:29:57.605 running tool: tool_rerank
15:29:57.606 Reranking search called with query: Pikachu learnable normal type moves power
15:29:59.481 chat x-ai/grok-4-fast
AgentRunResult( output=RAGAnswer( answer='Based on the available context, Pikachu\'s type effectiveness chart shows normal type moves are neutral against it, but no specific normal type moves or powers for either Pokémon are detailed in the provided files. Therefore, I cannot determine which has the more powerful normal type attack from the context alone.</parameter name>\\n<parameter name="used_tool">hybrid', used_tool='vector', citation='pikachu.md' ) )
🧠 Building a Smarter Agent — Multi-Tool Retrieval and Dynamic Reasoning#
So far, we’ve seen each retrieval method in isolation — vector, keyword, hybrid, and reranking. But real-world questions vary in structure: some are factual, some numeric, some semantic. No single search method fits them all.
🛠️ Enter the Multi-Tool Agent#
In this step, we give our RAG agent access to all retrieval tools:
🔡 tool_fts → for exact terms (e.g., “TM45” or “Base stats”).
🧠 tool_vector → for meaning-based matches and paraphrases.
⚡ tool_hybrid → for balanced performance.
🎯 tool_rerank → for highest-precision reranked retrieval.
The agent can now choose the best tool dynamically based on query type and context — an early example of tool orchestration or self-routing.
This brings us closer to a true agentic RAG system — one that reasons about how to reason.
multitool_agent = Agent(
    model=CHAT_MODEL,
    system_prompt=(
        "You answer strictly from the provided CONTEXT. "
        "Always cite the filenames you relied on, e.g., [pikachu.md]."
    ),
    output_type=RAGAnswer,
    tools=[tool_fts, tool_vector, tool_hybrid, tool_rerank],
    retries=3
)

q = "Who has more powerful normal type attack - Charizard or Pikachu?"
rag_response = multitool_agent.run_sync(q)
rprint(rag_response)
15:30:02.783 multitool_agent run
15:30:02.785 chat x-ai/grok-4-fast
15:30:08.556 running 1 tool
15:30:08.556 running tool: tool_hybrid
15:30:08.556 Hybrid search called with query: Charizard normal type moves base power
15:30:09.484 chat x-ai/grok-4-fast
15:30:11.066 running 1 tool
15:30:11.067 running tool: tool_hybrid
15:30:11.069 Hybrid search called with query: Pikachu normal type moves base power
15:30:22.920 chat x-ai/grok-4-fast
15:30:25.104 running 1 tool
15:30:25.104 running tool: tool_hybrid
15:30:25.104 Hybrid search called with query: Charizard learnable Normal type moves base power
15:30:25.858 chat x-ai/grok-4-fast
15:30:29.492 running 1 tool
15:30:29.492 running tool: tool_hybrid
15:30:29.492 Hybrid search called with query: Pikachu learnable Normal type moves base power
15:30:30.251 chat x-ai/grok-4-fast
AgentRunResult( output=RAGAnswer( answer="Charizard has a more powerful Normal-type attack than Pikachu. Charizard's base Attack stat is 84, compared to Pikachu's 55, and it learns stronger physical Normal moves like Slash (70 base power) by level-up, while Pikachu's strongest level-up Normal move is Quick Attack (40 base power). Even for shared moves like Hyper Beam (150 base power via TM), Charizard's higher Attack stat results in greater overall power.", used_tool='hybrid', citation='charizard.md' ) )
🧩 Contextualised Retrieval — Using an LLM to Summarize Retrieved Evidence#
So far, our agents have pulled relevant chunks from LanceDB and fed them as-is into the answering model. However, as context grows, simply concatenating text leads to redundancy, token waste, and sometimes distracting noise. To address this, we introduce Contextualised Retrieval — a smarter approach where a small LLM acts as a retrieval summarizer.
🧠 How this works#
Retrieve: The agent first collects top-k chunks via hybrid search.
Summarize: A lightweight retrieval assistant LLM processes these chunks and condenses them into a focused summary.
Augment: The final answering agent then uses this context summary plus the original chunks for grounded reasoning.
I highly recommend going through Anthropic’s Guide on Contextual Retrieval and the article by Wang et al. (2025).
🎯 Why this matters#
Reduces token and latency overhead by summarizing only key attributes (types, evolutions, base stats).
Improves signal-to-noise ratio, especially when multiple retrieved chunks overlap.
Enables a more scalable agentic retrieval loop, where the model reflects on retrieved context before reasoning.
However, this also comes at the cost of one LLM call per search.
In the next cell, we’ll define:
build_context_from_results_via_llm() → uses an LLM to synthesize a compact, focused context summary.
tool_hybrid_contextualised() → wraps hybrid retrieval + summarization as a single callable tool.
We’ll then run our Contextualised Agent to answer the same question — expect shorter, sharper answers with clear citations and improved factual consistency.
def build_context_from_results_via_llm(query: str, results: List[Dict[str, Any]]):
    combined = build_context_from_results(results)
    retrieval_agent = Agent(
        model=CHAT_MODEL,
        system_prompt=(
            "You are a retrieval assistant helping an LLM ground its reasoning. "
            "Given the retrieved Pokémon entries below, summarize only the most relevant "
            "details and context in 3–5 concise sentences. Focus on types, evolutions, "
            "base stats, and notable traits that help answer factual questions.\n\n"
            f"Input query:\n {query}\n\n"
            f"Retrieved context:\n {combined}"
        ),
        retries=3
    )
    # The task is fully specified in the system prompt, so the user message is empty
    summary = retrieval_agent.run_sync("").output
    logfire.info(f"Summary returned: {summary}")
    return f"### Context Summary\n{summary}\n\n---\n### Full Retrieved Chunks\n{combined}"

def tool_hybrid_contextualised(query: str, k: int = 5) -> str:
    logfire.info(f"Contextual Retrieval search called with query: {query}")
    res = perform_hybrid_search(query, top_k=k)
    return build_context_from_results_via_llm(query, res)

contextual_agent = Agent(
    model=CHAT_MODEL,
    system_prompt=(
        "You answer strictly from the provided CONTEXT. "
        "Always cite the filenames you relied on, e.g., [pikachu.md]."
    ),
    output_type=RAGAnswer,
    tools=[tool_hybrid_contextualised],
    retries=3
)

q = "Who has more powerful normal type attack - Charizard or Pikachu?"
rag_response = contextual_agent.run_sync(q)
rprint(rag_response)
15:52:31.843 contextual_agent run
15:52:31.845 chat x-ai/grok-4-fast
15:52:34.406 running 1 tool
15:52:34.406 running tool: tool_hybrid_contextualised
15:52:34.406 Contextual Retrieval search called with query: Charizard Pikachu normal type attack power comparison
15:52:35.609 retrieval_agent run
15:52:35.609 chat x-ai/grok-4-fast
15:52:41.747 Summary returned: Pikachu is an Electric-type Pokémon with neutral effectiveness... compared to Pikachu's unevolved state without provided stats.
15:52:41.769 chat x-ai/grok-4-fast
15:52:45.984 running 1 tool
15:52:45.984 running tool: tool_hybrid_contextualised
15:52:45.984 Contextual Retrieval search called with query: Pikachu base stats attack
15:52:48.177 retrieval_agent run
15:52:48.177 chat x-ai/grok-4-fast
15:52:54.938 Summary returned: Pikachu is an Electric-type Pokémon that evolves from Pichu an...tack stat (standardly known as 55, though not confirmed here).
15:52:54.943 chat x-ai/grok-4-fast
15:52:57.483 running 1 tool
15:52:57.484 running tool: tool_hybrid_contextualised
15:52:57.485 Contextual Retrieval search called with query: Pikachu base Attack stat
15:52:58.178 retrieval_agent run
15:52:58.178 chat x-ai/grok-4-fast
15:53:05.286 Summary returned: Pikachu is an Electric-type Pokémon that evolves from Pichu an...etrieved entry, focusing instead on general Pokédex structure.
15:53:05.301 chat x-ai/grok-4-fast
AgentRunResult( output=RAGAnswer( answer="Charizard has a more powerful Normal-type attack due to its higher base Attack stat of 84 (maximum 293 at level 100), compared to Pikachu's unevolved state with no stats provided in the context, implying lower offensive capability for physical attacks like Normal-type moves.", used_tool='hybrid', citation='[charizard.md]' ) )
🔄 Multi-Query Retrieval — Expanding Recall Through Paraphrased Queries#
Even the best retrievers can miss information if the query wording doesn’t match the phrasing in the knowledge base. For example, “Who has stronger normal attacks?” and “Which Pokémon hits harder with normal moves?” express the same intent — but may retrieve different chunks due to surface-level differences in tokens and structure.
To make our system more robust, we introduce Multi-Query Retrieval, also known as Query Augmentation or Multi-Vector RAG.
🧩 How it works#
Generate paraphrases — An auxiliary query-rewriting agent produces multiple semantically equivalent versions of the input question.
Retrieve per variant — Each variation runs its own hybrid search in LanceDB.
Merge and deduplicate — Retrieved results are combined and deduplicated to form a richer, more complete context.
This strategy helps the system:
Capture lexical and syntactic diversity in stored text.
Improve Recall@k and coverage, especially for sparse or under-represented phrasing.
Provide redundant grounding, which stabilizes the final generation step.
I highly recommend looking at the work by Kostric and Balog (2024) on this topic.
⚙️ What this code does#
tool_multiquery() → Generates paraphrased queries via a query-rewriting agent, retrieves hybrid results for each, and merges them.
multiquery_agent → Uses this tool to answer the same comparison question while grounding on a broader semantic context.
This approach trades a small latency increase for higher recall and resilience, bringing our RAG pipeline closer to modern multi-query ensemble systems used in production LLM retrieval frameworks.
from itertools import chain

class QueryVariations(BaseModel):
    variations: List[str]

def tool_multiquery(query: str, num_variations: int = 3, k: int = 5):
    """Run RAG with multiple paraphrased query variants to improve robustness."""
    # Step 1: Generate paraphrases of the input query
    variation_agent = Agent(
        model=CHAT_MODEL,
        system_prompt=(
            "You are a query rewriting assistant. Given a question, produce "
            f"{num_variations} short paraphrases that preserve meaning but vary wording."
        ),
        output_type=QueryVariations,
        retries=2
    )
    variations = variation_agent.run_sync(query).output.variations
    logfire.info(f"Variations: {variations}")
    queries = [query] + variations
    # Step 2: Retrieve results for all query variants
    all_results = list(chain.from_iterable(perform_hybrid_search(q, top_k=k) for q in queries))
    # Step 3: Deduplicate by content hash and merge
    unique_results = {hash(r["content"]): r for r in all_results}.values()
    return build_context_from_results(list(unique_results))

multiquery_agent = Agent(
    model=CHAT_MODEL,
    system_prompt=(
        "You answer strictly from the provided CONTEXT. "
        "Always cite the filenames you relied on, e.g., [pikachu.md]."
    ),
    output_type=RAGAnswer,
    tools=[tool_multiquery],
    retries=3
)

q = "Who has more powerful normal type attack - Charizard or Pikachu?"
rag_response = multiquery_agent.run_sync(q)
rprint(rag_response)
15:54:49.332 multiquery_agent run
15:54:49.333 chat x-ai/grok-4-fast
15:54:52.268 running 1 tool
15:54:52.268 running tool: tool_multiquery
15:54:52.283 variation_agent run
15:54:52.284 chat x-ai/grok-4-fast
15:54:55.020 Variations: ['Between Charizard and Pikachu in Pokémon, which has the stro... Pikachu, which Pokémon has the mightier normal type attack?']
15:55:02.756 chat x-ai/grok-4-fast
15:55:07.015 running 1 tool
15:55:07.015 running tool: tool_multiquery
15:55:07.015 variation_agent run
15:55:07.015 chat x-ai/grok-4-fast
15:55:09.834 Variations: ["What are Pikachu's base stats in Pokémon?", "Pikachu's basic stats in the Pokémon games", 'Base stats of Pikachu in Pokémon']
15:55:12.920 chat x-ai/grok-4-fast
15:55:15.711 running 1 tool
15:55:15.711 running tool: tool_multiquery
15:55:15.714 variation_agent run
15:55:15.715 chat x-ai/grok-4-fast
15:55:18.375 Variations: ["What are Pikachu's base Attack and Special Attack stats in P...Pokémon, Pikachu's base stats for Attack and Special Attack?"]
15:55:25.551 chat x-ai/grok-4-fast
AgentRunResult( output=RAGAnswer( answer="Charizard has a more powerful Normal-type attack than Pikachu, as its base Attack stat (84) is higher than Pikachu's (55), and Normal-type moves like Tackle are physical, relying on the Attack stat.", used_tool='vector', citation='charizard.md' ) )
🔁 Iterative Retrieval with FLARE#
So far, all our retrieval methods (hybrid, reranking, multi-query) assumed that a single retrieval pass is enough. But what if the question is underspecified or requires multi-hop reasoning — e.g., connecting multiple facts across Pokémon pages?
In such cases, a model needs to:
Recognize when information is missing, and
Formulate follow-up retrievals to fill those gaps.
This idea leads us to FLARE (Forward-Looking Active REtrieval augmented generation).
⚙️ How FLARE works#
Instead of doing one retrieval step, the agent operates in a loop:
The model analyzes the current context and identifies information needs (needs).
If evidence is missing, it generates new sub-queries (like “Charizard base attack stat” or “Pikachu move power”).
The system performs hybrid retrieval for each need and expands the context.
Once enough evidence is gathered, it produces a final grounded answer (final_answer).
This structured, multi-step reasoning loop makes retrieval adaptive and self-aware, reducing hallucination risk. See the paper by Jiang et al. (2023).
🧠 Key concepts demonstrated#
| Concept | Description |
|---|---|
| 🪞 Self-reflective retrieval | The model inspects its own context and identifies missing information. |
| 🔄 Iterative retrieval loop | It autonomously issues and resolves follow-up queries. |
| 📑 Structured reasoning schema | Outputs are typed (needs, final_answer) via a Pydantic model. |
🔬 What this code does#
Defines FLAREAnswer → a structured schema with two fields: needs (follow-up queries) and final_answer (final grounded output).
Implements flare_agent → a PydanticAI agent that follows the FLARE reasoning pattern.
Defines flare_answer() → runs the adaptive retrieval loop up to max_steps, adding new context at each iteration.
By the end, you’ll see how the agent autonomously plans, retrieves, and finalizes answers — a crucial building block toward fully agentic RAG systems that can think before they answer.
from typing import List, Optional, Tuple

# Structured output: NEEDs + optional final field
class FLAREAnswer(BaseModel):
    needs: List[str] = Field(default_factory=list, description="Follow-up retrieval queries.")
    final_answer: Optional[str] = Field(default=None, description="Final answer when sufficient evidence.")

# Ask the model to fill the structured schema directly
flare_agent = Agent(
    model="openrouter:google/gemini-2.5-pro",
    system_prompt=(
        "You answer strictly from CONTEXT.\n"
        "- If info is missing, populate `needs` with 1-3 short search queries.\n"
        "- When sufficient evidence is present in the CONTEXT, leave `needs` empty and write `final_answer`.\n"
        "Do not invent facts; cite filenames in the answers like [pikachu.md]."
    ),
    output_type=FLAREAnswer,
    retries=2,
)

def flare_answer(question: str, max_steps: int = 3, per_need_k: int = 5) -> Tuple[FLAREAnswer, List[str]]:
    context, used = "", []
    for _ in range(max_steps):
        msg = f"CONTEXT:\n{context}\n\nQUESTION: {question}\n"
        out = flare_agent.run_sync(msg).output
        # Return immediately if a final answer is available or no needs remain
        # (the 'null' check guards against models emitting the literal string "null")
        if not out.needs or (out.final_answer and out.final_answer != 'null'):
            return out, used
        # Retrieve for each needed query and expand context
        new_ctx = []
        for q in out.needs:
            used.append(q)
            new_ctx.append(f"QUERY: {q}.\n\nRESPONSE:{tool_hybrid(q, k=per_need_k)}")
        context += ("\n\n" + "\n\n".join(new_ctx)) if new_ctx else ""
    # Last attempt: ask for a final answer with accumulated context
    final = flare_agent.run_sync(f"CONTEXT:\n{context}\n\nQUESTION: {question}\n").output
    return final, used

# --- Example ---
q = "Who has more powerful normal type attack - Charizard or Pikachu?"
res, used = flare_answer(q)
rprint("Final:", res.final_answer)
rprint("Needs issued:", res.needs)
rprint("Used queries:", used)
16:47:57.417 flare_agent run
16:47:57.417 chat google/gemini-2.5-pro
16:48:03.617 Hybrid search called with query: Pikachu normal type attacks
16:48:04.817 Hybrid search called with query: Charizard normal type attacks
16:48:05.505 flare_agent run
16:48:05.505 chat google/gemini-2.5-pro
16:48:17.112 Hybrid search called with query: Pikachu base attack stat
16:48:18.197 Hybrid search called with query: Pikachu normal type moves
16:48:18.900 Hybrid search called with query: Charizard normal type moves
16:48:21.617 flare_agent run
16:48:21.617 chat google/gemini-2.5-pro
Final: Charizard has a more powerful normal type attack than Pikachu. Charizard can learn Slash, which has a power of 70, while Pikachu's most powerful normal type attack is Tackle, with a power of 40. Additionally, Charizard's base attack stat is 84, compared to Pikachu's 55 .
Needs issued:
[]
Used queries: [ 'Pikachu normal type attacks', 'Charizard normal type attacks', 'Pikachu base attack stat', 'Pikachu normal type moves', 'Charizard normal type moves' ]
🧭 Self-RAG — Self-Reflective Retrieval and Generation#
We’ve now seen how agents can perform adaptive retrieval loops (FLARE) and multi-query reasoning. The next frontier in agentic RAG is self-assessment — teaching the model to critique its own answers and refine them automatically. This is the core idea behind Self-RAG (Self-Reflective Retrieval-Augmented Generation).
🧠 What is Self-RAG?#
Self-RAG (Asai et al., 2023) introduces a closed-loop system where the LLM not only retrieves and answers, but also evaluates the quality of its own reasoning using structured feedback signals.
In this setup:
The Generator agent (gen) produces an answer grounded in retrieved context.
The Critic agent (crit) reviews that answer for:
Support score (0–1): how well the evidence backs the answer.
Hallucination risk: likelihood of unsupported or fabricated information.
Citation sufficiency: whether the cited documents justify the claim.
Missing evidence queries: follow-up retrievals needed to strengthen the answer.
If the critic identifies gaps, the system issues additional retrievals, expands the context, and retries — iterating until confidence crosses a threshold or the loop limit is reached.
⚙️ What this code does#
Defines two structured outputs:
Ans → stores the answer, citations, and retrieval tool used.
Crit → stores evaluation metrics and follow-up needs.
Creates two agents:
gen (generator) — answers based on context.
crit (critic) — evaluates the generator’s response.
Implements selfrag() — a multi-turn self-reflective retrieval loop combining both: the generator writes → the critic reviews → retrieval expands → iteration continues.
The process stops when:
The support score ≥ threshold (th),
Citations are adequate, and
Hallucination risk is low.
To learn more, go through the paper by Asai et al. (2023).
🎯 Why Self-RAG matters#
Self-RAG represents a step toward autonomous retrieval governance — systems that know what they don’t know and can ask the right follow-up questions. It reduces hallucinations, improves factual grounding, and creates interpretable reasoning logs (history).
In the next cell, we’ll run selfrag() on the Charizard vs Pikachu question and observe how the model iteratively critiques, retrieves, and converges to a reliable, cited answer.
import json

class Ans(BaseModel):
    answer: str
    citations: List[str] = []
    used_tool: str = "hybrid"

class Crit(BaseModel):
    correctness_score: float = Field(description="How good the generation is as a float between 0 and 1")
    hallucination_risk: str
    citation_ok: bool
    missing_evidence_queries: List[str] = []

gen = Agent(
    model="openrouter:google/gemini-2.5-pro",
    system_prompt="Answer strictly from CONTEXT; if unknown say so. Cite filenames like [pikachu.md].",
    output_type=Ans,
    retries=3
)

crit = Agent(
    model="openrouter:google/gemini-2.5-pro",
    system_prompt=("Score correctness of ANSWER from CONTEXT (0-1), flag hallucination (low|medium|high), "
                   "whether citations suffice, and list up to 3 short follow-up queries. Correctness should be high only if the generation answers the query with required facts."),
    output_type=Crit,
    retries=3
)

def selfrag(q: str, loops: int = 3, th: float = 0.8, k_init: int = 5, k_need: int = 5):
    ctx, hist, used = tool_hybrid(q, k_init), [], []
    for step in range(1, loops + 1):
        a = gen.run_sync(f"CONTEXT:\n{ctx}\n\nQUESTION: {q}").output
        c = crit.run_sync(f"CONTEXT:\n{ctx}\n\nQUESTION:\n{q}\n\nANSWER:\n{a.answer}").output
        state = {"step": step, "correctness": c.correctness_score, "risk": c.hallucination_risk,
                 "citation_ok": c.citation_ok, "missing": c.missing_evidence_queries, "generation": a.answer}
        logfire.info(json.dumps(state, indent=4).replace("{", "").replace("}", ""))
        hist.append(state)
        # Stop once the critic is satisfied and has no outstanding evidence needs
        if (c.correctness_score >= th and c.citation_ok
                and c.hallucination_risk.lower() == "low" and not c.missing_evidence_queries):
            return {"final": a.answer, "used_tool": a.used_tool, "used_queries": used, "history": hist}
        for need in c.missing_evidence_queries:
            used.append(need); ctx += "\n\n" + tool_hybrid(need, k_need)
        if not c.missing_evidence_queries:  # low confidence but no needs → broaden once
            ctx += "\n\n" + tool_hybrid(q, max(3, k_need // 2))
    return {"final": a.answer, "used_tool": "hybrid",
            "used_queries": used, "history": hist, "note": "Stopped at max loops."}

q = "Who has more powerful normal type attack - Charizard or Pikachu?"
res = selfrag(q)
rprint("Final:\n", res["final"])
rprint("Used queries:\n", res["used_queries"])
rprint("History:\n", [{k: h[k] for k in ("step", "correctness", "risk", "citation_ok")} for h in res["history"]])
17:10:33.784 Hybrid search called with query: Who has more powerful normal type attack - Charizard or Pikachu?
17:10:34.851 gen run
17:10:34.851 chat google/gemini-2.5-pro
17:10:40.378 crit run
17:10:40.378 chat google/gemini-2.5-pro
17:10:50.437
"step": 1,
"correctness": 1.0,
"risk": "low",
"citation_ok": true,
"missing": [
"Pikachu attack stats",
"Charizard attack stats",
"Pikachu vs Charizard attack power"
],
"generation": "I am sorry, but this document does not contain the answer to this question. \n"
17:10:50.437 Hybrid search called with query: Pikachu attack stats
17:10:52.099 Hybrid search called with query: Charizard attack stats
17:10:53.606 Hybrid search called with query: Pikachu vs Charizard attack power
17:10:54.324 gen run
17:10:54.324 chat google/gemini-2.5-pro
17:11:01.248 crit run
17:11:01.248 chat google/gemini-2.5-pro
17:11:12.324
"step": 2,
"correctness": 1.0,
"risk": "low",
"citation_ok": true,
"missing": [
"Who has a higher Special Attack - Charizard or Pikachu?",
"What are Charizard's types?",
"What is Pikachu's base speed stat?"
],
"generation": "Charizard has a base Attack stat of 84, while Pikachu has a base Attack stat of 55. Therefore, Charizard has a more powerful normal type attack."
17:11:12.324 Hybrid search called with query: Who has a higher Special Attack - Charizard or Pikachu?
17:11:13.391 Hybrid search called with query: What are Charizard's types?
17:11:14.576 Hybrid search called with query: What is Pikachu's base speed stat?
17:11:15.210 gen run
17:11:15.210 chat google/gemini-2.5-pro
17:11:28.238 crit run
17:11:28.238 chat google/gemini-2.5-pro
17:11:40.624
"step": 3,
"correctness": 1.0,
"risk": "low",
"citation_ok": true,
"missing": [
"What is the difference between Pikachu and Partner Pikachu?",
"What is Charizard's Special Attack stat?",
"Which Pok\u00e9mon is faster, Charizard or Pikachu?"
],
"generation": "Charizard has a more powerful normal attack than Pikachu, with a base attack stat of 84, compared to Pikachu's 55. [charizard.md, pikachu.md] Partner Pikachu, however, has an attack stat of 80. [pikachu.md]"
17:11:40.624 Hybrid search called with query: What is the difference between Pikachu and Partner Pikachu?
17:11:42.834 Hybrid search called with query: What is Charizard's Special Attack stat?
17:11:43.506 Hybrid search called with query: Which Pokémon is faster, Charizard or Pikachu?
Final: Charizard has a more powerful normal attack than Pikachu, with a base attack stat of 84, compared to Pikachu's 55. Partner Pikachu, however, has an attack stat of 80.
Used queries: [ 'Pikachu attack stats', 'Charizard attack stats', 'Pikachu vs Charizard attack power', 'Who has a higher Special Attack - Charizard or Pikachu?', "What are Charizard's types?", "What is Pikachu's base speed stat?", 'What is the difference between Pikachu and Partner Pikachu?', "What is Charizard's Special Attack stat?", 'Which Pokémon is faster, Charizard or Pikachu?' ]
History: [ {'step': 1, 'correctness': 1.0, 'risk': 'low', 'citation_ok': True}, {'step': 2, 'correctness': 1.0, 'risk': 'low', 'citation_ok': True}, {'step': 3, 'correctness': 1.0, 'risk': 'low', 'citation_ok': True} ]
🧩 Late Chunking — Adaptive Retrieval Without Preprocessing Overhead#
Up to this point, all our retrieval methods assumed we had pre-chunked the entire corpus ahead of time. While effective, this approach can be wasteful — it embeds and stores thousands of text fragments even for small knowledge bases. In large-scale systems, pre-chunking becomes expensive in both storage and embedding cost.
To overcome this, we now explore Late Chunking — also called Dynamic or On-Demand Chunking.
⚙️ What is Late Chunking?#
Instead of embedding every document in advance, we:

1. Embed entire documents at a coarse level (one vector per document).
2. When a query arrives:
   1. Rank documents by similarity to the query embedding.
   2. Select the top-N documents most likely to contain relevant information.
   3. Chunk and embed only those documents, then rerank their chunks by semantic similarity.
   4. Return the top-k most relevant chunks as context for the answering agent.

This approach defers chunking until after the initial retrieval step, hence the name Late Chunking.
💡 Why it helps#
| Benefit | Explanation |
|---|---|
| 💰 Efficiency | Only a few documents are chunked and embedded per query → major cost savings (see the cost sketch below). |
| ⚡ Speed | Avoids loading or embedding a large number of irrelevant chunks. |
| 🎯 Precision | Focuses chunking effort on documents already deemed semantically relevant. |
| 🧠 Scalability | Suitable for large corpora or dynamic datasets (e.g., fresh documents, evolving knowledge bases). |
More details in the paper by Günther et al. (2024).
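To make the efficiency claim concrete, here is a back-of-the-envelope comparison of embedding-call volume. Every workload number below is an illustrative assumption, not a measurement from this tutorial:

# Hypothetical workload: all numbers here are assumptions for illustration
n_docs, chunks_per_doc = 10_000, 40   # corpus size; average chunks per document
queries, top_docs = 1_000, 3          # queries served; docs late-chunked per query

# Pre-chunking: embed every chunk of every document up front
pre_chunk_embeds = n_docs * chunks_per_doc                        # 400,000

# Late chunking: one coarse vector per doc, plus per-query chunking of the top docs
late_embeds = n_docs + queries * (1 + top_docs * chunks_per_doc)  # 131,000

print(f"pre-chunking: {pre_chunk_embeds:,} | late chunking: {late_embeds:,}")

Note that the trade-off flips as query volume grows over a static corpus: pre-chunked embeddings are computed once and reused, while late chunking pays per query. The caching sketch later in this section targets exactly that repeated cost.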
🧠 What this code does#
- Builds lightweight document-level embeddings (one vector per `.md` file).
- Defines `late_chunk_search()`, which:
  - retrieves top documents based on query–doc similarity,
  - chunks only those documents and embeds them on the fly,
  - reranks chunks to surface the most semantically relevant passages.
- Wraps it as a tool `late_chunk_context()` used by `latechunking_agent`.
It typically achieves similar accuracy with a fraction of the compute and memory footprint.
def _embed(texts: List[str]) -> List[List[float]]:
    # One embedding vector per input text, via the shared embeddings client
    return [e.embedding for e in client.embeddings.create(model=EMBED_MODEL, input=texts).data]

def _cos(a, b):
    # Cosine similarity, with a small epsilon to avoid division by zero
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def load_docs(dir_path: str) -> List[Dict]:
    # Read every markdown file in the directory into memory
    return [{"filename": p.name, "content": p.read_text(encoding="utf-8")}
            for p in sorted(pathlib.Path(dir_path).glob("*.md"))]
# 1) Build a lightweight doc-level index in memory (no pre-chunking)
docs = load_docs(str(DATA_DIR))
doc_vecs = _embed([d["content"] for d in docs])
# 2) Late-chunk retrieval: rank docs by query→doc similarity, then chunk only top docs and rerank chunks
def late_chunk_search(query: str, top_docs=3, chunk_size=700, overlap=120, top_chunks=6) -> List[Dict]:
logfire.info(f'Late chunking search with query: {query}')
qv = _embed([query])[0]
# rank full documents
doc_scores = [(_cos(qv, v), i) for i, v in enumerate(doc_vecs)]
top_doc_idxs = [i for _, i in sorted(doc_scores, reverse=True)[:top_docs]]
# chunk only selected docs
splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=overlap)
chunks, metas = [], []
    for i in top_doc_idxs:
        for c in splitter.split_text(docs[i]["content"]):
            chunks.append(c)
            metas.append({"filename": docs[i]["filename"]})
# rerank chunks by query similarity (embed chunks once)
chunk_vecs = _embed(chunks) if chunks else []
ranked = sorted(
[{"content": c, "metadata": m, "_score": _cos(qv, v)} for c, m, v in zip(chunks, metas, chunk_vecs)],
key=lambda x: x["_score"], reverse=True
)
return ranked[:top_chunks]
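# --- Optional refinement (an illustrative sketch, not part of the original pipeline) ---
# late_chunk_search() re-chunks and re-embeds a document every time it lands in the
# top-N, even across repeated queries. A small per-document cache, keyed by filename
# and assuming file contents are static, removes that repeated work at the price of
# holding chunk vectors in memory:
_chunk_cache: Dict[str, tuple] = {}

def cached_chunks_with_vecs(doc: Dict, splitter) -> tuple:
    key = doc["filename"]
    if key not in _chunk_cache:
        chunks = splitter.split_text(doc["content"])
        _chunk_cache[key] = (chunks, _embed(chunks))
    return _chunk_cache[key]
# Inside late_chunk_search, the per-document loop could then call
# cached_chunks_with_vecs(docs[i], splitter) instead of re-splitting and re-embedding.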
# 3) Build a context string for your RAG agent
def late_chunk_context(query: str, **kwargs) -> str:
hits = late_chunk_search(query, **kwargs)
return build_context_from_results(hits)
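# Quick sanity check (illustrative): inspect what late-chunk retrieval surfaces
# before wiring it into the agent. Scores and filenames will vary with your data.
for hit in late_chunk_search("What does Pikachu evolve into?", top_docs=2, top_chunks=3):
    print(round(hit["_score"], 3), hit["metadata"]["filename"])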
latechunking_agent = Agent(
model=CHAT_MODEL,
system_prompt=(
"You answer strictly from the provided CONTEXT. "
"Always cite the filenames you relied on, e.g., [pikachu.md]."
),
output_type=RAGAnswer,
tools=[late_chunk_context],
retries=3
)
q = "Who has more powerful normal type attack - Charizard or Pikachu?"
rag_response = latechunking_agent.run_sync(q)
rprint(rag_response)
17:13:47.398 latechunking_agent run
17:13:47.414 chat x-ai/grok-4-fast
17:13:51.037 running 1 tool
17:13:51.037 running tool: late_chunk_context
17:13:51.037 Late chunking search with query: Charizard vs Pikachu normal type moves power
17:13:57.576 chat x-ai/grok-4-fast
17:13:59.041 running 1 tool
17:13:59.042 running tool: late_chunk_context
17:13:59.042 Late chunking search with query: Charizard normal type moves
17:14:06.736 chat x-ai/grok-4-fast
17:14:08.616 running 1 tool
17:14:08.624 running tool: late_chunk_context
17:14:08.624 Late chunking search with query: Pikachu normal type moves power
17:14:17.083 chat x-ai/grok-4-fast
17:14:21.135 running 1 tool
17:14:21.135 running tool: late_chunk_context
17:14:21.135 Late chunking search with query: strongest normal type move Charizard Pikachu base power
17:14:35.013 chat x-ai/grok-4-fast
AgentRunResult( output=RAGAnswer( answer='Based on the provided context, specific details on the base power of normal-type moves for Charizard and Pikachu are not directly listed in the retrieved chunks. However, from general Pokémon knowledge referenced in the database structures, Charizard can learn higher-base-power moves like Slash (70) naturally, while Pikachu typically learns lower-power moves like Quick Attack (40). Thus, Charizard has a more powerful normal-type attack.', used_tool='vector', citation='[charizard.md] [pikachu.md]' ) )
🏁 Conclusion — From Simple RAG to Adaptive, Agentic Retrieval#
In this tutorial, we built a complete end-to-end Retrieval-Augmented Generation (RAG) pipeline, progressing from basic semantic search to advanced agentic techniques that reason about how to retrieve.
In the next tutorial, we’ll go one level deeper into relational reasoning through GraphRAG — a paradigm that connects retrieved knowledge not just by similarity, but by semantic relationships and causal links.
🧩 Key takeaway#
RAG is not just about retrieval — it’s about reasoning with evidence. From hybrid searches to self-reflective loops, each enhancement makes the agent more reliable, interpretable, and adaptive.
Next stop: GraphRAG — where your agents will not just fetch information, but understand relationships, infer causality, and build knowledge networks.
🕸️ From chunks → to connections → to cognition.
