04. The Art and Science of Data for LLMs#
⚡Compute Note: You can run this notebook on CPU.
Welcome to the fourth part of our series! So far, we have built and optimized a GPT model from scratch. However, a powerful model architecture is only half of the equation. The other, arguably more important, half is the data it’s trained on.
The principle of “garbage in, garbage out” has never been more true than in the age of LLMs. The quality, diversity, and cleanliness of your training data directly determine your model’s capabilities, its biases, and its failure modes. In this notebook, we will explore:
What large-scale datasets look like: We’ll load a subset of a massive web-scraped dataset.
Common data quality issues: We’ll inspect raw data to find boilerplate, code, and other artifacts.
Filtering and cleaning techniques: We’ll discuss and implement simple heuristics to improve data quality.
The impact of curation: We’ll compare our raw dataset to a highly filtered one to see the difference.
For this tutorial, we’ll rely heavily on the 🤗 datasets library, which is the standard for accessing and processing massive datasets efficiently.
I highly recommend that readers also go through the data lectures from Stanford CS336.
Setup#
First, let’s install the necessary libraries. We need datasets to download and process our data, and matplotlib for visualization.
%pip install datasets matplotlib
import datasets
from datasets import load_dataset
import matplotlib.pyplot as plt
import re
# Set default figure size for plots
plt.rcParams['figure.figsize'] = (10, 6)
1. Exploring a Raw Web Dataset: C4#
The Colossal Clean Crawled Corpus (C4) was created by Google for training its T5 models. It is derived from Common Crawl, a massive scrape of the public web, and has undergone some basic cleaning (such as dropping pages that contain blocklisted words and deduplicating repeated text). However, it’s still considered relatively “raw” compared to more modern, heavily curated datasets.
Let’s load a small part of the C4 dataset to see what it looks like. We’ll use streaming=True to avoid downloading the entire dataset, which is several terabytes!
# Load the C4 dataset in streaming mode
c4_dataset = load_dataset("allenai/c4", "en", streaming=True, split='train')

# Let's look at the first few examples
print("--- Raw C4 Dataset Examples ---")
for i, example in enumerate(c4_dataset.take(5)):
    print(f"\n--- Example {i+1} ---")
    # Print the first 500 characters of the text
    print(example['text'][:500])
--- Raw C4 Dataset Examples ---
--- Example 1 ---
Beginners BBQ Class Taking Place in Missoula!
Do you want to get better at making delicious BBQ? You will have the opportunity, put this on your calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers. He will be teaching a beginner level class for everyone who wants to get better with their culinary skills.
He will teach you everything you need to know to compete in a KCBS BBQ competition, including techniques, recipes, timelines, meat select
--- Example 2 ---
Discussion in 'Mac OS X Lion (10.7)' started by axboi87, Jan 20, 2012.
I've got a 500gb internal drive and a 240gb SSD.
When trying to restore using disk utility i'm given the error "Not enough space on disk ____ to restore"
But I shouldn't have to do that!!!
Any ideas or workarounds before resorting to the above?
Use Carbon Copy Cloner to copy one drive to the other. I've done this several times going from larger HDD to smaller SSD and I wound up with a bootable SSD drive. One step you have to
--- Example 3 ---
Foil plaid lycra and spandex shortall with metallic slinky insets. Attached metallic elastic belt with O-ring. Headband included. Great hip hop or jazz dance costume. Made in the USA.
--- Example 4 ---
How many backlinks per day for new site?
Discussion in 'Black Hat SEO' started by Omoplata, Dec 3, 2010.
1) for a newly created site, what's the max # backlinks per day I should do to be safe?
2) how long do I have to let my site age before I can start making more blinks?
I did about 6000 forum profiles every 24 hours for 10 days for one of my sites which had a brand new domain.
There is three backlinks for every of these forum profile so thats 18 000 backlinks every 24 hours and nothing happene
--- Example 5 ---
The Denver Board of Education opened the 2017-18 school year with an update on projects that include new construction, upgrades, heat mitigation and quality learning environments.
We are excited that Denver students will be the beneficiaries of a four year, $572 million General Obligation Bond. Since the passage of the bond, our construction team has worked to schedule the projects over the four-year term of the bond.
Denver voters on Tuesday approved bond and mill funding measures for students
2. Identifying Data Quality Issues#
As you look through the examples above, and through raw web text more generally, you will run into problems such as:
Boilerplate Text: Phrases like “log in,” “terms of use,” or cookie consent notices.
Code and Markup: Snippets of JavaScript, HTML, or CSS that are not natural language.
Strange Formatting: Excessive line breaks, weird characters, or garbled text.
Non-Prose Content: Lists, tables, or other structured data that doesn’t read like a book.
Training a model on this kind of data can teach it to generate undesirable content. Our goal is to filter the dataset to keep only high-quality, natural language prose.
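To make these issues easier to spot, it can help to compute a few cheap statistics per document before deciding on any filter. The sketch below is only an illustration: `quality_signals` is a hypothetical helper, and the specific signals (alphanumeric ratio, share of lines ending in sentence punctuation, presence of curly braces) are assumptions chosen for this example, not a fixed recipe.

```python
def quality_signals(text: str) -> dict:
    """Compute a few cheap per-document statistics (illustrative choices)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_chars = len(text)
    return {
        "n_chars": n_chars,
        "n_lines": len(lines),
        # Share of letters and digits; markup and code tend to lower this.
        "alnum_ratio": sum(c.isalnum() for c in text) / n_chars if n_chars else 0.0,
        # Share of non-empty lines that end like a sentence.
        "terminal_punct_ratio": (
            sum(ln.rstrip().endswith((".", "!", "?", '"')) for ln in lines) / len(lines)
            if lines else 0.0
        ),
        # Curly braces are a rough hint of JavaScript, CSS, or JSON.
        "has_curly_brace": "{" in text or "}" in text,
    }


# Two toy documents (made up for this example, not taken from C4)
prose = "Denver voters approved the bond.\nConstruction starts next year."
markup = "<div class='nav'>\nfunction f() { return 1; }\nLog in | Sign up"

print(quality_signals(prose))
print(quality_signals(markup))
```

Printing these signals for a few hundred streamed examples is usually enough to see which kinds of pages dominate a sample and which statistic separates them.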
3. Data Filtering Techniques#
Data filtering is a deep and complex field, but we can apply some simple yet powerful heuristics. Here are a few common ones:
Length Filtering: Remove documents that are too short or too long.
Character Filtering: Remove documents with a high percentage of non-alphanumeric characters.
Boilerplate Removal: Remove documents containing common web boilerplate phrases (e.g., “JavaScript is disabled”).
Repetition Removal: Remove documents with highly repetitive lines or n-grams.
Let’s create a simple filtering function that combines a few of these ideas.
def is_high_quality(example):
    """A simple heuristic-based filter for data quality."""
    text = example['text']

    # 1. Length filter
    if len(text) < 200 or len(text) > 100000:
        return False

    # 2. Boilerplate filter
    boilerplate_phrases = [
        "terms of use", "privacy policy", "cookie policy",
        "subscribe to our newsletter", "enable javascript"
    ]
    lowered = text.lower()  # lowercase once instead of once per phrase
    if any(phrase in lowered for phrase in boilerplate_phrases):
        return False

    # 3. Character filter (check for high proportion of non-alphanumeric chars)
    # This can be a proxy for code or heavily formatted text.
    # Note that whitespace also counts as non-alphanumeric here.
    alphanumeric_chars = sum(c.isalnum() for c in text)
    if alphanumeric_chars / len(text) < 0.75:
        return False

    return True

# The .filter() method applies our function to each example
filtered_c4 = c4_dataset.filter(is_high_quality)

print("--- Filtered C4 Dataset Examples ---")
for i, example in enumerate(filtered_c4.take(5)):
    print(f"\n--- Example {i+1} ---")
    print(example['text'][:500])
--- Filtered C4 Dataset Examples ---
--- Example 1 ---
Beginners BBQ Class Taking Place in Missoula!
Do you want to get better at making delicious BBQ? You will have the opportunity, put this on your calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers. He will be teaching a beginner level class for everyone who wants to get better with their culinary skills.
He will teach you everything you need to know to compete in a KCBS BBQ competition, including techniques, recipes, timelines, meat select
--- Example 2 ---
Discussion in 'Mac OS X Lion (10.7)' started by axboi87, Jan 20, 2012.
I've got a 500gb internal drive and a 240gb SSD.
When trying to restore using disk utility i'm given the error "Not enough space on disk ____ to restore"
But I shouldn't have to do that!!!
Any ideas or workarounds before resorting to the above?
Use Carbon Copy Cloner to copy one drive to the other. I've done this several times going from larger HDD to smaller SSD and I wound up with a bootable SSD drive. One step you have to
--- Example 3 ---
How many backlinks per day for new site?
Discussion in 'Black Hat SEO' started by Omoplata, Dec 3, 2010.
1) for a newly created site, what's the max # backlinks per day I should do to be safe?
2) how long do I have to let my site age before I can start making more blinks?
I did about 6000 forum profiles every 24 hours for 10 days for one of my sites which had a brand new domain.
There is three backlinks for every of these forum profile so thats 18 000 backlinks every 24 hours and nothing happene
--- Example 4 ---
The Denver Board of Education opened the 2017-18 school year with an update on projects that include new construction, upgrades, heat mitigation and quality learning environments.
We are excited that Denver students will be the beneficiaries of a four year, $572 million General Obligation Bond. Since the passage of the bond, our construction team has worked to schedule the projects over the four-year term of the bond.
Denver voters on Tuesday approved bond and mill funding measures for students
--- Example 5 ---
BANGALORE CY JUNCTION SBC to GONDIA JUNCTION G train timings, routes, stops, and complete info.
As of now, 1 trains run between from BANGALORE CY JUNCTION (YPR) to GONDIA JUNCTION (G).
The fastest train from BANGALORE CY JUNCTION (YPR) to GONDIA JUNCTION (G) is YPR KRBA WAINGANGA EXP (12251) that departs at 23:40 and arrives to at 21:15. It takes approximately 21:35 hours.
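The repetition heuristic from the list of techniques is not part of `is_high_quality`. A minimal sketch of what it could look like follows. The function names and both thresholds are assumptions made for illustration, and very short texts will naturally score high on the n-gram measure, so this is meant to run after a length filter.

```python
from collections import Counter


def duplicate_line_fraction(text: str) -> float:
    """Fraction of non-empty lines that repeat an earlier line."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    counts = Counter(lines)
    duplicates = sum(c - 1 for c in counts.values())
    return duplicates / len(lines)


def top_ngram_fraction(text: str, n: int = 3) -> float:
    """Share of all word n-grams taken up by the most frequent n-gram."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return Counter(ngrams).most_common(1)[0][1] / len(ngrams)


def is_repetitive(text: str, max_dup_lines: float = 0.3,
                  max_top_ngram: float = 0.2) -> bool:
    """Flag a document if either repetition signal exceeds its threshold.

    Both thresholds are illustrative guesses, not tuned values.
    """
    return (duplicate_line_fraction(text) > max_dup_lines
            or top_ngram_fraction(text) > max_top_ngram)
```

It could be combined with the earlier filter by passing something like `lambda ex: is_high_quality(ex) and not is_repetitive(ex['text'])` to `.filter()`.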
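Our simple filter helps only a little: of the five raw examples, it dropped just the short product description, while the SEO forum thread still passes and a templated train-timetable page now shows up among the first five results. Professional dataset creation involves much more sophisticated pipelines. Let’s look at a dataset that has already undergone this process.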
4. A Look at a Highly Curated Dataset: FineWeb#
FineWeb, created by the Hugging Face team, is a great example of a state-of-the-art, highly filtered dataset. It starts from Common Crawl but applies a rigorous filtering and deduplication pipeline, resulting in over 15 trillion tokens of high-quality text. In this notebook we load FineWeb-Edu, a subset of FineWeb that additionally keeps only pages scored as educational by a quality classifier.
Let’s load a sample of FineWeb-Edu and compare it to the raw C4 examples.
I also recommend reading the FineWeb-Edu dataset card to get an idea of how quality is assessed for production datasets. Reference: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu.
# Load a sample of the FineWeb-Edu dataset
# We'll use a smaller 10B token sample for demonstration
fineweb_dataset = load_dataset("HuggingFaceFW/fineweb-edu", "sample-10BT", streaming=True, split='train')

print("--- FineWeb Dataset Examples ---")
for i, example in enumerate(fineweb_dataset.take(5)):
    print(f"\n--- Example {i+1} ---")
    print(example['text'][:500])
--- FineWeb Dataset Examples ---
--- Example 1 ---
The Independent Jane
For all the love, romance and scandal in Jane Austen’s books, what they are really about is freedom and independence. Independence of thought and the freedom to choose.
Elizabeth’s refusal of Mr. Collins offer of marriage showed an independence seldom seen in heroines of the day. Her refusal of Mr. Darcy while triggered by anger showed a level of independence that left him shocked and stunned.
The freedom she exhibited in finally accepting him in direct defiance of Lady Cath
--- Example 2 ---
Taking Play Seriously
By ROBIN MARANTZ HENIG
Published: February 17, 2008
On a drizzly Tuesday night in late January, 200 people came out to hear a psychiatrist talk rhapsodically about play -- not just the intense, joyous play of children, but play for all people, at all ages, at all times. (All species too; the lecture featured touching photos of a polar bear and a husky engaging playfully at a snowy outpost in northern Canada.) Stuart Brown, president of the National Institute for Play, was s
--- Example 3 ---
How do you get HIV?
HIV can be passed on when infected bodily fluid, such as blood or semen, is passed into an uninfected person. Semen is the liquid which is released from a man's penis during sex which carries sperm. It can be infected with HIV or AIDS when someone is HIV positive or is carrying the AIDS virus. This can happen during unprotected sex. For example when two people have sex without using a condom when one partner is already infected, or between drug users who inject and share need
--- Example 4 ---
CTComms sends on average 2 million emails monthly on behalf of over 125 different charities and not for profits.
Take the complexity of technology and stir in the complexity of the legal system and what do you get? Software licenses! If you've ever attempted to read one you know how true this is, but you have to know a little about software licensing even if you can't parse all of the fine print.
By: Chris Peters
March 10, 2009
A software license is an agreement between you and the owner of a pr
--- Example 5 ---
Hold the salt: UCLA engineers develop revolutionary new desalination membrane
Process uses atmospheric pressure plasma to create filtering 'brush layer'
Desalination can become more economical and used as a viable alternate water resource.
By Wileen Wong Kromhout
Originally published in UCLA Newsroom
Researchers from the UCLA Henry Samueli School of Engineering and Applied Science have unveiled a new class of reverse-osmosis membranes for desalination that resist the clogging which typically occ
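Deduplication, mentioned above as part of the FineWeb pipeline, is the one step we have not touched in code. Production pipelines typically rely on approximate methods such as MinHash to make this feasible at scale. The sketch below is a much simpler stand-in: exact deduplication on normalized text plus a pairwise Jaccard similarity over word n-grams. The function names and the normalization rule are assumptions for illustration.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash alike."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def dedup_exact(docs):
    """Yield each document whose normalized text has not been seen before."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


def jaccard(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity between the word n-gram sets of two texts."""
    def shingles(text):
        words = normalize(text).split()
        # Texts shorter than n words contribute a single, shorter shingle.
        return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


# Toy documents (made up for this example)
docs = ["Hello  World", "hello world", "Something else"]
print(list(dedup_exact(docs)))
print(jaccard("the cat sat on the mat today", "the cat sat on the mat"))
```

Comparing every pair of documents with `jaccard` is quadratic in the number of documents, which is exactly why large pipelines switch to hashing-based approximations.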
Conclusion: Quality Over Quantity#
Comparing the raw C4 examples with the FineWeb examples, the difference is clear. The FineWeb text is much cleaner, reads more like natural prose, and is free of the distracting artifacts common in raw web data.
However, extending this to a production pipeline is not straightforward. A lot of nuance goes into tuning the filtering thresholds and other hyperparameters at that scale.
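One modest way to get a feel for that tuning is to count, on a small sample, which filter rejects each document and how the counts shift when a threshold changes. The sketch below assumes a hypothetical `rejection_report` helper and two toy filters whose default thresholds mirror the ones used in `is_high_quality`; the documents are made up for the example.

```python
from collections import Counter


def rejection_report(docs, filters):
    """Count which named filter first rejects each document."""
    counts = Counter()
    for text in docs:
        for name, keep in filters:
            if not keep(text):
                counts[name] += 1
                break
        else:
            # No filter rejected the document.
            counts["kept"] += 1
    return counts


def make_filters(min_chars=200, min_alnum=0.75):
    """Build (name, predicate) pairs; thresholds are the knobs to tune."""
    return [
        ("too_short", lambda t: len(t) >= min_chars),
        ("low_alnum",
         lambda t: sum(c.isalnum() for c in t) / max(len(t), 1) >= min_alnum),
    ]


# Toy sample: a very short text, plain words, and symbol-heavy "code"
docs = ["short text", "word " * 60, "{};<>() " * 40]

for min_chars in (5, 200):
    report = rejection_report(docs, make_filters(min_chars=min_chars))
    print(f"min_chars={min_chars}: {dict(report)}")
```

Running the same report over a few thousand streamed examples, and reading a handful of rejected documents per filter, gives a quick sanity check before committing to a threshold.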