[Course thumbnail]

Welcome to LLMs from Scratch, an all-killer-no-filler curriculum that takes you from tokenization to alignment with meticulously crafted Jupyter notebooks, actionable theory, and production-ready code. Whether you are a researcher, engineer, or curious builder, this course gives you the scaffolding to demystify modern LLMs and deploy your own.

## 🚀 Course Highlights

- Hands-on notebooks for every lesson: clone locally or launch instantly in Lightning Studio.
- Practical checkpoints and datasets so you can experiment without babysitting boilerplate.
- Theory, references, and best practices interwoven with code so every concept sticks.
- Production-aware workflow covering training, scaling, alignment, quantization, and deployment-friendly fine-tuning.

## 📚 Course Structure

Each module is a standalone notebook packed with explanations, exercises, and implementation details. View them on GitHub, launch them via GitHub Pages, or open them interactively in Lightning Studio.

| Module | Topic | Notebook |
| --- | --- | --- |
| 01 | Tokenization Foundations | `01-tokenization.ipynb` |
| 02 | Building a Tiny LLM | `02-tinyllm.ipynb` |
| 03 | Advancing Our LLM | `03-advancing-our-llm.ipynb` |
| 04 | Data Engineering for LLMs | `04-data.ipynb` |
| 05 | Scaling Laws in Practice | `05-scaling-laws.ipynb` |
| 06 | Pretraining at Scale | `06-pretraining.ipynb` |
| 07 | Supervised Fine-Tuning | `07-supervised-finetuning.ipynb` |
| 08 | RLHF and Alignment | `08-rlhf-alignment.ipynb` |
| 09 | LoRA & RLVR Techniques | `09-lora-rlvr.ipynb` |
| 10 | Pruning & Distillation | `10-pruning-distillation.ipynb` |
| 11 | Appendix: Position Embeddings | `11-appendix-position-embeddings.ipynb` |
| 12 | Appendix: Quantisation Strategies | `12-appendix-quantisation.ipynb` |
| 13 | Appendix: Parameter-Efficient Tuning | `13-appendix-peft.ipynb` |
| 14 | Bonus: Energy-Based and Diffusion LLMs | `14-bonus-diffusion-llms.ipynb` |
| 15 | Bonus: State Space Models | `15-bonus-state-space-models.ipynb` |

## 🧠 What You'll Learn

- The end-to-end data flow of an LLM, from tokenization and batching to inference-time decoding (see the sketch after this list).
- How to implement core transformer components, attention variations, and optimization tricks.
- Strategies for scaling datasets, managing checkpoints, and monitoring training stability.
- Practical alignment techniques: SFT, preference modeling, RLHF, and reward modeling.
- Deployment-ready compression: pruning, distillation, quantization, and PEFT recipes.
- Bonus sections on energy-based models (EBMs), diffusion LLMs, and state space models (SSMs).
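
To make that first bullet concrete, here is a deliberately tiny, dependency-free sketch of the tokenize → model → decode loop. The character-level vocabulary and bigram "model" are illustrative stand-ins, not the course's actual tokenizers or transformers:

```python
# Toy end-to-end flow: tokenize -> "train" -> greedy decode.
# The bigram counter is a stand-in model, not the course's transformer.
from collections import Counter, defaultdict

text = "hello world, hello llms"
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}  # tokenizer: char -> id
itos = {i: ch for ch, i in stoi.items()}      # detokenizer: id -> char

def encode(s: str) -> list[int]:
    return [stoi[ch] for ch in s]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)

# "Training": count which token follows which.
counts = defaultdict(Counter)
ids = encode(text)
for prev, nxt in zip(ids, ids[1:]):
    counts[prev][nxt] += 1

def next_token(token_id: int) -> int:
    """Greedy decoding: pick the most frequent successor."""
    followers = counts[token_id]
    return followers.most_common(1)[0][0] if followers else token_id

# Inference: feed generated tokens back in, one step at a time.
out = encode("h")
for _ in range(10):
    out.append(next_token(out[-1]))
print(decode(out))  # "hellllllllll": greedy bigram decoding loops fast,
                    # which is exactly why real LLMs need more context.
```

Swap the bigram table for a transformer's logits and the argmax for a proper sampler, and this is roughly the loop the early modules build toward.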

โš™๏ธ Quick Start#

### Option A: Launch in Lightning Studio (no setup!)

1. Click the Open in Studio badge above.
2. Authenticate with Lightning (or create a free account).
3. Explore the notebooks in a fully provisioned environment with GPU options.
4. All model checkpoints are saved in the Studio, and you can test them with the code in `test-model.ipynb`.

### Option B: Run Locally

1. Clone the repository:

   ```bash
   git clone https://github.com/shreshthtuli/llms-from-scratch.git
   cd llms-from-scratch
   ```

2. Install dependencies (Python 3.10+ recommended):

   ```bash
   pip install uv
   uv sync
   ```

3. Add your API keys to a `.env` file, following `.env.example` (a hypothetical sketch appears after this list).

4. Launch Jupyter:

   ```bash
   jupyter lab
   ```

5. Open any notebook to start experimenting.
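
The repo's `.env.example` is the authoritative list of keys; purely as a hypothetical sketch, a `.env` for a course like this might look like:

```bash
# Hypothetical .env sketch; defer to the repo's .env.example for the real keys.
OPENAI_API_KEY=your-key-here   # if any notebooks call hosted-model APIs
HF_TOKEN=your-token-here       # e.g. for gated models/datasets on Hugging Face
```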

Need data? Check the `data/` directory and follow the dataset preparation steps inside each notebook.

## 🧭 Suggested Learning Path

1. **Foundations (Modules 01–03)**: understand tokens, build your first transformer, and iterate on architecture improvements.
2. **Data & Scaling (Modules 04–06)**: curate corpora, tune training loops, and scale pretraining experiments responsibly.
3. **Alignment (Modules 07–09)**: apply SFT, RLHF, and efficient adaptation techniques to align your model with human intent.
4. **Optimization (Modules 10–15)**: compress, fine-tune, and deploy models using state-of-the-art efficiency tricks.
5. **Capstone**: combine your learnings to train, align, and ship a bespoke LLM tailored to your use case.

Mix and match as needed: every notebook is designed to stand on its own, but following this order unlocks the smoothest learning curve.

## 🛠 Hands-On Playground

- **Lightning Studio**: run the entire repo in the cloud with zero setup using the badge above.
- **GitHub Codespaces**: launch a dev container directly from the repo for quick edits.
- **Local GPUs / Clusters**: scripts in `src/` support distributed and mixed-precision training out of the box; a sketch of the general pattern follows this list.
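
The `src/` scripts themselves aren't reproduced here. As a hedged illustration of the pattern that last bullet describes (mixed precision via `torch.amp`, multi-GPU via DDP), a minimal training step might look like the following; the model, data, and hyperparameters are placeholders, not the repo's actual code:

```python
# Illustrative mixed-precision + DDP training step (PyTorch >= 2.3),
# not the repo's src/ code. Launch: torchrun --nproc_per_node=<gpus> train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK/LOCAL_RANK/WORLD_SIZE in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(512, 512).cuda(local_rank)  # stand-in for the LLM
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
    scaler = torch.amp.GradScaler("cuda")  # keeps fp16 grads from underflowing

    for _ in range(10):  # stand-in for the real data loader
        x = torch.randn(8, 512, device=local_rank)
        with torch.amp.autocast("cuda"):   # forward pass in mixed precision
            loss = model(x).pow(2).mean()  # stand-in loss
        opt.zero_grad(set_to_none=True)
        scaler.scale(loss).backward()      # backward on the scaled loss
        scaler.step(opt)                   # unscales grads, then steps
        scaler.update()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```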

๐Ÿ‘จโ€๐Ÿซ About the Instructor#

I'm Shreshth Tuli: researcher, builder, and educator focused on making advanced ML systems approachable. I've shipped production LLMs, authored peer-reviewed papers, and taught hundreds of practitioners how to wield these models responsibly. Expect honest takes, transparent trade-offs, and plenty of real-world war stories.

More about me here. Connect with me on LinkedIn.

๐Ÿค Contributions#

Contributions, bug reports, and suggestions are warmly welcomed! To contribute:

1. Fork the repo and create a feature branch (a command sketch follows this list).
2. Open a PR describing your changes and the motivation behind them.
3. Tag any relevant notebooks or scripts and include screenshots/metrics if applicable.
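
Purely as a sketch of that workflow, with `<your-username>` and the branch name as placeholders:

```bash
# Hypothetical contributor workflow; adjust names to your fork and change.
git clone https://github.com/<your-username>/llms-from-scratch.git
cd llms-from-scratch
git checkout -b feature/my-improvement    # create a feature branch
# ...edit notebooks or scripts...
git commit -am "Describe the change and its motivation"
git push -u origin feature/my-improvement
# then open a PR against shreshthtuli/llms-from-scratch on GitHub
```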

Check the issue tracker for bite-sized tasks or open a discussion if you want to propose new modules.

## 📄 License

This project is open-sourced under the Apache 2.0 License. Feel free to use the materials for your own learning, workshops, or derivative courses; just keep attribution intact.

The best way to learn LLMs is to build one. 🚀