Course objective
This course introduces the foundations and practices of training modern Large Language Models (LLMs) at scale. You will learn how deep learning models are trained across multiple GPUs, nodes, and clusters, and why distributed training is essential for today’s largest AI systems.
We will cover:
- Core techniques for distributed training
- Modern frameworks and scaling strategies
- Practical implementations with real-world toolchains
- Theoretical underpinnings of large-scale learning
- Inference and applications
As LLMs grow in complexity and impact, understanding how they are built and deployed has become essential for researchers and engineers. This series bridges engineering and theory.
Organization
- Enrollment limited to 60 students (no external auditors)
- 7 sessions: 2h lecture + 2h hands-on lab
- Final session: grading
- Projects and HWs are done in groups of 2
Bring your own laptop for the lab sessions, and make sure you can install the nightly version of PyTorch with GPU support before the first lab.
Lectures & Labs
| # | Topic | Date | Materials |
|---|---|---|---|
| 1 | Getting Started on Distributed LLM Training | 22/01/26 | Slides Lab |
| 2 | Systems for ML | 29/01/26 | Slides Lab |
| 3 | Parallelization techniques for LLM training | 05/02/26 | Slides Lab |
| 4 | Communication bottlenecks and decentralized training | 12/02/26 | Slides Lab |
| 5 | Post-training | 26/02/26 | Slides Lab |
| 6 | Serving and deployment | 05/03/26 | Slides Lab |
| 7 | Agentic AI (tentative) | 12/03/26 | Slides Lab |
| 8 | Grading | 26/03/26 | Instructions |
Credits
Lab setup: install & verify pytorch / torchtitan / torchft / netbird
Use a clean virtual environment, then follow the steps below.
- pytorch (labs 1–6)
  Install the PyTorch nightly build using the official instructions.
- torchtitan (labs 2, 4, 5)
  Install from source using the official instructions.
  Quick test: `NGPU=1 CONFIG_FILE="./torchtitan/models/nanogpt/train_configs/debug_model.toml" ./run_train.sh`
- torchft (lab 4)
  Install via pip using the official instructions.
  To test, run the lighthouse and two replicas following the usage guide.
- netbird (lab 4)
  Install using the official instructions. NetBird is a VPN and may require `sudo`. If you cannot use `sudo`, try the Docker setup.
  Quick test: confirm that `netbird up` works. In practice, we will use `netbird up --setup-key $KEY`.
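Before running the quick tests above, it can help to confirm that each tool is at least visible from the active virtual environment. The script below is a minimal sketch of such a check, not part of the official course material. The function name `check_lab_environment` is hypothetical, and the import names `torchtitan` and `torchft` are assumed to match the package names. It only reports whether the packages can be located and whether PyTorch sees a GPU, so it does not replace the per-tool quick tests.

```python
import importlib.util
import shutil


def check_lab_environment():
    """Report which lab dependencies are visible from the current environment."""
    report = {}

    # Python packages: locate the module without importing it.
    # Assumption: the import names match the package names.
    for module in ("torch", "torchtitan", "torchft"):
        report[module] = importlib.util.find_spec(module) is not None

    # NetBird is a command-line tool, so look for its binary on PATH.
    report["netbird"] = shutil.which("netbird") is not None

    # GPU details are only filled in if PyTorch imports successfully.
    gpu = {"version": None, "cuda_available": False, "device_count": 0}
    if report["torch"]:
        try:
            import torch

            gpu["version"] = torch.__version__
            gpu["cuda_available"] = torch.cuda.is_available()
            gpu["device_count"] = torch.cuda.device_count()
        except (ImportError, OSError):
            # The package was found but could not be loaded (broken install).
            report["torch"] = False
    report["gpu"] = gpu
    return report


if __name__ == "__main__":
    for name, status in check_lab_environment().items():
        print(f"{name}: {status}")
```

If `torch` is reported as found but `cuda_available` is `False`, the installed build may be CPU-only or the GPU driver may not be visible. Reinstalling the nightly build with GPU support is the likely next step.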
Grading & Deadlines
Project: choose one paper (NeurIPS, MLSys, or similar) related to the lecture topics and get approval before you start. Grade and Groups.
| Item | Release | Due | Links |
|---|---|---|---|
| Group constitution | - | 05/02/26 | Instruction |
| HW1 | 05/02/26 | 19/02/26 | Statement |
| HW2 | 12/02/26 | 26/02/26 | Statement |
| Project proposal | - | 26/02/26 | Guidelines |
| Final report / poster | 26/03/26 | 26/03/26 | |