Training and Deploying Large-Scale Models

MVA 2025-2026 — 2nd semester

Instructor: E. Oyallon
Format: 8 sessions
Distributed training · Scaling strategies · Modern toolchains

Course objective

This course introduces the foundations and practices of training modern Large Language Models (LLMs) at scale. You will learn how deep learning models are trained across multiple GPUs, nodes, and clusters—and why distributed training is essential for today’s largest AI systems.

We will cover:

  • Core techniques for distributed training
  • Modern frameworks and scaling strategies
  • Practical implementations with real-world toolchains
  • Theoretical underpinnings of large-scale learning
  • Inference and applications

As LLMs grow in complexity and impact, understanding how they are built and deployed has become essential for researchers and engineers. This series bridges engineering and theory.

Organization

  • Enrollment limited to 60 students (no external auditors)
  • 7 sessions: 2h lecture + 2h hands-on lab
  • Final session: grading
  • Projects and homeworks are done in groups of 2

Bring your own laptop to the lab sessions, and make sure you can install the PyTorch nightly build with GPU support before the first lab.

Lectures & Labs

#  Topic                                                 Date      Materials
1  Getting Started on Distributed LLM Training           22/01/26
2  Systems for ML                                        29/01/26
3  Parallelization techniques for LLM training           05/02/26
4  Communication bottlenecks and decentralized training  12/02/26
5  Post-training                                         26/02/26
6  Serving and deployment                                05/03/26
7  Agentic AI (tentative)                                12/03/26
8  Grading                                               26/03/26


Lab setup: install & verify pytorch / torchtitan / torchft / netbird

Use a clean virtual environment, then follow the steps below.
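The clean-environment step can be sketched as follows; the environment name .mva-llm is illustrative, so adjust paths to taste.

```shell
# Create and activate a fresh virtual environment (the name .mva-llm is an assumption).
python3 -m venv .mva-llm
source .mva-llm/bin/activate
# Confirm the interpreter and pip now resolve inside the venv.
which python
python -m pip --version
```

With the environment active, the per-tool installs below will land inside .mva-llm rather than your system Python.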

  • pytorch (labs 1–6)
    Install the PyTorch nightly build using the official instructions.
  • torchtitan (labs 2, 4, 5)
    Install from source using the official instructions.
    Quick test:
    NGPU=1 CONFIG_FILE="./torchtitan/models/nanogpt/train_configs/debug_model.toml" ./run_train.sh
  • torchft (lab 4)
    Install via pip using the official instructions.
    To test, run the lighthouse and two replicas following the usage guide.
  • netbird (lab 4)
    Install using the official instructions. NetBird is a VPN and may require sudo. If you cannot use sudo, try the Docker setup.
    Quick test: confirm that netbird up works. In practice, we will use netbird up --setup-key $KEY.

Grading & Deadlines

HW1: 25% · HW2: 25% · Project: 50%

Project: choose one paper (from NeurIPS, MLSys, or a similar venue) related to the lecture topics and get approval before you start. Grades and Groups.

Item                   Release    Due        Links
Group constitution     -          05/02/26
HW1                    05/02/26   19/02/26
HW2                    12/02/26   26/02/26
Project proposal       -          26/02/26
Final report / poster  26/03/26   26/03/26