Training and Deploying Large-Scale Models

MVA 2025-2026 — 2nd semester

Instructor: E. Oyallon
Format: 8 sessions
Distributed training · Scaling strategies · Modern toolchains

Course objective

This course introduces the foundations and practices of training modern Large Language Models (LLMs) at scale. You will learn how deep learning models are trained across multiple GPUs, nodes, and clusters—and why distributed training is essential for today’s largest AI systems.

We will cover:

  • Core techniques for distributed training
  • Modern frameworks and scaling strategies
  • Practical implementations with real-world toolchains
  • Theoretical underpinnings of large-scale learning
  • Inference and applications

As LLMs grow in complexity and impact, understanding how they are built and deployed has become essential for researchers and engineers. This series bridges engineering and theory.

Organization

  • Enrollment limited to 60 students (no external auditors)
  • 7 sessions: 2h lecture + 2h hands-on lab
  • Final session: grading
  • Projects and homeworks are done in groups of 2

Bring your own laptop to the lab sessions, and make sure you can install the PyTorch nightly build with GPU support before the first lab.

Lectures & Labs

#  Topic                                                 Date      Materials
1  Getting Started on Distributed LLM Training           22/01/26
2  Systems for ML                                        29/01/26
3  Parallelization techniques for LLM training           05/02/26
4  Communication bottlenecks and decentralized training  12/02/26
5  Post-training                                         26/02/26
6  Serving and deployment                                05/03/26
7  Agentic AI (tentative)                                12/03/26
8  Grading                                               26/03/26


Lab setup: install & verify pytorch / torchtitan / torchft / netbird

Use a clean virtual environment, then follow the steps below.
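The clean-environment step can be sketched as follows; the environment name .mva-llm is illustrative, so adjust paths to taste.

```shell
# Create and activate a fresh virtual environment (the name .mva-llm is an assumption).
python3 -m venv .mva-llm
source .mva-llm/bin/activate
# Confirm the interpreter and pip now resolve inside the venv.
which python
python -m pip --version
```

With the environment active, the per-tool installs below will land inside .mva-llm rather than your system Python.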

  • pytorch (labs 1–6)
    Install the PyTorch nightly build using the official instructions.
  • torchtitan (labs 2, 4, 5)
    Install from source using the official instructions.
    Quick test:
    NGPU=1 CONFIG_FILE="./torchtitan/models/nanogpt/train_configs/debug_model.toml" ./run_train.sh
  • torchft (lab 4)
    Install via pip using the official instructions.
    To test, run the lighthouse and two replicas following the usage guide.
  • netbird (lab 4)
    Install using the official instructions. NetBird is a VPN and may require sudo. If you cannot use sudo, try the Docker setup.
    Quick test: confirm that netbird up works. In practice, we will use netbird up --setup-key $KEY.

Grading & Deadlines

HW1: 25% · HW2: 25% · Project: 50%

Project: choose one paper (from NeurIPS, MLSys, or a similar venue) related to the lecture topics and get approval before you start. Grades and Groups.

Item                   Release    Due        Links
Group constitution     -          05/02/26
HW1                    05/02/26   19/02/26
HW2                    12/02/26   26/02/26
Project proposal       -          26/02/26
Final report / poster  26/03/26   26/03/26