Machine Learning Systems
Principles and Practices of Engineering Artificially Intelligent Systems
📘 Volume I • 📙 Volume II (Summer 2026) • Tiny🔥Torch • 🚀 MLSys·im • 🌐 Ecosystem
📚 Hardcopy edition coming 2026 with MIT Press.
Note
You are on the `dev` branch. This is the default branch and where active development happens. I am restructuring the textbook from a single volume into two focused, tighter volumes. New content is being added, existing content is being refined, and diagrams are being updated throughout. For the last stable release, see the `main` branch.
Branch Guide
This repository uses dev as the default branch. Here is how the branches relate to the book:
┌─────────────────────────────────────────────────────────────────────────┐
│ BRANCH STRUCTURE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ main (last stable release) │
│ ├── Single-volume textbook (published and available) │
│ └── Stable PDF, EPUB, and online edition │
│ │ │
│ │ ┌──────────────────────────────────────────────────────┐ │
│ └──┤ RESTRUCTURING IN PROGRESS │ │
│ │ Refocusing into two tighter volumes │ │
│ └────────────────────┬─────────────────────────────────┘ │
│ │ │
│ dev (default branch, you are here) │
│ ├── 📘 Volume I: Introduction to Machine Learning Systems │
│ │ Foundations for single-machine ML (1 to 8 GPUs) │
│ │ Status: Content complete, undergoing editorial polish │
│ │ │
│ └── 📙 Volume II: Machine Learning Systems at Scale │
│ Distributed systems and production infrastructure │
│ Status: Active development, chapters being written │
│ │
└─────────────────────────────────────────────────────────────────────────┘
| Branch | What It Contains | Status |
|---|---|---|
| `main` | Last stable single-volume release | Stable, available for reading |
| `dev` (you are here) | Two-volume restructured textbook | Under active development |
Working in the Open
I develop this textbook in the open. Like an artist who paints in a public studio, I do my work where anyone can watch, learn from the process, and contribute.
What this means for you:
- Volume I content is mature and undergoing final editorial polish. It is ready for classroom use.
- Volume II is actively being written. Chapters, diagrams, and sections are in various stages of completion. Expect rough edges, placeholder figures, and sections under construction. This is normal. You are seeing the book being built.
- The transition from one volume to two means some cross-references, navigation, and structure are still being updated.
I believe open development produces better textbooks. Every commit, every revision, every editorial decision is visible. If you want the polished, stable version, use the main branch. If you want to see where the book is headed, or help shape it, you are in the right place.
Mission
The world is rushing to build AI systems. It is not engineering them.
Closing that gap is what I mean by AI engineering.
AI engineering is the discipline of building efficient, reliable, safe, and robust intelligent systems that operate in the real world, not just models in isolation.
The mission of this project: Establish AI engineering as a foundational discipline, alongside software engineering and computer engineering, by teaching how to design, build, and evaluate end-to-end intelligent systems. The long-term impact of AI will be shaped by engineers who can turn ideas into working, dependable systems.
Start Here
This repository is the open learning stack for AI systems engineering: textbook source, TinyTorch, hardware kits, and upcoming co-labs that connect principles to runnable code and real devices. Choose a path based on your goal.
READ Start with the textbook:
- 📘 Volume I: Introduction to Machine Learning Systems covers ML basics, development, optimization, and operations. Available now.
- 📙 Volume II: Machine Learning Systems at Scale covers distributed systems, production infrastructure, and responsible AI at scale. Coming Summer 2026.
BUILD Start TinyTorch with the getting started guide. Begin with Module 01 and work up from CNNs to transformers and the MLPerf benchmarks.
DEPLOY Pick a hardware kit and run the labs on Arduino, Raspberry Pi, and other edge devices.
SIMULATE Explore the MLSys·im Engine to calculate the physics of ML infrastructure and run declarative IaC configurations from your terminal.
CONNECT Say hello in Discussions. We will do our best to reply.
The Learning Stack
The learning stack below shows how the textbook connects to hands-on work and deployment. Read the textbook, then pick your path:
┌───────────────────────────────────────────────────────────────────────────────┐
│ │
│ MACHINE LEARNING SYSTEMS │
│ Read the Textbook │
│ │
│ Theory • Concepts • Best Practices │
│ │
└───────────────────────────────────────┬───────────────────────────────────────┘
│
┌─────────────┼─────────────┐
│ │ │
▼ ▼ ▼
┌───────────────────────────────────────────────────────────────────────────────┐
│ HANDS-ON ACTIVITIES │
│ (pick one or all) │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ │ │ │ │ │ │
│ │ SOFTWARE │ │ TINYTORCH │ │ HARDWARE │ │
│ │ CO-LABS │ │ FRAMEWORK │ │ LABS │ │
│ │ │ │ │ │ │ │
│ │ EXPLORE │ │ BUILD │ │ DEPLOY │ │
│ │ │ │ │ │ │ │
│ │ Run controlled │ │ Understand │ │ Engineer under │ │
│ │ experiments on │ │ frameworks by │ │ real constraints│ │
│ │ latency, memory,│ │ implementing │ │ memory, power, │ │
│ │ energy, cost │ │ them │ │ timing, safety │ │
│ │ │ │ │ │ │ │
│ │ (coming 2026) │ │ │ │ Arduino, Pi │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │
│ EXPLORE BUILD DEPLOY │
│ │
└───────────────────────────────────────┬───────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────────────────────────┐
│ │
│ AI OLYMPICS │
│ Prove Mastery │
│ │
│ Compete across all tracks • University teams • Public leaderboards │
│ │
│ (coming 2026) │
│ │
└───────────────────────────────────────────────────────────────────────────────┘
| Component | Resource | What You Do | Link |
|---|---|---|---|
| READ | 📖 Textbook | Understand ML systems concepts | book/ |
| | 📘 Volume I | Build, Optimize, Deploy | |
| | 📙 Volume II | Scale, Distribute, Govern | Summer 2026 |
| EXPLORE | 🔮 Software Co-Labs | Run controlled experiments on latency, memory, energy, cost | Coming 2026 |
| BUILD | 🔥 TinyTorch | Understand frameworks by implementing them | tinytorch/ |
| DEPLOY | 🔧 Hardware Kits | Engineer under real constraints: memory, power, timing, safety | kits/ |
| PROVE | 🏆 AI Olympics | Compete and benchmark across all tracks | Coming 2026 |
What each path teaches:
- EXPLORE teaches why — Understand tradeoffs. Change batch sizes, precision, and model architectures, and see how latency, memory, and accuracy shift.
- BUILD teaches how — Understand internals. Implement autograd, optimizers, and attention from scratch to see how TensorFlow and PyTorch actually work.
- DEPLOY teaches where — Understand constraints. Face real memory limits, power budgets, and latency requirements on actual hardware.
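To make the BUILD path concrete: implementing autograd yourself is the heart of understanding how frameworks like PyTorch work. Below is a minimal scalar autograd sketch in that spirit. It is illustrative only; the `Value` class and its API are invented for this example and are not TinyTorch's actual implementation.

```python
class Value:
    """A scalar that records the operation producing it, enabling backprop."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None  # set by the op that creates this node

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def backward_fn():
            self.grad += out.grad   # d(a+b)/da = 1
            other.grad += out.grad  # d(a+b)/db = 1
        out._backward = backward_fn
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward_fn():
            self.grad += other.data * out.grad  # d(a*b)/da = b
            other.grad += self.data * out.grad  # d(a*b)/db = a
        out._backward = backward_fn
        return out

    def backward(self):
        # Topologically sort the graph, then apply the chain rule in reverse.
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

x = Value(3.0)
y = Value(4.0)
z = x * y + x  # z = x*y + x, so dz/dx = y + 1 = 5, dz/dy = x = 3
z.backward()
print(x.grad, y.grad)  # 5.0 3.0
```

TinyTorch walks through the full version of this idea, extended to tensors, broadcasting, and a real optimizer loop.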
What You Will Learn
This textbook teaches you to think at the intersection of machine learning and systems engineering. Each chapter bridges algorithmic concepts with the infrastructure that makes them work in practice.
The ML ↔ Systems Bridge
| ML Concept | Systems Concept | What You Learn |
|---|---|---|
| Model parameters | Memory constraints | How to fit large models on resource-limited devices |
| Inference latency | Hardware acceleration | How GPUs, TPUs, and accelerators execute neural networks |
| Training convergence | Compute efficiency | How mixed-precision and optimization techniques reduce cost |
| Model accuracy | Quantization and pruning | How to compress models while preserving performance |
| Data requirements | Pipeline infrastructure | How to build efficient data loading and preprocessing |
| Model deployment | MLOps practices | How to monitor, version, and update models in production |
| Privacy constraints | On-device learning | How to train and adapt models without sending data to the cloud |
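The first row of this bridge, parameters versus memory, reduces to simple arithmetic: the memory needed just to hold a model's weights is roughly parameter count times bytes per parameter. A back-of-envelope sketch, using an illustrative 7B-parameter model (the function name and figures here are assumptions for illustration, not from the book):

```python
def model_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Rough memory (in GB) needed just to hold a model's weights."""
    return num_params * bytes_per_param / 1e9

params = 7e9  # a 7-billion-parameter model, for illustration
for name, nbytes in [("fp32", 4), ("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{name}: {model_memory_gb(params, nbytes):.1f} GB")
# fp32: 28.0 GB, fp16: 14.0 GB, int8: 7.0 GB, int4: 3.5 GB
```

This is why quantization sits on the systems side of the bridge: dropping from fp32 to int8 cuts weight memory by 4x before any algorithmic cleverness, and the interesting engineering question is how much accuracy survives the compression.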
Book Structure
This textbook started as a single volume. I am restructuring it into two focused, tighter volumes. The restructuring is not simply splitting the original in half. It is an editorial refinement: sharpening each volume's focus, removing overlap, and ensuring each stands on its own. Volume I is complete and available now. Volume II is actively being developed and will be published Summer 2026. The main branch still holds the original single-volume edition. This dev branch is where the two-volume restructuring is happening.
| Volume | Title | Focus | Status |
|---|---|---|---|
| Volume I | Introduction to Machine Learning Systems | One machine, one to eight accelerators. Foundations, optimization, and deployment. | Available now |
| Volume II | Machine Learning Systems at Scale | Many machines, thousands of accelerators. Distributed training, infrastructure, and production at scale. | Coming Summer 2026 |
The full textbook combines both volumes for comprehensive coverage.
Volume I: Introduction to Machine Learning Systems
| Part | Focus | Chapters |
|---|---|---|
| I. Foundations | Core concepts | Introduction, ML Systems, ML Workflow, Data Engineering |
| II. Development | Building blocks | Neural Computation, Architectures, Frameworks, Training |
| III. Optimization | Making it fast | Data Selection, Model Compression, HW Acceleration, Benchmarking |
| IV. Deployment | Making it work | Model Serving, MLOps, Responsible Engineering |
Volume II: Machine Learning Systems at Scale (Coming Summer 2026)
Caution
Volume II is a work in progress. All 16 chapters exist and the structure is locked in, but the content within each chapter is still being written and revised. I share it openly because I believe in transparent development.
Volume II picks up where Volume I ends, moving from a single machine to fleets of machines. It covers the mathematical and algorithmic demand for scale, how to build the physical infrastructure that meets it, how to serve models to billions of users, and how to do all of this safely and responsibly.
| Part | Focus | Chapters |
|---|---|---|
| I. Foundations of Scale | The logic of distributed systems | Introduction to Scale, Distributed Training, Collective Communication, Fault Tolerance |
| II. Building the Fleet | Physical infrastructure | Compute Infrastructure, Network Fabrics, Data Storage, Fleet Orchestration |
| III. Deployment at Scale | Serving at global scale | Inference at Scale, Performance Engineering, Edge Intelligence, Ops at Scale |
| IV. Production Concerns | Safety and governance | Security & Privacy, Robust AI, Sustainable AI, Responsible AI |
Design Philosophy
This is a living textbook. I keep it updated as the field grows, with community input along the way.
AI headlines move fast. The engineering principles underneath move much more slowly. Parallelism, memory hierarchies, reliability mathematics, and quantization theory are as relevant today as they were a decade ago and will remain so a decade from now. This textbook is built around those enduring foundations.
Whether you are reading a chapter, running a lab, or sharing feedback, you are helping make these ideas more accessible to the next learner.
Research to Teaching Loop
The same loop drives both research and teaching: define the system problem, build a reference implementation, benchmark it, then turn it into curriculum and tooling so others can reproduce and extend it.
| Loop Step | Research Artifacts | Teaching Artifacts |
|---|---|---|
| Measure | Benchmarks, suites, metrics | Benchmarking chapter, assignments |
| Build | Reference systems, compilers, runtimes | TinyTorch modules, co-labs |
| Deploy | Hardware targets, constraints, reliability | Hardware labs, kits |
Support This Work
We are working toward 1 million learners by 2030 so that AI engineering becomes a shared, teachable discipline, not a collection of isolated practices. Every star, share, and contribution helps move this effort forward.
Why GitHub Stars Matter
What gets measured gets improved.
Each star is a learner, educator, or supporter who believes AI systems should be engineered with rigor and real world constraints in mind.
1 learner → 10 learners → 100 learners → 1,000 learners → 10,000 learners → 100,000 learners → 1M learners
Stars are not the goal. They are a signal.
A visible, growing community makes it easier for universities, foundations, and industry partners to adopt this material, donate hardware, and fund workshops. That momentum lowers the barrier for the next institution, the next classroom, and the next cohort of learners.
Support raised through this signal flows into Open Collective and funds concrete outcomes such as TinyML4D workshops, hardware kits for underserved classrooms, and the infrastructure required to keep this resource free and open.
One click can unlock the next classroom, the next contributor, and the next generation of AI engineers.
Fund the Mission
All contributions go to Open Collective, a transparent fund that supports educational outreach.
Community and Resources
| Resource | Description |
|---|---|
| 📖 Full Textbook | Complete interactive online textbook (both volumes) |
| 📘 Volume I | Build, Optimize, Deploy |
| 📙 Volume II | Scale, Distribute, Govern (Summer 2026) |
| 🧮 MLSys·im | First-principles analytical simulator |
| 🔥 TinyTorch | Build ML frameworks from scratch |
| 🔧 Hardware Kits | Deploy to Arduino, Raspberry Pi, edge devices |
| 🌐 Ecosystem | Resources, workshops, and community |
| 💬 Discussions | Questions and ideas |
Contributing
Contributions to the book, TinyTorch, and hardware kits are welcome!
| I want to... | Go here |
|---|---|
| Fix a typo or improve a chapter | book/docs/CONTRIBUTING.md |
| Add a TinyTorch module or fix a bug | tinytorch/CONTRIBUTING.md |
| Improve hardware labs | kits/README.md |
| Report an issue | GitHub Issues |
| Ask a question | GitHub Discussions |
Citation & License
Citation
```bibtex
@inproceedings{reddi2024mlsysbook,
  title        = {MLSysBook.AI: Principles and Practices of Machine Learning Systems Engineering},
  author       = {Reddi, Vijay Janapa},
  booktitle    = {2024 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS)},
  pages        = {41--42},
  year         = {2024},
  organization = {IEEE},
  url          = {https://mlsysbook.org}
}
```
License
This project uses a dual-license structure:
| Component | License | What It Means |
|---|---|---|
| Book content | CC BY-NC-ND 4.0 | Share freely with attribution; no commercial use; no derivatives |
| TinyTorch code | Apache 2.0 | Use, modify, and distribute freely; includes patent protection |
The textbook content (chapters, figures, explanations) is educational material that should circulate with attribution and without commercial exploitation. The software framework is a tool designed to be easy for anyone to use, modify, or integrate into their own projects.
Contributors
Thanks goes to these wonderful people who have contributed to making this resource better for everyone!
Legend: 🪲 Bug Hunter · 🧑💻 Code Contributor · ✍️ Doc Wizard · 🎨 Design Artist · 🧠 Idea Spark · 🔎 Code Reviewer · 🧪 Test Tinkerer · 🛠️ Tool Builder
📖 Textbook Contributors
🔥 TinyTorch Contributors
- Vijay Janapa Reddi 🪲 🧑💻 🎨 ✍️ 🧠 🔎 🧪 🛠️
- kai 🪲 🧑💻 🎨 ✍️ 🧪
- Dang Truong 🪲 🧑💻 ✍️ 🧪
- Didier Durand 🪲 🧑💻 ✍️
- Pratham Chaudhary 🪲 🧑💻 ✍️
- Karthik Dani 🪲 🧑💻
- Avik De 🪲 🧪
- Takosaga 🪲 ✍️
- rnjema 🧑💻 🛠️
- joeswagson 🧑💻 🛠️
- AndreaMattiaGaravagno 🧑💻 ✍️
- Rolds 🪲 🧑💻
- Amir Alasady 🪲
- jettythek 🧑💻
- wzz 🪲
- Ng Bo Lin ✍️
- keo-dara 🪲
- Wayne Norman 🪲
- Ilham Rafiqin 🪲
- Oscar Flores ✍️
- harishb00a ✍️
- Pastor Soto ✍️
- Salman Chishti 🧑💻
- Aditya Mulik ✍️
🛠️ Hardware Kits Contributors
- Vijay Janapa Reddi 🪲 🧑💻 🎨 ✍️ 🧪 🛠️
- Marcelo Rovai ✍️ 🧑💻 🎨
- Salman Chishti 🧑💻
- Pratham Chaudhary 🧑💻
🧪 Labs Contributors
- Vijay Janapa Reddi 🧑💻 🎨 ✍️
- Salman Chishti 🧑💻
- Pratham Chaudhary 🧑💻
⭐ Star us on GitHub • ✉️ Subscribe • 💬 Join discussions • 🌐 Visit mlsysbook.ai
Made with ❤️ for AI engineers
in the making, around the world 🌎