mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-05-07 18:18:42 -05:00
The solver.py refactoring renamed most solver classes from *Solver to *Model (e.g. DistributedSolver → DistributedModel). The docs still referenced the old names, causing the Quarto site build to fail with:

ImportError: cannot import name 'DistributedSolver' from 'mlsysim'

- Fix executable code cells in tutorials/distributed.qmd
- Update non-executable code examples across 10 doc files
- Rename 19 API reference files from *Solver.qmd to *Model.qmd
- SensitivitySolver and SynthesisSolver retain their names (correct)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
50 lines
1.6 KiB
Plaintext
# core.solver.CheckpointModel { #mlsysim.core.solver.CheckpointModel }

```python
core.solver.CheckpointModel()
```

Analyzes the storage constraints and I/O burst penalties of saving model states.

Training massive models requires saving hundreds of gigabytes (weights +
optimizer states) to persistent storage. This solver calculates the time
spent blocked on I/O, which reduces the cluster's Model FLOPs Utilization (MFU).

Literature Source:

1. Eisenman et al. (2022), "Check-N-Run: A Checkpointing System for
   Training Deep Learning Recommendation Models."

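
The description above boils down to a short piece of arithmetic. The following is a minimal sketch of that reasoning, not the `mlsysim` implementation: the function name `checkpoint_overhead`, its arguments, and the choice to treat the checkpoint interval as pure compute time followed by a fully blocking write are all assumptions made for illustration.

```python
def checkpoint_overhead(n_params, bytes_per_param, storage_gb_per_s, interval_hours):
    """Hypothetical helper: estimate checkpoint size, write time, and MFU penalty.

    Assumes every checkpoint write fully blocks training (no async or
    overlapped I/O) and that storage bandwidth is the only limit.
    """
    # Checkpoint size: every parameter stores its weight plus optimizer state.
    size_gb = n_params * bytes_per_param / 1e9

    # Write time: size divided by sustained storage write bandwidth.
    write_time_s = size_gb / storage_gb_per_s

    # Assumption: the interval is compute time, so one full cycle lasts
    # interval + write time, and the blocked fraction is lost utilization.
    interval_s = interval_hours * 3600.0
    mfu_penalty_pct = 100.0 * write_time_s / (interval_s + write_time_s)

    return size_gb, write_time_s, mfu_penalty_pct


# Example: 70B parameters, 12 bytes/param (assumed), 10 GB/s storage, 4 h interval.
size_gb, write_s, penalty_pct = checkpoint_overhead(70e9, 12, 10.0, 4.0)
print(f"{size_gb:.0f} GB, {write_s:.0f} s per checkpoint, {penalty_pct:.2f}% penalty")
```

Under these assumptions the penalty grows linearly with checkpoint size and shrinks as the interval lengthens, which is the trade-off the interval parameter exposes.
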
## Methods

| Name | Description |
| --- | --- |
| [solve](#mlsysim.core.solver.CheckpointModel.solve) | Solves for checkpoint size, write time, and resulting MFU penalty. |

### solve { #mlsysim.core.solver.CheckpointModel.solve }

```python
core.solver.CheckpointModel.solve(
    model,
    hardware,
    optimizer='adam',
    checkpoint_interval_hours=4.0,
)
```

Solves for checkpoint size, write time, and resulting MFU penalty.

#### Parameters {.doc-section .doc-section-parameters}

| Name | Type | Description | Default |
|------|------|-------------|---------|
| model | Workload | The model architecture. | _required_ |
| hardware | HardwareNode | The target hardware for storage bandwidth. | _required_ |
| optimizer | str | Optimizer type ('adam' or 'sgd'), determines bytes per parameter. | `'adam'` |
| checkpoint_interval_hours | float | Hours between checkpoints. | `4.0` |

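
To illustrate how an optimizer choice could determine bytes per parameter, here is a small sketch. The byte counts are assumptions, not values taken from `mlsysim`: it supposes fp32 weights (4 bytes), two fp32 moment buffers for Adam (8 more bytes), and one fp32 momentum buffer for SGD (4 more bytes). The names `BYTES_PER_PARAM` and `checkpoint_size_gb` are hypothetical.

```python
# Assumed bytes stored per parameter in a checkpoint (illustrative only).
BYTES_PER_PARAM = {
    "adam": 4 + 4 + 4,  # fp32 weight + first moment + second moment
    "sgd": 4 + 4,       # fp32 weight + momentum buffer
}


def checkpoint_size_gb(n_params, optimizer="adam"):
    """Hypothetical helper: checkpoint size in GB for a given optimizer."""
    if optimizer not in BYTES_PER_PARAM:
        raise ValueError(f"unsupported optimizer: {optimizer!r}")
    return n_params * BYTES_PER_PARAM[optimizer] / 1e9


adam_gb = checkpoint_size_gb(7e9, "adam")
sgd_gb = checkpoint_size_gb(7e9, "sgd")
print(f"7B params: adam={adam_gb:.0f} GB, sgd={sgd_gb:.0f} GB")
```

With these assumed byte counts, switching a 7B-parameter model from Adam to SGD shrinks the checkpoint by a third, which in turn shortens every blocking write.
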
#### Returns {.doc-section .doc-section-returns}

| Name | Type | Description |
|------|------|-------------|
| | CheckpointResult | Checkpoint size (GB), write time, storage bottleneck flag, and MFU penalty percentage. |