mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-05-07 18:18:42 -05:00
The solver.py refactoring renamed most solver classes from *Solver to *Model (e.g. DistributedSolver → DistributedModel). The docs still referenced the old names, causing the Quarto site build to fail with: ImportError: cannot import name 'DistributedSolver' from 'mlsysim' - Fix executable code cells in tutorials/distributed.qmd - Update non-executable code examples across 10 doc files - Rename 19 API reference files from *Solver.qmd to *Model.qmd - SensitivitySolver and SynthesisSolver retain their names (correct) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# core.solver.ReliabilityModel { #mlsysim.core.solver.ReliabilityModel }

```python
core.solver.ReliabilityModel()
```

Calculates Mean Time Between Failures (MTBF) and optimal checkpointing intervals.

This model handles reliability modeling for massive clusters, helping
determine the 'Goodput' of long-running training jobs. It estimates
the probability that a job fails before completion and calculates the
Young-Daly optimal checkpoint interval to minimize wasted compute time.

Literature Source:
1. Young (1974), "A First-Order Approximation to the Optimum Checkpoint Interval."
2. Daly (2006), "A Higher Order Estimate of the Optimum Checkpoint Interval for Restart Dumps."

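The Young-Daly interval mentioned above can be sketched in a few lines. This is a minimal standalone illustration of the two published formulas, not the code inside `ReliabilityModel`. The function names, the example node count, and the per-node MTBF are hypothetical, and the cluster MTBF assumes independent, exponentially distributed node failures.

```python
import math


def cluster_mtbf_s(node_mtbf_hours: float, num_nodes: int) -> float:
    """Cluster MTBF in seconds, assuming independent exponential node
    failures, so per-node failure rates simply add up."""
    return node_mtbf_hours * 3600.0 / num_nodes


def young_interval_s(checkpoint_time_s: float, mtbf_s: float) -> float:
    """Young (1974) first-order optimum: sqrt(2 * delta * M)."""
    return math.sqrt(2.0 * checkpoint_time_s * mtbf_s)


def daly_interval_s(checkpoint_time_s: float, mtbf_s: float) -> float:
    """Daly (2006) higher-order estimate.

    For delta < 2M the first-order term is scaled by a short series in
    sqrt(delta / 2M) and reduced by delta; otherwise the estimate is M.
    """
    delta, m = checkpoint_time_s, mtbf_s
    if delta >= 2.0 * m:
        return m
    x = delta / (2.0 * m)
    return math.sqrt(2.0 * delta * m) * (1.0 + math.sqrt(x) / 3.0 + x / 9.0) - delta


# Hypothetical cluster: 10,000 nodes, 50,000 h MTBF per node, 60 s checkpoints.
mtbf = cluster_mtbf_s(node_mtbf_hours=50_000.0, num_nodes=10_000)
young = young_interval_s(60.0, mtbf)
daly = daly_interval_s(60.0, mtbf)
print(f"cluster MTBF: {mtbf / 3600.0:.1f} h")
print(f"Young interval: {young:.0f} s, Daly interval: {daly:.0f} s")
```

When the checkpoint time is small relative to the MTBF, the two estimates stay close. Daly's correction matters more as checkpoint cost approaches the time between failures.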
## Methods

| Name | Description |
| --- | --- |
| [solve](#mlsysim.core.solver.ReliabilityModel.solve) | Calculates reliability and checkpointing metrics for a fleet. |

### solve { #mlsysim.core.solver.ReliabilityModel.solve }

```python
core.solver.ReliabilityModel.solve(
    fleet,
    job_duration_hours,
    checkpoint_time_s=60.0,
)
```

Calculates reliability and checkpointing metrics for a fleet.
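This page does not show the type of `fleet` or the fields that `solve` returns, so the sketch below does not call the library. It is a hypothetical standalone function that takes a node count and per-node MTBF in place of `fleet` and computes the kind of metrics the description refers to. The failure model (independent exponential failures), the first-order goodput estimate, and all names other than `job_duration_hours` and `checkpoint_time_s` are assumptions.

```python
import math


def reliability_metrics(
    num_nodes: int,
    node_mtbf_hours: float,
    job_duration_hours: float,
    checkpoint_time_s: float = 60.0,
) -> dict:
    """Hypothetical stand-in for the metrics a reliability model might report."""
    # Independent exponential failures: cluster MTBF shrinks with node count.
    mtbf_hours = node_mtbf_hours / num_nodes
    mtbf_s = mtbf_hours * 3600.0

    # Probability of at least one failure before the job completes.
    failure_probability = 1.0 - math.exp(-job_duration_hours / mtbf_hours)
    expected_failures = job_duration_hours / mtbf_hours

    # Young's first-order optimal checkpoint interval.
    interval_s = math.sqrt(2.0 * checkpoint_time_s * mtbf_s)

    # First-order waste: time spent writing checkpoints plus, on average,
    # half a checkpoint period of lost work per failure. Restart time is ignored.
    period_s = interval_s + checkpoint_time_s
    waste = checkpoint_time_s / period_s + period_s / (2.0 * mtbf_s)
    goodput = min(1.0, max(0.0, 1.0 - waste))

    return {
        "cluster_mtbf_hours": mtbf_hours,
        "failure_probability": failure_probability,
        "expected_failures": expected_failures,
        "checkpoint_interval_s": interval_s,
        "goodput_fraction": goodput,
    }


# Hypothetical 30-day job on 10,000 nodes with 50,000 h MTBF per node.
metrics = reliability_metrics(10_000, 50_000.0, job_duration_hours=720.0)
for name, value in metrics.items():
    print(f"{name}: {value:.4g}")
```

At this scale a failure during the job is practically certain, which is why the checkpoint interval, rather than the failure probability alone, drives the goodput estimate.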