cs249r_book/mlsysim/docs/api/core.solver.ReliabilityModel.qmd
Vijay Janapa Reddi 611de228d9 fix(mlsysim): align docs with *Model naming convention
The solver.py refactoring renamed most solver classes from *Solver to
*Model (e.g. DistributedSolver → DistributedModel). The docs still
referenced the old names, causing the Quarto site build to fail with:
  ImportError: cannot import name 'DistributedSolver' from 'mlsysim'

- Fix executable code cells in tutorials/distributed.qmd
- Update non-executable code examples across 10 doc files
- Rename 19 API reference files from *Solver.qmd to *Model.qmd
- SensitivitySolver and SynthesisSolver retain their names (correct)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 08:39:11 -04:00


# core.solver.ReliabilityModel { #mlsysim.core.solver.ReliabilityModel }
```python
core.solver.ReliabilityModel()
```
Calculates Mean Time Between Failures (MTBF) and optimal checkpointing intervals.
This model estimates the reliability of large clusters to help determine
the goodput of long-running training jobs. It computes the probability
that a job fails before completion and calculates the Young-Daly optimal
checkpoint interval, which minimizes wasted compute time.
Literature Sources:

1. Young (1974), "A First-Order Approximation to the Optimum Checkpoint Interval."
2. Daly (2006), "A Higher Order Estimate of the Optimum Checkpoint Interval for Restart Dumps."
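
For orientation, the quantities named in this description are commonly written as below. This is a sketch under textbook assumptions (independent, exponentially distributed node failures), not a statement of how `ReliabilityModel` computes its outputs. The symbols are introduced here for illustration only: $N$ is the number of nodes, $M_{\text{node}}$ the per-node MTBF, $M$ the cluster MTBF, $T$ the job duration, $\delta$ the time to write one checkpoint, and $\tau$ the compute time between checkpoints.

$$
M = \frac{M_{\text{node}}}{N}, \qquad
P_{\text{fail}}(T) = 1 - e^{-T/M}, \qquad
\tau_{\text{Young}} = \sqrt{2\,\delta\,M}
$$

Daly's higher-order estimate refines Young's interval when $\delta$ is not negligible compared to $M$:

$$
\tau_{\text{Daly}} =
\begin{cases}
\sqrt{2\,\delta\,M}\left[1 + \dfrac{1}{3}\sqrt{\dfrac{\delta}{2M}} + \dfrac{1}{9}\,\dfrac{\delta}{2M}\right] - \delta, & \delta < 2M \\[2ex]
M, & \delta \ge 2M
\end{cases}
$$

When $\delta \ll M$, the bracketed correction approaches 1 and the two estimates nearly coincide.
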
## Methods
| Name | Description |
| --- | --- |
| [solve](#mlsysim.core.solver.ReliabilityModel.solve) | Calculates reliability and checkpointing metrics for a fleet. |
### solve { #mlsysim.core.solver.ReliabilityModel.solve }
```python
core.solver.ReliabilityModel.solve(
    fleet,
    job_duration_hours,
    checkpoint_time_s=60.0,
)
```
Calculates reliability and checkpointing metrics for a fleet.
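
To give a feel for the kind of numbers such a calculation produces, the following standalone sketch evaluates the first-order Young formula with plain Python. It does not call `mlsysim` and is not the library's implementation. The function name `reliability_sketch` and the inputs `num_nodes` and `node_mtbf_hours` are hypothetical stand-ins for whatever the `fleet` argument provides, whose structure is not documented here. Only `job_duration_hours` and `checkpoint_time_s` mirror the signature above. Failures are assumed to be independent and exponentially distributed.

```python
import math


def reliability_sketch(num_nodes, node_mtbf_hours, job_duration_hours,
                       checkpoint_time_s=60.0):
    """Illustrative first-order reliability estimate (hypothetical helper)."""
    if num_nodes <= 0 or node_mtbf_hours <= 0:
        raise ValueError("num_nodes and node_mtbf_hours must be positive")

    # Assumption: independent node failures, so the cluster fails N times
    # as often as a single node.
    cluster_mtbf_hours = node_mtbf_hours / num_nodes

    # Probability of at least one failure during the job, assuming
    # exponentially distributed time between failures.
    failure_probability = 1.0 - math.exp(-job_duration_hours / cluster_mtbf_hours)

    # Young's first-order optimum: tau = sqrt(2 * delta * M).
    checkpoint_time_hours = checkpoint_time_s / 3600.0
    optimal_interval_hours = math.sqrt(
        2.0 * checkpoint_time_hours * cluster_mtbf_hours
    )

    # First-order overhead: time spent writing checkpoints plus the expected
    # rework after a failure (about half an interval per failure).
    overhead_fraction = (
        checkpoint_time_hours / optimal_interval_hours
        + optimal_interval_hours / (2.0 * cluster_mtbf_hours)
    )

    return {
        "cluster_mtbf_hours": cluster_mtbf_hours,
        "failure_probability": failure_probability,
        "optimal_interval_hours": optimal_interval_hours,
        "overhead_fraction": overhead_fraction,
    }


# Example inputs (made up for illustration): 1,000 nodes, each with a
# 50,000-hour MTBF, running a 100-hour job with 60-second checkpoints.
result = reliability_sketch(1000, 50_000.0, 100.0, checkpoint_time_s=60.0)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

Under these assumptions, one minus the overhead fraction gives a rough goodput estimate; the approximation is only meaningful while the checkpoint time is small relative to the cluster MTBF.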