cs249r_book/mlsysim/docs/api/core.solver.ReliabilitySolver.qmd
Vijay Janapa Reddi 2bbe3e1a69 docs(mlsysim): redesign website, add 12 tutorials, and CLI entry points
Replace 9 old tutorials with 12 new numbered tutorials (00-11) covering
roofline through full-stack audit. Redesign landing page, add
models-and-solvers and extending-the-engine guides. Add __main__.py,
cli.py, and cli/ package for command-line interface.
2026-03-12 16:04:51 -04:00


# core.solver.ReliabilityModel { #mlsysim.core.solver.ReliabilityModel }
```python
core.solver.ReliabilityModel()
```
Calculates Mean Time Between Failures (MTBF) and optimal checkpointing intervals.

This solver models the reliability of large clusters to help determine the
'Goodput' of long-running training jobs. It estimates the probability that
a job fails before completion and calculates the Young-Daly optimal
checkpoint interval, which minimizes wasted compute time.

Literature sources:

1. Young (1974), "A First-Order Approximation to the Optimum Checkpoint
   Interval."
2. Daly (2006), "A Higher Order Estimate of the Optimum Checkpoint
   Interval for Restart Dumps."
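
The quantities named in the description can be illustrated with a minimal sketch. It is not the library's implementation: the helper names, the fleet size, and the per-node MTBF are hypothetical, and it assumes independent, exponentially distributed node failures and Young's first-order formula rather than Daly's higher-order correction.

```python
import math


def cluster_mtbf_hours(node_mtbf_hours: float, num_nodes: int) -> float:
    """Fleet MTBF, assuming independent, exponentially distributed node failures."""
    return node_mtbf_hours / num_nodes


def failure_probability(job_duration_hours: float, mtbf_hours: float) -> float:
    """Probability of at least one failure before the job completes."""
    return 1.0 - math.exp(-job_duration_hours / mtbf_hours)


def young_interval_s(checkpoint_time_s: float, mtbf_hours: float) -> float:
    """Young's first-order optimum, sqrt(2 * checkpoint cost * MTBF), in seconds."""
    return math.sqrt(2.0 * checkpoint_time_s * mtbf_hours * 3600.0)


# Hypothetical fleet: 1,000 nodes, each with a 50,000-hour MTBF.
mtbf = cluster_mtbf_hours(node_mtbf_hours=50_000.0, num_nodes=1_000)
p_fail = failure_probability(job_duration_hours=100.0, mtbf_hours=mtbf)
interval = young_interval_s(checkpoint_time_s=60.0, mtbf_hours=mtbf)

print(f"Fleet MTBF: {mtbf:.1f} h")
print(f"P(failure within job): {p_fail:.3f}")
print(f"Checkpoint interval: {interval / 60:.1f} min")
```

Under these assumptions the fleet MTBF shrinks linearly with node count, which is why a 100-hour job on this hypothetical fleet is more likely than not to see at least one failure.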
## Methods
| Name | Description |
| --- | --- |
| [solve](#mlsysim.core.solver.ReliabilityModel.solve) | Calculates reliability and checkpointing metrics for a fleet. |
### solve { #mlsysim.core.solver.ReliabilityModel.solve }
```python
core.solver.ReliabilityModel.solve(
fleet,
job_duration_hours,
checkpoint_time_s=60.0,
)
```
Calculates reliability and checkpointing metrics for a fleet.
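
The page does not list the metrics that `solve` returns. As a rough illustration of why the checkpoint interval matters for goodput, the sketch below uses the standard first-order overhead model: time spent writing checkpoints plus the expected rework after a failure. The function name `checkpoint_overhead` and the 50-hour fleet MTBF are hypothetical and are not part of `mlsysim`.

```python
import math


def checkpoint_overhead(interval_s: float, checkpoint_time_s: float, mtbf_s: float) -> float:
    """First-order fraction of wall-clock time lost to checkpoints and rework."""
    write_cost = checkpoint_time_s / interval_s  # share of time spent writing checkpoints
    rework_cost = interval_s / (2.0 * mtbf_s)    # on average, half an interval is redone per failure
    return write_cost + rework_cost


mtbf_s = 50.0 * 3600.0    # hypothetical fleet MTBF of 50 hours
checkpoint_time_s = 60.0  # same value as the default in the signature above
best = math.sqrt(2.0 * checkpoint_time_s * mtbf_s)  # Young's first-order optimum

# Compare checkpointing too often, at the optimum, and too rarely.
for interval_s in (best / 4, best, best * 4):
    waste = checkpoint_overhead(interval_s, checkpoint_time_s, mtbf_s)
    print(f"interval {interval_s / 60:6.1f} min -> approximate goodput {1.0 - waste:.3f}")
```

In this model the overhead at the optimum equals `sqrt(2 * checkpoint_time_s / mtbf_s)`, and it rises on either side: shorter intervals spend more time writing checkpoints, longer ones lose more work per failure.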