# core.solver.ReliabilityModel { #mlsysim.core.solver.ReliabilityModel }

```python
core.solver.ReliabilityModel()
```

Calculates Mean Time Between Failures (MTBF) and optimal checkpointing intervals.

This solver handles the reliability modeling of massive clusters, helping determine the 'Goodput' of long-running training jobs. It identifies the probability of a job failure before completion and calculates the Young-Daly optimal interval to minimize wasted compute time.

Literature Source:

1. Young (1974), "A First-Order Approximation to the Optimum Checkpoint Interval."
2. Daly (2006), "A Higher Order Estimate of the Optimum Checkpoint Interval for Restart-Dump Strategy."

## Methods

| Name | Description |
| --- | --- |
| [solve](#mlsysim.core.solver.ReliabilityModel.solve) | Calculates reliability and checkpointing metrics for a fleet. |

### solve { #mlsysim.core.solver.ReliabilityModel.solve }

```python
core.solver.ReliabilityModel.solve(
    fleet,
    job_duration_hours,
    checkpoint_time_s=60.0,
)
```

Calculates reliability and checkpointing metrics for a fleet.
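
The quantities named above can be illustrated with a minimal, standalone sketch. It is not the library's implementation: the helper names, the assumption of independent node failures with exponentially distributed lifetimes, and the example numbers (1,000 nodes, 50,000-hour per-node MTBF) are all hypothetical. It uses Young's first-order formula for the checkpoint interval, `sqrt(2 * checkpoint_time * MTBF)`.

```python
import math


def cluster_mtbf_hours(node_mtbf_hours: float, num_nodes: int) -> float:
    """Cluster MTBF, assuming independent nodes with identical failure rates."""
    return node_mtbf_hours / num_nodes


def failure_probability(job_duration_hours: float, mtbf_hours: float) -> float:
    """Probability of at least one failure during the job (exponential model)."""
    return 1.0 - math.exp(-job_duration_hours / mtbf_hours)


def young_daly_interval_s(checkpoint_time_s: float, mtbf_hours: float) -> float:
    """First-order optimal checkpoint interval in seconds: sqrt(2 * delta * M)."""
    mtbf_s = mtbf_hours * 3600.0
    return math.sqrt(2.0 * checkpoint_time_s * mtbf_s)


# Hypothetical fleet: 1,000 nodes, each with a 50,000-hour MTBF.
mtbf = cluster_mtbf_hours(node_mtbf_hours=50_000.0, num_nodes=1_000)
p_fail = failure_probability(job_duration_hours=100.0, mtbf_hours=mtbf)
interval_s = young_daly_interval_s(checkpoint_time_s=60.0, mtbf_hours=mtbf)

print(f"Cluster MTBF: {mtbf:.1f} h")
print(f"P(failure during job): {p_fail:.3f}")
print(f"Checkpoint interval: {interval_s / 60.0:.1f} min")
```

The first-order formula is a reasonable approximation only when the checkpoint time is much smaller than the MTBF; Daly's higher-order estimate refines it when that assumption weakens.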