
# core.solver.CheckpointModel { #mlsysim.core.solver.CheckpointModel }

```python
core.solver.CheckpointModel()
```

Analyzes the storage constraints and I/O burst penalties of saving model states.

Training massive models requires periodically saving hundreds of gigabytes
(weights + optimizer states) to persistent storage. This solver estimates the
time training spends blocked on checkpoint I/O and reports it as a reduction
in the cluster's Model FLOPs Utilization (MFU).

Literature Source:
1. Eisenman et al. (2022), "Check-N-Run: A Checkpointing System for
   Training Deep Learning Recommendation Models."
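
The size and write-time estimate described above can be sketched in a few lines. This is a minimal illustration, not the solver's implementation: the `BYTES_PER_PARAM` table, the helper names `checkpoint_size_gb` and `write_time_s`, and the 2 GB/s storage bandwidth are all assumptions chosen for the example, and the per-parameter byte counts assume full fp32 state.

```python
# Hypothetical bytes-per-parameter table (an assumption for this sketch,
# not necessarily the values mlsysim uses internally).
BYTES_PER_PARAM = {
    "adam": 12,  # fp32 weights (4) + two fp32 moment estimates (4 + 4)
    "sgd": 4,    # fp32 weights only, no optimizer state
}


def checkpoint_size_gb(num_params: float, optimizer: str = "adam") -> float:
    """Return the checkpoint size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[optimizer] / 1e9


def write_time_s(size_gb: float, storage_bw_gb_per_s: float) -> float:
    """Return the seconds needed to write the checkpoint at a sustained bandwidth."""
    if storage_bw_gb_per_s <= 0:
        raise ValueError("storage bandwidth must be positive")
    return size_gb / storage_bw_gb_per_s


# Example: a 70B-parameter model written to storage sustaining 2 GB/s (assumed).
size = checkpoint_size_gb(70e9, "adam")
print(f"checkpoint size: {size:.0f} GB")
print(f"write time: {write_time_s(size, 2.0):.0f} s")
```

Under these assumptions, switching from Adam to plain SGD cuts the checkpoint to a third of its size, because only the weights are saved.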

## Methods

| Name | Description |
| --- | --- |
| [solve](#mlsysim.core.solver.CheckpointModel.solve) | Solves for checkpoint size, write time, and resulting MFU penalty. |

### solve { #mlsysim.core.solver.CheckpointModel.solve }

```python
core.solver.CheckpointModel.solve(
    model,
    hardware,
    optimizer='adam',
    checkpoint_interval_hours=4.0,
)
```

Solves for checkpoint size, write time, and resulting MFU penalty.

#### Parameters {.doc-section .doc-section-parameters}

| Name | Type | Description | Default |
|------|------|-------------|---------|
| model | Workload | The model architecture. | _required_ |
| hardware | HardwareNode | The target hardware node, which supplies the storage bandwidth. | _required_ |
| optimizer | str | Optimizer type (`'adam'` or `'sgd'`); determines the bytes stored per parameter. | `'adam'` |
| checkpoint_interval_hours | float | Hours between checkpoints. | `4.0` |
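
To show how `checkpoint_interval_hours` affects the result, the sketch below treats the penalty as the fraction of each checkpoint interval spent blocked on the write. That formula, the helper name `mfu_penalty_pct`, and the 420-second write time are assumptions for illustration. The solver may define the penalty differently, for example by accounting for asynchronous writes.

```python
def mfu_penalty_pct(write_time_s: float, checkpoint_interval_hours: float = 4.0) -> float:
    """Return the percentage of each checkpoint interval spent blocked on I/O.

    Assumes a fully synchronous checkpoint: training stalls for the whole write.
    """
    interval_s = checkpoint_interval_hours * 3600.0
    if interval_s <= 0:
        raise ValueError("checkpoint interval must be positive")
    # Cap at 100%: the write cannot block more time than the interval contains.
    return min(100.0, 100.0 * write_time_s / interval_s)


# Assumed 420 s blocking write, swept over three checkpoint intervals.
for hours in (0.5, 1.0, 4.0):
    print(f"{hours:>4} h interval -> {mfu_penalty_pct(420.0, hours):.2f}% penalty")
```

Shorter intervals lose less work after a failure but raise the steady-state penalty, so the interval is a trade-off rather than a value to minimize.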

#### Returns {.doc-section .doc-section-returns}

| Name | Type | Description |
|------|------|-------------|
| | CheckpointResult | Checkpoint size (GB), write time, storage bottleneck flag, and MFU penalty percentage. |