mirror of https://github.com/harvard-edge/cs249r_book.git
synced 2026-05-07 18:18:42 -05:00

The solver.py refactoring renamed most solver classes from *Solver to *Model (e.g. DistributedSolver → DistributedModel). The docs still referenced the old names, causing the Quarto site build to fail with: ImportError: cannot import name 'DistributedSolver' from 'mlsysim'

- Fix executable code cells in tutorials/distributed.qmd
- Update non-executable code examples across 10 doc files
- Rename 19 API reference files from *Solver.qmd to *Model.qmd
- SensitivitySolver and SynthesisSolver retain their names (correct)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# core.solver.WeightStreamingModel { #mlsysim.core.solver.WeightStreamingModel }

```python
core.solver.WeightStreamingModel()
```
Analyzes Wafer-Scale inference (e.g., Cerebras CS-3) using Weight Streaming.

Instead of holding weights in HBM and streaming activations (the GPU Memory Wall), this architecture holds massive activation batches on-wafer (SRAM) and streams the model weights from external MemoryX nodes.

The bottleneck shifts from Memory Bandwidth to Injection Interconnect Bandwidth.

Literature Source:

1. Lie et al. (2022), "Cerebras Architecture Deep Dive: First Look Inside the Hardware/Software Co-Design for Deep Learning."
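As a rough illustration of that bottleneck shift (a sketch only — the numbers and function below are hypothetical placeholders, not taken from the solver or from CS-3 specifications), per-layer latency is governed by the slower of compute and weight injection:

```python
# Illustrative sketch of the Weight Streaming bottleneck shift.
# All hardware numbers are hypothetical placeholders.

def layer_time(flops, weight_bytes, peak_flops, injection_bw, efficiency=0.5):
    """Per-layer latency: weights are streamed onto the wafer while the
    layer's math executes, so the slower of the two dominates."""
    compute_s = flops / (peak_flops * efficiency)   # time to do the math
    stream_s = weight_bytes / injection_bw          # time to inject the weights
    if compute_s >= stream_s:
        return compute_s, "compute"
    return stream_s, "interconnect"

# Hypothetical layer: 8 TFLOP of work, 400 MB of fp16 weights.
t, bottleneck = layer_time(
    flops=8e12,
    weight_bytes=400e6,
    peak_flops=1e15,    # placeholder peak compute (FLOP/s)
    injection_bw=1e12,  # placeholder injection bandwidth (bytes/s)
)
print(bottleneck)  # -> compute (0.016 s of math vs. 0.0004 s of streaming)
```

Shrinking the compute or growing the weights flips the same layer to interconnect-bound, which is the regime this model is built to analyze.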
## Methods

| Name | Description |
| --- | --- |
| [solve](#mlsysim.core.solver.WeightStreamingModel.solve) | Solves for throughput under Weight Streaming physics. |
### solve { #mlsysim.core.solver.WeightStreamingModel.solve }

```python
core.solver.WeightStreamingModel.solve(
    model,
    hardware,
    seq_len,
    batch_size=1,
    precision='fp16',
    efficiency=0.5,
)
```

Solves for throughput under Weight Streaming physics.
#### Parameters {.doc-section .doc-section-parameters}

| Name | Type | Description | Default |
|------|------|-------------|---------|
| model | TransformerWorkload | The LLM model architecture. | _required_ |
| hardware | HardwareNode | The wafer-scale hardware (e.g., Cerebras CS-3). | _required_ |
| seq_len | int | Sequence length for KV cache sizing. | _required_ |
| batch_size | int | Number of sequences processed concurrently. | `1` |
| precision | str | Numerical format (fp16, int8, int4). | `'fp16'` |
| efficiency | float | Compute utilization efficiency (0.0 to 1.0). | `0.5` |
#### Returns {.doc-section .doc-section-returns}

| Name | Type | Description |
|------|------|-------------|
| | WeightStreamingResult | Feasibility, throughput (tokens/s), bottleneck (compute vs. interconnect), layer timing, optimal batch size, and SRAM utilization. |
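The "optimal batch size" in the result reflects a core property of Weight Streaming: the weights are injected once per step regardless of batch size, so a larger on-wafer batch amortizes that fixed cost until compute becomes the limit. A minimal sketch of this effect (hypothetical numbers and a hand-rolled throughput function, independent of the actual solver internals):

```python
# Sketch of batch amortization under Weight Streaming.
# All model/hardware numbers are hypothetical placeholders.

def tokens_per_second(batch_size, flops_per_token, weight_bytes,
                      peak_flops, injection_bw, efficiency=0.5):
    """Throughput when weights are streamed once per batch step:
    the batch shares the fixed weight-injection cost."""
    compute_s = batch_size * flops_per_token / (peak_flops * efficiency)
    stream_s = weight_bytes / injection_bw  # paid once, any batch size
    return batch_size / max(compute_s, stream_s)

# Hypothetical ~1T-parameter model: 2 TB of weights, 2 TFLOP per token.
cfg = dict(flops_per_token=2e12, weight_bytes=2e12,
           peak_flops=1e15, injection_bw=1e12)
for b in (1, 256, 1024):
    print(b, tokens_per_second(b, **cfg))
# -> 1 0.5
#    256 128.0
#    1024 250.0   (plateau: compute-bound past ~batch 500)
```

Throughput rises linearly while interconnect-bound and plateaus once compute-bound; the crossover batch size is what the solver reports as optimal.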