# core.solver.WeightStreamingModel { #mlsysim.core.solver.WeightStreamingModel }

```python
core.solver.WeightStreamingModel()
```

Analyzes wafer-scale inference (e.g., Cerebras CS-3) using Weight Streaming. Instead of holding weights in HBM and streaming activations through the chip (the GPU "memory wall" regime), this architecture keeps large activation batches in on-wafer SRAM and streams the model weights in from external MemoryX nodes. The bottleneck therefore shifts from memory bandwidth to interconnect injection bandwidth.

Literature Source:

1. Lie et al. (2022), "Cerebras Architecture Deep Dive: First Look Inside the Hardware/Software Co-Design for Deep Learning."

## Methods

| Name | Description |
| --- | --- |
| [solve](#mlsysim.core.solver.WeightStreamingModel.solve) | Solves for throughput under Weight Streaming physics. |

### solve { #mlsysim.core.solver.WeightStreamingModel.solve }

```python
core.solver.WeightStreamingModel.solve(
    model,
    hardware,
    seq_len,
    batch_size=1,
    precision='fp16',
    efficiency=0.5,
)
```

Solves for throughput under Weight Streaming physics.

#### Parameters {.doc-section .doc-section-parameters}

| Name | Type | Description | Default |
|------|------|-------------|---------|
| model | TransformerWorkload | The LLM model architecture. | _required_ |
| hardware | HardwareNode | The wafer-scale hardware (e.g., Cerebras CS-3). | _required_ |
| seq_len | int | Sequence length for KV cache sizing. | _required_ |
| batch_size | int | Number of sequences processed concurrently. | `1` |
| precision | str | Numerical format (fp16, int8, int4). | `'fp16'` |
| efficiency | float | Compute utilization efficiency (0.0 to 1.0). | `0.5` |

#### Returns {.doc-section .doc-section-returns}

| Name | Type | Description |
|------|------|-------------|
| | WeightStreamingResult | Feasibility, throughput (tokens/s), bottleneck (compute vs. interconnect), layer timing, optimal batch size, and SRAM utilization. |
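To illustrate the compute-versus-interconnect trade-off the solver reasons about, here is a minimal, self-contained sketch. All names (`layer_time`, `LayerEstimate`, `injection_bw`, and the numeric values) are illustrative assumptions, not part of the `mlsysim` API: it assumes per-layer time is the maximum of compute time and the time to stream that layer's weights over the injection interconnect.

```python
# Hypothetical sketch of the Weight Streaming roofline: per-layer time is
# max(compute time, weight-streaming time). Names and numbers are illustrative
# and not part of the mlsysim API.
from dataclasses import dataclass

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


@dataclass
class LayerEstimate:
    compute_s: float  # time to execute the layer's FLOPs on-wafer
    stream_s: float   # time to stream the layer's weights from MemoryX

    @property
    def bottleneck(self) -> str:
        return "compute" if self.compute_s >= self.stream_s else "interconnect"


def layer_time(params_per_layer: float,
               flops_per_layer: float,
               peak_flops: float,
               injection_bw: float,      # bytes/s from MemoryX into the wafer
               precision: str = "fp16",
               efficiency: float = 0.5) -> LayerEstimate:
    """Weights are streamed once per layer pass, so whichever of compute or
    streaming is slower dominates the layer's wall-clock time."""
    compute_s = flops_per_layer / (peak_flops * efficiency)
    stream_s = params_per_layer * BYTES_PER_PARAM[precision] / injection_bw
    return LayerEstimate(compute_s, stream_s)


# Toy numbers: a 1e9-parameter layer, ~2 FLOPs per parameter per token.
# With 4096 tokens in flight the layer is compute-bound; with 1 token it is
# interconnect-bound, which is why large on-wafer batches are essential.
big_batch = layer_time(1e9, 2 * 1e9 * 4096, peak_flops=1e15, injection_bw=1.2e12)
tiny_batch = layer_time(1e9, 2 * 1e9 * 1, peak_flops=1e15, injection_bw=1.2e12)
print(big_batch.bottleneck, tiny_batch.bottleneck)  # → compute interconnect
```

This mirrors why `solve` reports an optimal batch size: enlarging the on-wafer batch amortizes the fixed weight-streaming cost per layer until compute, not the interconnect, becomes the limit.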