mirror of https://github.com/harvard-edge/cs249r_book.git, synced 2026-05-07 18:18:42 -05:00
# core.solver.ServingModel { #mlsysim.core.solver.ServingModel }

```python
core.solver.ServingModel()
```

Analyzes the two-phase LLM serving lifecycle: Pre-fill vs. Decoding.

LLM inference is not a single mathematical operation; it is a stateful process with two distinct physical regimes (Compute-bound Pre-fill and Memory-bound Decoding).

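The two regimes fall directly out of a roofline-style arithmetic-intensity argument. A minimal sketch (illustrative only, not part of the `mlsysim` API) using the common 2 × params FLOPs-per-token rule for a dense transformer:

```python
# Sketch: arithmetic intensity (FLOPs per byte of weight traffic) for the
# two serving phases of a hypothetical 7B-parameter fp16 model.

def arithmetic_intensity(params, tokens_processed, bytes_per_param=2):
    """FLOPs per byte of weight traffic for one forward pass step."""
    flops = 2 * params * tokens_processed   # matmul FLOPs for this step
    bytes_moved = params * bytes_per_param  # weights are read once per step
    return flops / bytes_moved

params = 7e9  # hypothetical 7B-parameter model

# Pre-fill: the whole prompt (say, 2048 tokens) is processed in one pass,
# so each weight read is amortized over many tokens -> compute-bound.
prefill_ai = arithmetic_intensity(params, tokens_processed=2048)

# Decode: one new token per step at batch size 1, so the full weight set
# is re-read to produce a single token -> memory-bound.
decode_ai = arithmetic_intensity(params, tokens_processed=1)

print(prefill_ai, decode_ai)  # → 2048.0 1.0
```

The three-orders-of-magnitude gap in arithmetic intensity is why the same model on the same hardware is compute-bound during pre-fill and memory-bandwidth-bound during decoding.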
Literature Sources:

1. Pope et al. (2023), "Efficiently Scaling Transformer Inference" (Inference Bottlenecks)
2. Aminabadi et al. (2022), "DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale"
3. Yu et al. (2022), "Orca: A Distributed Serving System for Transformer-Based Generative Models"

## Methods

| Name | Description |
| --- | --- |
| [solve](#mlsysim.core.solver.ServingModel.solve) | Solves for LLM serving performance. |

### solve { #mlsysim.core.solver.ServingModel.solve }

```python
core.solver.ServingModel.solve(
    model,
    hardware,
    seq_len,
    batch_size=1,
    precision='fp16',
    efficiency=0.5,
)
```

Solves for LLM serving performance.
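The page does not show the solver's internals, but a per-phase estimate of this kind is typically a roofline bound: latency is set by whichever resource, compute or memory traffic, is the bottleneck, derated by an `efficiency` factor like the signature's `efficiency=0.5` default. A hypothetical sketch (the actual `mlsysim` implementation may differ; the hardware numbers are illustrative A100-like values):

```python
# Hypothetical roofline-style step-time estimate; NOT the mlsysim internals.

def estimate_step_time(flops, bytes_moved, peak_flops, bandwidth,
                       efficiency=0.5):
    """Latency is set by the slower of compute and memory, derated."""
    compute_time = flops / (peak_flops * efficiency)
    memory_time = bytes_moved / (bandwidth * efficiency)
    return max(compute_time, memory_time)

# Illustrative A100-like hardware: 312 TFLOP/s fp16, 2 TB/s HBM.
peak, bw = 312e12, 2e12
params = 7e9  # hypothetical 7B model, fp16 weights (2 bytes each)

# Pre-fill of a 2048-token prompt: compute dominates.
prefill_t = estimate_step_time(2 * params * 2048, params * 2, peak, bw)

# One decode step at batch size 1: weight traffic dominates.
decode_t = estimate_step_time(2 * params, params * 2, peak, bw)
```

Under these assumptions the decode step takes roughly `params * 2 / (bw * 0.5)` seconds regardless of compute throughput, which is why batching and KV-cache-aware scheduling (as in Orca) are the main levers for decode throughput.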