cs249r_book/mlsysim/docs/api/core.solver.ServingModel.qmd
Vijay Janapa Reddi 611de228d9 fix(mlsysim): align docs with *Model naming convention
The solver.py refactoring renamed most solver classes from *Solver to
*Model (e.g. DistributedSolver → DistributedModel). The docs still
referenced the old names, causing the Quarto site build to fail with:
  ImportError: cannot import name 'DistributedSolver' from 'mlsysim'

- Fix executable code cells in tutorials/distributed.qmd
- Update non-executable code examples across 10 doc files
- Rename 19 API reference files from *Solver.qmd to *Model.qmd
- SensitivitySolver and SynthesisSolver retain their names (correct)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 08:39:11 -04:00


# core.solver.ServingModel { #mlsysim.core.solver.ServingModel }
```python
core.solver.ServingModel()
```
Analyzes the two-phase LLM serving lifecycle: pre-fill vs. decoding.
LLM inference is not a single mathematical operation; it is a stateful
process with two distinct physical regimes: a compute-bound pre-fill
phase that processes the whole prompt in parallel, and a memory-bound
decoding phase that generates one token at a time.
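Why the two phases land on opposite sides of the roofline can be seen from their arithmetic intensity (FLOPs per byte of weights moved). The sketch below uses hypothetical, loosely A100-class hardware numbers and a 7B-parameter fp16 model; none of these figures come from mlsysim itself.

```python
# Arithmetic-intensity comparison of the two serving phases.
# Hardware numbers are illustrative (roughly A100-like), not from mlsysim.
PEAK_FLOPS = 312e12           # peak fp16 compute, FLOP/s
MEM_BW = 2.0e12               # HBM bandwidth, bytes/s
RIDGE = PEAK_FLOPS / MEM_BW   # FLOPs/byte needed to stay compute-bound

params = 7e9                  # model parameter count
seq_len = 2048                # prompt length
bytes_per_param = 2           # fp16

# Pre-fill: every prompt token reuses the same weights, so intensity
# grows with seq_len and typically sits far above the ridge point.
prefill_intensity = (2 * params * seq_len) / (params * bytes_per_param)

# Decode: one token per step means each weight byte feeds roughly one
# multiply-add, so intensity is ~1 FLOP/byte, far below the ridge.
decode_intensity = (2 * params) / (params * bytes_per_param)

print(prefill_intensity > RIDGE)   # compute-bound pre-fill
print(decode_intensity < RIDGE)    # memory-bound decode
```

With these numbers the ridge point is 156 FLOPs/byte: pre-fill lands at 2048 FLOPs/byte (compute-bound) while decode sits at 1 FLOP/byte (memory-bound), which is the regime split the solver models.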
Literature Source:
1. Pope et al. (2023), "Efficiently Scaling Transformer Inference"
(Inference Bottlenecks)
2. Aminabadi et al. (2022), "DeepSpeed-Inference: Enabling Efficient
Inference of Transformer Models at Unprecedented Scale."
3. Yu et al. (2022), "ORCA: A Distributed Serving System for
Transformer-Based Generative Models."
## Methods
| Name | Description |
| --- | --- |
| [solve](#mlsysim.core.solver.ServingModel.solve) | Solves for LLM serving performance. |
### solve { #mlsysim.core.solver.ServingModel.solve }
```python
core.solver.ServingModel.solve(
    model,
    hardware,
    seq_len,
    batch_size=1,
    precision='fp16',
    efficiency=0.5,
)
```
Solves for LLM serving performance.
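To make the parameters concrete, here is a hedged, self-contained sketch that mirrors `solve()`'s signature with a roofline-style estimate. Everything inside is an assumption for illustration: the function name `serving_estimate`, the flat `params`/`peak_flops`/`mem_bw` inputs standing in for the `model` and `hardware` objects, and the reading of `efficiency` as a derating factor on peak compute. It is not mlsysim's actual implementation.

```python
def serving_estimate(params, peak_flops, mem_bw,
                     seq_len, batch_size=1, precision="fp16",
                     efficiency=0.5):
    """Hypothetical roofline-style estimate mirroring solve()'s signature.

    params: model parameter count (stands in for `model`).
    peak_flops, mem_bw: hardware peaks (stand in for `hardware`).
    efficiency: assumed to derate peak compute, as the default 0.5 hints.
    """
    bytes_per_param = {"fp32": 4, "fp16": 2, "int8": 1}[precision]
    achievable_flops = efficiency * peak_flops

    # Pre-fill: ~2*P FLOPs per prompt token, compute-bound over the batch.
    prefill_s = 2 * params * seq_len * batch_size / achievable_flops
    # Decode: each step streams the full weight set once, memory-bound.
    decode_s_per_token = params * bytes_per_param / mem_bw

    return {"prefill_s": prefill_s,
            "decode_s_per_token": decode_s_per_token}

est = serving_estimate(params=7e9, peak_flops=312e12, mem_bw=2.0e12,
                       seq_len=2048, batch_size=1)
```

Under these assumed numbers, the estimate shows the characteristic shape of LLM serving: a single longer pre-fill latency, then a fixed memory-bandwidth-limited cost per generated token.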