# core.solver.ContinuousBatchingModel { #mlsysim.core.solver.ContinuousBatchingModel }

```python
core.solver.ContinuousBatchingModel()
```

Analyzes production LLM serving with Continuous Batching and PagedAttention.

Traditional static batching suffers from severe memory fragmentation and padding waste. This model captures the throughput improvements achieved by iteration-level scheduling and non-contiguous KV cache allocation.
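
As context for the `page_size` parameter below, the sketch illustrates the paging arithmetic that PagedAttention-style allocation implies: KV cache is granted in fixed pages of `page_size` tokens, so per-sequence padding waste is bounded by one partial page rather than a full static reservation. The helper names, the per-token byte formula, and the 7B-class model configuration are illustrative assumptions, not the library's internals.

```python
import math

def kv_pages(seq_len: int, page_size: int = 16) -> int:
    """Pages a single sequence occupies; waste is at most one partial page."""
    return math.ceil(seq_len / page_size)

def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """KV cache bytes per token: keys + values, across all layers (fp16 = 2 bytes)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Illustrative 7B-class configuration: 32 layers, 32 KV heads, head_dim 128.
pages = kv_pages(1000, page_size=16)               # 63 pages
padding = pages * 16 - 1000                        # only 8 tokens of padding
mib = pages * 16 * kv_bytes_per_token(32, 32, 128) / 2**20
print(pages, padding, f"{mib:.0f} MiB")            # 63 8 504 MiB for one sequence
```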

Literature Sources:

1. Kwon et al. (2023), "Efficient Memory Management for Large Language Model Serving with PagedAttention."
2. Yu et al. (2022), "ORCA: A Distributed Serving System for Transformer-Based Generative Models."

## Methods

| Name | Description |
| --- | --- |
| [solve](#mlsysim.core.solver.ContinuousBatchingModel.solve) | Solves for continuous batching throughput and PagedAttention memory. |

### solve { #mlsysim.core.solver.ContinuousBatchingModel.solve }

```python
core.solver.ContinuousBatchingModel.solve(
    model,
    hardware,
    seq_len,
    max_batch_size=1,
    page_size=16,
    precision='fp16',
    efficiency=0.5,
)
```

Solves for continuous batching throughput and PagedAttention memory.

#### Parameters {.doc-section .doc-section-parameters}

| Name | Type | Description | Default |
|------|------|-------------|---------|
| model | TransformerWorkload | The LLM model architecture. | _required_ |
| hardware | HardwareNode | The target hardware for inference. | _required_ |
| seq_len | int | The total context window (prompt + generated tokens). | _required_ |
| max_batch_size | int | Maximum concurrent requests in the batch. | `1` |
| page_size | int | Tokens per KV cache page (PagedAttention granularity). | `16` |
| precision | str | Numerical format (fp16, int8, int4). | `'fp16'` |
| efficiency | float | Compute utilization efficiency (0.0 to 1.0). | `0.5` |
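
The `efficiency` parameter discounts peak hardware throughput to the fraction actually achieved. Below is a minimal sketch of the roofline-style estimate such a parameter typically feeds into, under the assumption of compute-bound decoding at roughly 2 FLOPs per parameter per token; this is a back-of-envelope illustration, not the documented internals of `solve`.

```python
def decode_tokens_per_s(peak_flops: float, params: float,
                        efficiency: float = 0.5) -> float:
    """Compute-bound decode throughput: achieved FLOPs / FLOPs per token.

    Assumes ~2 FLOPs per parameter per generated token and ignores the
    memory-bandwidth bound that often dominates real decoding.
    """
    return efficiency * peak_flops / (2 * params)

# A100-class accelerator (~312 TFLOPS fp16) serving a 7B-parameter model:
print(f"{decode_tokens_per_s(312e12, 7e9):,.0f} tokens/s")  # ~11,143 upper bound
```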

#### Returns {.doc-section .doc-section-returns}

| Name | Type | Description |
|------|------|-------------|
| | ContinuousBatchingResult | Throughput (tokens/s), maximum active requests, memory fragmentation, time to first token (TTFT), inter-token latency (ITL), and speedup vs. static batching. |
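
A minimal end-to-end sketch follows. The `solve` signature and defaults match the reference above; the import path, the `TransformerWorkload`/`HardwareNode` constructor arguments (elided as `...`), and the shape of the printed result are assumptions to be checked against the respective reference pages.

```python
from mlsysim import ContinuousBatchingModel, HardwareNode, TransformerWorkload

# Constructor arguments elided; see the TransformerWorkload and HardwareNode
# reference pages for the actual parameters.
workload = TransformerWorkload(...)   # e.g. a 7B-parameter decoder-only LLM
hardware = HardwareNode(...)          # e.g. a single 80 GB accelerator

result = ContinuousBatchingModel().solve(
    workload,
    hardware,
    seq_len=2048,        # prompt + generated tokens
    max_batch_size=32,   # concurrent requests admitted per iteration
    page_size=16,        # PagedAttention page granularity, in tokens
    precision="fp16",
    efficiency=0.5,      # fraction of peak compute actually achieved
)
print(result)  # ContinuousBatchingResult: throughput, TTFT, ITL, speedup, ...
```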