cs249r_book/mlsysim/docs/api/core.solver.ContinuousBatchingSolver.qmd
Vijay Janapa Reddi 2bbe3e1a69 docs(mlsysim): redesign website, add 12 tutorials, and CLI entry points
Replace 9 old tutorials with 12 new numbered tutorials (00-11) covering
roofline through full-stack audit. Redesign landing page, add
models-and-solvers and extending-the-engine guides. Add __main__.py,
cli.py, and cli/ package for command-line interface.
2026-03-12 16:04:51 -04:00


# core.solver.ContinuousBatchingModel { #mlsysim.core.solver.ContinuousBatchingModel }
```python
core.solver.ContinuousBatchingModel()
```
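Analyzes production LLM serving with continuous batching and PagedAttention.
Static batching reserves a contiguous KV cache region per request and pads
shorter sequences to a common length, so memory is lost to fragmentation and
padding. This solver models the throughput gained from iteration-level
scheduling, where requests join and leave the batch at every decoding step,
and from non-contiguous, page-granular KV cache allocation.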
Literature Source:

1. Kwon et al. (2023), "Efficient Memory Management for Large Language Model
   Serving with PagedAttention."
2. Yu et al. (2022), "ORCA: A Distributed Serving System for
   Transformer-Based Generative Models."
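
The memory argument above can be illustrated with a few lines of arithmetic. The sketch below is not the solver's implementation; it only compares token slots reserved under static preallocation (every request gets the maximum sequence length) with page-granular allocation, where at most one partially filled page is wasted per request. The helper names and the request lengths are hypothetical.

```python
import math


def paged_kv_tokens(seq_len: int, page_size: int = 16) -> int:
    """Token slots reserved when the KV cache is allocated in whole pages."""
    return math.ceil(seq_len / page_size) * page_size


def wasted_fraction(used_tokens: int, reserved_tokens: int) -> float:
    """Share of reserved token slots that hold no real token."""
    return 1.0 - used_tokens / reserved_tokens


# Hypothetical workload: four requests with different final lengths.
actual_lengths = [100, 700, 1500, 2048]
max_seq_len = 2048

used = sum(actual_lengths)

# Static batching (assumed here): each request reserves max_seq_len slots.
static_reserved = max_seq_len * len(actual_lengths)

# Paged allocation: each request reserves only the pages it touches.
paged_reserved = sum(paged_kv_tokens(n) for n in actual_lengths)

static_waste = wasted_fraction(used, static_reserved)
paged_waste = wasted_fraction(used, paged_reserved)

print(f"static: {static_reserved} slots reserved, {static_waste:.1%} wasted")
print(f"paged:  {paged_reserved} slots reserved, {paged_waste:.1%} wasted")
```

With a page size of 16, the waste per request is bounded by 15 token slots regardless of how far the request falls short of the maximum length.
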
## Methods
| Name | Description |
| --- | --- |
| [solve](#mlsysim.core.solver.ContinuousBatchingModel.solve) | Solves for continuous batching throughput and PagedAttention memory. |
### solve { #mlsysim.core.solver.ContinuousBatchingModel.solve }
```python
core.solver.ContinuousBatchingModel.solve(
model,
hardware,
seq_len,
max_batch_size=1,
page_size=16,
precision='fp16',
efficiency=0.5,
)
```
Solves for continuous batching throughput and PagedAttention memory.
#### Parameters {.doc-section .doc-section-parameters}
| Name | Type | Description | Default |
|------|------|-------------|---------|
| model | TransformerWorkload | The LLM model architecture. | _required_ |
| hardware | HardwareNode | The target hardware for inference. | _required_ |
| seq_len | int | The total context window (prompt + generated tokens). | _required_ |
| max_batch_size | int | Maximum concurrent requests in the batch. | `1` |
| page_size | int | Tokens per KV cache page (PagedAttention granularity). | `16` |
| precision | str | Numerical format (fp16, int8, int4). | `'fp16'` |
| efficiency | float | Compute utilization efficiency (0.0 to 1.0). | `0.5` |
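
To see how `seq_len`, `page_size`, `precision`, and `max_batch_size` might interact, the following sketch estimates how many requests fit in a fixed KV cache budget. It is a back-of-the-envelope model under stated assumptions, not the code behind `solve`: the function names, the architecture numbers, the 20 GiB budget, and the choice to apply `precision` to the KV cache are all hypothetical.

```python
import math

# Assumed bytes per stored element for each precision string.
BYTES_PER_ELEMENT = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       precision: str = "fp16") -> float:
    """KV cache bytes for one token: keys and values across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * BYTES_PER_ELEMENT[precision]


def max_active_requests(kv_budget_bytes: float, seq_len: int,
                        bytes_per_token: float, page_size: int = 16,
                        max_batch_size: int = 1) -> int:
    """Requests that fit in the KV budget, capped by max_batch_size."""
    pages_per_request = math.ceil(seq_len / page_size)
    bytes_per_request = pages_per_request * page_size * bytes_per_token
    fit = int(kv_budget_bytes // bytes_per_request)
    return min(max_batch_size, fit)


# Hypothetical model: 32 layers, 32 KV heads, head dimension 128.
per_token_fp16 = kv_bytes_per_token(32, 32, 128, "fp16")
per_token_int8 = kv_bytes_per_token(32, 32, 128, "int8")

kv_budget = 20 * 1024**3  # hypothetical 20 GiB left for the KV cache

fp16_requests = max_active_requests(kv_budget, 2048, per_token_fp16,
                                    page_size=16, max_batch_size=64)
int8_requests = max_active_requests(kv_budget, 2048, per_token_int8,
                                    page_size=16, max_batch_size=64)

print(fp16_requests, int8_requests)
```

In this sketch the memory budget, not `max_batch_size`, is the binding limit; with the default `max_batch_size=1` the cap applies instead and only one request is active.
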
#### Returns {.doc-section .doc-section-returns}
| Name | Type | Description |
|------|------|-------------|
| | ContinuousBatchingResult | Throughput (tokens/s), maximum number of active requests, memory fragmentation, time to first token (TTFT), inter-token latency (ITL), and speedup relative to static batching. |