mirror of https://github.com/harvard-edge/cs249r_book.git, synced 2026-05-07 18:18:42 -05:00
# core.solver.ServingModel { #mlsysim.core.solver.ServingModel }

```python
core.solver.ServingModel()
```

Analyzes the two-phase LLM serving lifecycle: Pre-fill vs. Decoding.

LLM inference is not a single mathematical operation; it is a stateful process with two distinct physical regimes (Compute-bound Pre-fill and Memory-bound Decoding).

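The two regimes fall directly out of a roofline-style arithmetic-intensity argument. A minimal sketch (illustrative only, not part of the `mlsysim` API) using the common 2 × params FLOPs-per-token rule for a dense transformer:

```python
# Sketch: arithmetic intensity (FLOPs per byte of weight traffic) for the
# two serving phases of a hypothetical 7B-parameter fp16 model.

def arithmetic_intensity(params, tokens_processed, bytes_per_param=2):
    """FLOPs per byte of weight traffic for one forward pass step."""
    flops = 2 * params * tokens_processed   # matmul FLOPs for this step
    bytes_moved = params * bytes_per_param  # weights are read once per step
    return flops / bytes_moved

params = 7e9  # hypothetical 7B-parameter model

# Pre-fill: the whole prompt (say, 2048 tokens) is processed in one pass,
# so each weight read is amortized over many tokens -> compute-bound.
prefill_ai = arithmetic_intensity(params, tokens_processed=2048)

# Decode: one new token per step at batch size 1, so the full weight set
# is re-read to produce a single token -> memory-bound.
decode_ai = arithmetic_intensity(params, tokens_processed=1)

print(prefill_ai, decode_ai)  # → 2048.0 1.0
```

The three-orders-of-magnitude gap in arithmetic intensity is why the same model on the same hardware is compute-bound during pre-fill and memory-bandwidth-bound during decoding.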
Literature Sources:

1. Pope et al. (2023), "Efficiently Scaling Transformer Inference" (Inference Bottlenecks)
2. Aminabadi et al. (2022), "DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale"
3. Yu et al. (2022), "Orca: A Distributed Serving System for Transformer-Based Generative Models"

## Methods

| Name | Description |
| --- | --- |
| [solve](#mlsysim.core.solver.ServingModel.solve) | Solves for LLM serving performance. |

### solve { #mlsysim.core.solver.ServingModel.solve }

```python
core.solver.ServingModel.solve(
    model,
    hardware,
    seq_len,
    batch_size=1,
    precision='fp16',
    efficiency=0.5,
)
```

Solves for LLM serving performance.
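The page does not show the solver's internals, but a per-phase estimate of this kind is typically a roofline bound: latency is set by whichever resource, compute or memory traffic, is the bottleneck, derated by an `efficiency` factor like the signature's `efficiency=0.5` default. A hypothetical sketch (the actual `mlsysim` implementation may differ; the hardware numbers are illustrative A100-like values):

```python
# Hypothetical roofline-style step-time estimate; NOT the mlsysim internals.

def estimate_step_time(flops, bytes_moved, peak_flops, bandwidth,
                       efficiency=0.5):
    """Latency is set by the slower of compute and memory, derated."""
    compute_time = flops / (peak_flops * efficiency)
    memory_time = bytes_moved / (bandwidth * efficiency)
    return max(compute_time, memory_time)

# Illustrative A100-like hardware: 312 TFLOP/s fp16, 2 TB/s HBM.
peak, bw = 312e12, 2e12
params = 7e9  # hypothetical 7B model, fp16 weights (2 bytes each)

# Pre-fill of a 2048-token prompt: compute dominates.
prefill_t = estimate_step_time(2 * params * 2048, params * 2, peak, bw)

# One decode step at batch size 1: weight traffic dominates.
decode_t = estimate_step_time(2 * params, params * 2, peak, bw)
```

Under these assumptions the decode step takes roughly `params * 2 / (bw * 0.5)` seconds regardless of compute throughput, which is why batching and KV-cache-aware scheduling (as in Orca) are the main levers for decode throughput.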