# core.solver.ServingModel { #mlsysim.core.solver.ServingModel }

```python
core.solver.ServingModel()
```

Analyzes the two-phase LLM serving lifecycle: pre-fill vs. decoding.

LLM inference is not a single mathematical operation; it is a stateful process with two distinct physical regimes: a compute-bound pre-fill phase, which processes the entire prompt in parallel, and a memory-bound decoding phase, which generates one token at a time.

Literature sources:

1. Pope et al. (2023), "Efficiently Scaling Transformer Inference" (inference bottlenecks).
2. Aminabadi et al. (2022), "DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale."
3. Yu et al. (2022), "Orca: A Distributed Serving System for Transformer-Based Generative Models."

## Methods

| Name | Description |
| --- | --- |
| [solve](#mlsysim.core.solver.ServingModel.solve) | Solves for LLM serving performance. |

### solve { #mlsysim.core.solver.ServingModel.solve }

```python
core.solver.ServingModel.solve(
    model,
    hardware,
    seq_len,
    batch_size=1,
    precision='fp16',
    efficiency=0.5,
)
```

Solves for LLM serving performance across both phases, given a model, a hardware description, the prompt length, the batch size, the numeric precision, and a hardware efficiency factor.
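To make the two regimes concrete, below is a minimal roofline-style sketch of the analysis the class describes. This is not the mlsysim implementation: the function name, hardware numbers, and simplifications (no KV-cache traffic, no batching amortization) are illustrative assumptions. It shows why pre-fill is compute-bound (FLOPs grow with prompt length) while decoding is memory-bound (each generated token re-reads the full weight matrix from HBM).

```python
# Back-of-envelope sketch of the two-phase serving analysis.
# NOT the mlsysim implementation; all names and numbers here are
# illustrative assumptions (a 7B-parameter model on an A100-class GPU).

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def estimate_serving_time(n_params, peak_flops, mem_bw, seq_len,
                          gen_len, batch_size=1, precision="fp16",
                          efficiency=0.5):
    """Roofline-style estimate of pre-fill and total decode latency."""
    # Pre-fill: each prompt token costs ~2 * n_params FLOPs (one
    # multiply-accumulate per weight), and the phase is compute-bound.
    prefill_flops = 2 * n_params * seq_len * batch_size
    prefill_s = prefill_flops / (peak_flops * efficiency)

    # Decode: each step streams the full weight matrix from HBM once,
    # so per-token latency is bound by memory bandwidth, not FLOPs.
    # (Ignores KV-cache reads and the fact that a larger batch shares
    # each weight read, which is what makes batching pay off.)
    weight_bytes = n_params * BYTES_PER_PARAM[precision]
    per_token_s = weight_bytes / mem_bw
    decode_s = per_token_s * gen_len

    return prefill_s, decode_s

# Illustrative numbers: 7e9 params, 312 TFLOP/s fp16, 2 TB/s HBM.
prefill, decode = estimate_serving_time(
    n_params=7e9, peak_flops=312e12, mem_bw=2e12,
    seq_len=1024, gen_len=256,
)
print(f"pre-fill ~ {prefill * 1e3:.1f} ms, decode ~ {decode * 1e3:.1f} ms")
```

With these assumed numbers, pre-filling a 1,024-token prompt takes roughly 90 ms while decoding 256 tokens takes almost 2 s; since every decode step re-reads the weights, batching amortizes that bandwidth cost, which is the observation behind batched serving systems such as Orca (Yu et al., 2022). The `efficiency=0.5` default above mirrors the `solve` signature's hardware efficiency factor.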