[PR #10596] CPU Model Performance Optimization for Ollama #13290

opened 2026-04-13 00:23:01 -05:00 by GiteaMirror · 0 comments

Original Pull Request: https://github.com/ollama/ollama/pull/10596

State: open
Merged: No


CPU Model Performance Optimization for Ollama

I've implemented a solution to reduce the high first-token latency when running models in CPU mode.

What Was Implemented

  1. Created a ModelPreloader Component

    • Added a background process that actively keeps model data in memory
    • Implemented periodic memory page touching to prevent swapping
    • Uses the system-wide OLLAMA_KEEP_ALIVE setting for inactivity timeout
  2. Made the Feature Optional with Environment Variable

    • Added OLLAMA_PRELOAD_CPU_MODEL environment variable
    • The feature is disabled by default for compatibility; whether to make it permanent will be decided later
    • Users can explicitly enable it when needed
  3. Integrated with Existing Ollama Settings

    • Uses the same global timeout (OLLAMA_KEEP_ALIVE) that controls how long models stay loaded
    • Stays consistent with existing Ollama behavior and configuration
    • Respects user preferences for resource management
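
To illustrate the component described above, here is a minimal sketch of a background loop that keeps touching model data until an inactivity timeout expires. It is written in Python for brevity and is not the PR's code: the class name is borrowed from the description, while `touch`, `keep_alive`, `interval`, and `mark_used` are hypothetical names chosen for this example.

```python
import threading
import time
from typing import Callable


class ModelPreloader:
    """Hypothetical sketch: call `touch` every `interval` seconds
    until no activity has been recorded for `keep_alive` seconds."""

    def __init__(self, touch: Callable[[], None], keep_alive: float, interval: float) -> None:
        self._touch = touch            # assumed callback that re-reads the model's pages
        self._keep_alive = keep_alive  # inactivity timeout in seconds
        self._interval = interval      # delay between touches in seconds
        self._last_used = time.monotonic()
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def mark_used(self) -> None:
        """Record activity (for example, a new request) and reset the idle timer."""
        with self._lock:
            self._last_used = time.monotonic()

    def stop(self) -> None:
        """Stop the background loop early; must be called after start()."""
        self._stop.set()
        self._thread.join()

    def is_running(self) -> bool:
        return self._thread.is_alive()

    def _run(self) -> None:
        # Event.wait returns False on timeout and True once stop() was called.
        while not self._stop.wait(self._interval):
            with self._lock:
                idle = time.monotonic() - self._last_used
            if idle >= self._keep_alive:
                break  # model has been inactive long enough; stop touching
            self._touch()
```

In this sketch the loop ends on its own once the model has been idle for `keep_alive` seconds, which mirrors the automatic cleanup described in the PR; a real implementation would also release the mapped file at that point.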

How to Use the Feature

To enable CPU model preloading and reduce first token latency:

OLLAMA_PRELOAD_CPU_MODEL=1 ollama run <model>

The preloader will keep the model in memory for the duration specified by OLLAMA_KEEP_ALIVE (defaults to 5 minutes of inactivity), which can also be customized:

OLLAMA_PRELOAD_CPU_MODEL=1 OLLAMA_KEEP_ALIVE=10m ollama run <model>
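
As a rough illustration of how these two variables might be interpreted, the sketch below reads them from an environment mapping. It is an assumption-laden simplification, not the PR's parsing logic: the helper names are hypothetical, the accepted truthy values for `OLLAMA_PRELOAD_CPU_MODEL` are assumed to be `1` and `true`, and only single-unit durations such as `30s`, `10m`, or `1h` (or a bare number of seconds) are handled.

```python
import re
from typing import Mapping

# The 5-minute default comes from the description above.
DEFAULT_KEEP_ALIVE_SECONDS = 5 * 60

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def preload_enabled(env: Mapping[str, str]) -> bool:
    """Return True only when the feature is explicitly enabled (off by default).

    Assumption: "1" or "true" (case-insensitive) enables the feature.
    """
    value = env.get("OLLAMA_PRELOAD_CPU_MODEL", "").strip().lower()
    return value in {"1", "true"}


def keep_alive_seconds(env: Mapping[str, str]) -> float:
    """Return the inactivity timeout in seconds, falling back to the default."""
    raw = env.get("OLLAMA_KEEP_ALIVE", "").strip()
    if not raw:
        return float(DEFAULT_KEEP_ALIVE_SECONDS)
    if re.fullmatch(r"\d+", raw):
        return float(raw)  # assumption: a bare number means seconds
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([smh])", raw)
    if match is None:
        raise ValueError(f"unsupported OLLAMA_KEEP_ALIVE value: {raw!r}")
    return float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
```

Passing the environment as a mapping (for example `os.environ`) rather than reading it globally keeps the helpers easy to test.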

When enabled, Ollama will:

  1. Load the model as usual using memory-mapped files
  2. Actively read through the entire model file to ensure pages are in RAM
  3. Keep the model data in memory until the inactivity timeout is reached
  4. Clean up resources automatically when the model becomes inactive
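
Step 2 above can be sketched as follows. This is an illustrative Python example, not the PR's code: `touch_pages` is a hypothetical helper that maps a file read-only and reads one byte from each page, which causes the operating system to fault every page into RAM.

```python
import mmap
import os


def touch_pages(path: str) -> int:
    """Map `path` read-only, read one byte per page, and return the pages touched."""
    size = os.path.getsize(path)
    if size == 0:
        return 0  # an empty file cannot be memory-mapped
    touched = 0
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        for offset in range(0, size, mmap.PAGESIZE):
            _ = mapped[offset]  # reading a single byte faults the whole page in
            touched += 1
    return touched
```

Note that reading pages only brings them into memory; it does not pin them there. That is presumably why the description pairs the initial read with periodic re-touching, so the pages stay recently used and are less likely to be evicted.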

This reduces the latency of first-token generation, because the model's pages are already in RAM when generation starts, while still using memory-mapped files.

GiteaMirror added the pull-request label 2026-04-13 00:23:01 -05:00

Reference: github-starred/ollama#13290