[PR #10596] CPU Model Performance Optimization for Ollama #23830

Open
opened 2026-04-19 17:14:40 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/10596
Author: @WingsDrafterwork
Created: 5/6/2025
Status: 🔄 Open

Base: main ← Head: better_cpu_memory


📝 Commits (8)

  • d36383c Optimize CPU model performance by implementing model preloading to reduce first token latency
  • dad09a0 Make CPU model preloading optional with OLLAMA_PRELOAD_CPU_MODEL env var
  • 19dac72 Use system-wide KeepAlive setting for model preloader inactivity timeout
  • 54db987 Merge branch 'ollama:main' into better_cpu_memory
  • 56b9058 Merge branch 'main' into better_cpu_memory
  • aba02a5 Optimize CPU model performance by implementing model preloading to reduce first token latency
  • ee3eed7 Use system-wide KeepAlive setting for model preloader inactivity timeout
  • c488a55 Implement memory locking for model and KV cache to prevent page faults

📊 Changes

4 files changed (+475 additions, -19 deletions)


📝 envconfig/config.go (+3 -0)
📝 kvcache/causal.go (+73 -0)
➕ llm/preload.go (+272 -0)
📝 llm/server.go (+127 -19)

📄 Description

CPU Model Performance Optimization for Ollama

I've implemented a solution to address the high first-token latency when running models in CPU mode.

What Was Implemented

  1. Created a ModelPreloader Component

    • Added a background process that actively keeps model data in memory
    • Implemented periodic memory page touching to prevent swapping
    • Uses the system-wide OLLAMA_KEEP_ALIVE setting for inactivity timeout
  2. Made the Feature Optional with Environment Variable

    • Added OLLAMA_PRELOAD_CPU_MODEL environment variable
    • Feature is disabled by default for compatibility; whether to enable it permanently can be decided later
    • Users can explicitly enable it when needed
  3. Integrated with Existing Ollama Settings

    • Uses the same global timeout (OLLAMA_KEEP_ALIVE) that controls how long models stay loaded
    • Stays consistent with existing Ollama behavior and configuration
    • Respects user preferences for resource management
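
The preloader described above can be sketched roughly as follows. This is a minimal illustration, not the code in llm/preload.go: the `Preloader` type, `touchPages`, `MarkUsed`, and `Run` are hypothetical names, and a plain byte slice stands in for the memory-mapped model.

```go
package main

import (
	"fmt"
	"os"
	"sync"
	"time"
)

// pageSize is the OS page granularity used when touching memory.
var pageSize = os.Getpagesize()

// Preloader is a hypothetical stand-in for the PR's ModelPreloader.
type Preloader struct {
	mu       sync.Mutex
	data     []byte        // stands in for the memory-mapped model data
	lastUsed time.Time     // time of the most recent request
	timeout  time.Duration // inactivity timeout, e.g. taken from OLLAMA_KEEP_ALIVE
	done     chan struct{} // closed when Run returns
}

func NewPreloader(data []byte, timeout time.Duration) *Preloader {
	return &Preloader{data: data, lastUsed: time.Now(), timeout: timeout, done: make(chan struct{})}
}

// touchPages reads one byte from every page so the OS sees the pages as
// recently used. The checksum keeps the reads from being optimized away.
func touchPages(data []byte) (sum byte, pages int) {
	for i := 0; i < len(data); i += pageSize {
		sum += data[i]
		pages++
	}
	return sum, pages
}

// MarkUsed records activity and so resets the inactivity timer.
func (p *Preloader) MarkUsed() {
	p.mu.Lock()
	p.lastUsed = time.Now()
	p.mu.Unlock()
}

// Run touches the pages on every tick until the model has been idle
// for at least the configured timeout, then returns.
func (p *Preloader) Run(interval time.Duration) {
	defer close(p.done)
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		p.mu.Lock()
		idle := time.Since(p.lastUsed)
		p.mu.Unlock()
		if idle >= p.timeout {
			return // inactive for too long: stop keeping pages warm
		}
		touchPages(p.data)
	}
}

func main() {
	model := make([]byte, 8*pageSize)
	p := NewPreloader(model, 50*time.Millisecond)
	go p.Run(10 * time.Millisecond)
	p.MarkUsed()
	<-p.done
	fmt.Println("preloader stopped after inactivity")
}
```

Touching pages only makes them look recently used to the kernel; it does not pin them. A hard guarantee against swapping would need memory locking, which is what commit c488a55 refers to.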

How to Use the Feature

To enable CPU model preloading and reduce first token latency:

OLLAMA_PRELOAD_CPU_MODEL=1 ollama run <model>

The preloader will keep the model in memory for the duration specified by OLLAMA_KEEP_ALIVE (defaults to 5 minutes of inactivity), which can also be customized:

OLLAMA_PRELOAD_CPU_MODEL=1 OLLAMA_KEEP_ALIVE=10m ollama run <model>
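
For illustration, the two settings could be read along these lines. This is a sketch under assumptions, not the code added to envconfig/config.go: `preloadEnabled` and `keepAlive` are hypothetical helpers, and treating a bare integer as a number of seconds is an assumption of this example.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// defaultKeepAlive mirrors the 5-minute default mentioned in the description.
const defaultKeepAlive = 5 * time.Minute

// preloadEnabled reports whether OLLAMA_PRELOAD_CPU_MODEL is set to a
// true-like value ("1", "true", ...). Unset or invalid values mean disabled.
func preloadEnabled(getenv func(string) string) bool {
	v, err := strconv.ParseBool(getenv("OLLAMA_PRELOAD_CPU_MODEL"))
	return err == nil && v
}

// keepAlive parses OLLAMA_KEEP_ALIVE as a Go duration string such as "10m".
// Assumption of this sketch: a bare integer is interpreted as seconds.
func keepAlive(getenv func(string) string) time.Duration {
	s := getenv("OLLAMA_KEEP_ALIVE")
	if s == "" {
		return defaultKeepAlive
	}
	if d, err := time.ParseDuration(s); err == nil {
		return d
	}
	if n, err := strconv.Atoi(s); err == nil {
		return time.Duration(n) * time.Second
	}
	return defaultKeepAlive
}

func main() {
	fmt.Println("preload enabled:", preloadEnabled(os.Getenv))
	fmt.Println("keep-alive:", keepAlive(os.Getenv))
}
```

Passing the lookup function as a parameter keeps the helpers testable without changing the process environment.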

When enabled, Ollama will:

  1. Load the model as usual using memory-mapped files
  2. Actively read through the entire model file to ensure pages are in RAM
  3. Keep the model data in memory until the inactivity timeout is reached
  4. Clean up resources automatically when the model becomes inactive

Because the model's pages are already resident when the first request arrives, first-token generation avoids page faults on the model file, which reduces its latency while still using memory-mapped files.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-19 17:14:40 -05:00
Reference: github-starred/ollama#23830