[PR #6905] [MERGED] runner: Set windows above normal priority for consistent CPU inference performance #74549

Closed
opened 2026-05-05 06:42:01 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/6905
Author: @dhiltgen
Created: 9/21/2024
Status: Merged
Merged: 9/21/2024
Merged by: @dhiltgen

Base: main ← Head: win_cpu_perf


📝 Commits (1)

  • f248254 runner: Set windows above normal priority

📊 Changes

1 file changed (+8 additions, -2 deletions)


📝 llm/llm_windows.go (+8 -2)

📄 Description

When running the subprocess as a background service, Windows may throttle it, which can lead to thrashing and very poor token rates.

Fixes #3511

I've now reproduced the performance problem and understand what is going on that leads to poor CPU inference performance for some users on Windows.

Windows treats GUI apps and background services differently. By default, priority is given to GUI apps ("Programs").

![image](https://github.com/user-attachments/assets/33f6d5f4-74d5-4d35-9e33-6c3d45865687)

While you can change this setting, doing so isn't typically recommended. The result is that when the tray app starts automatically at login, or is launched by clicking "Ollama" from the Start menu, it runs as a "background service", and the subprocess inherits that treatment. In some situations this can lead to the subprocess runner being throttled, and since we try to create as many threads as there are cores, the result is thrashing behavior and very poor token rates.

![image](https://github.com/user-attachments/assets/55207d05-4db3-4180-8d8d-9775bb531049)
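For anyone trying to reproduce this, one quick diagnostic is to print which priority class the process is actually running under when launched from the tray app versus from a terminal. This is a minimal sketch (not part of this PR's diff) that assumes golang.org/x/sys/windows is available:

```go
//go:build windows

package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/windows"
)

// Print the priority class of the current process. Comparing the output when
// started from the tray app at login versus from a terminal helps show how
// the process is being scheduled in each case.
func main() {
	h := windows.CurrentProcess() // pseudo-handle; does not need to be closed
	cls, err := windows.GetPriorityClass(h)
	if err != nil {
		fmt.Fprintln(os.Stderr, "GetPriorityClass:", err)
		os.Exit(1)
	}
	fmt.Printf("priority class: 0x%x (NORMAL=0x%x, ABOVE_NORMAL=0x%x)\n",
		cls, windows.NORMAL_PRIORITY_CLASS, windows.ABOVE_NORMAL_PRIORITY_CLASS)
}
```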

Setting the scheduler priority class to "above normal" allows more CPU usage and higher CPU load, leading to much better token rates.

![image](https://github.com/user-attachments/assets/70d958ab-3492-4419-9130-5895bbcd1294)
![image](https://github.com/user-attachments/assets/2f2c0ff8-69ad-4062-bdc1-aad28cd0235f)
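For reference, CreateProcess accepts a priority class as part of its creation flags, so the class can be applied at spawn time from Go via SysProcAttr. The sketch below illustrates the approach; the package and function names are illustrative and this is not the exact change in llm/llm_windows.go:

```go
//go:build windows

package runner

import (
	"os/exec"
	"syscall"

	"golang.org/x/sys/windows"
)

// startRunner spawns the runner subprocess with the ABOVE_NORMAL priority
// class. Because the class is passed in the CreateProcess creation flags,
// the child starts at that class instead of simply inheriting whatever
// scheduling treatment the parent (e.g. the tray app) was given.
func startRunner(path string, args ...string) (*exec.Cmd, error) {
	cmd := exec.Command(path, args...)
	cmd.SysProcAttr = &syscall.SysProcAttr{
		CreationFlags: windows.ABOVE_NORMAL_PRIORITY_CLASS,
	}
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	return cmd, nil
}
```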

I also tried setting the scheduler priority to "high" (not recommended in the API docs), and that did lead to the UI freezing and, after ~1 minute, a BSOD and reboot caused by a watchdog detecting that the system had become unresponsive. The "Above Normal" setting does not appear to have any noticeable impact on UI performance, and token rates match those seen when the ollama server is run from a terminal (and so inherits the "Program" priority). I do see CPU utilization drop slightly over time, so Windows does seem to penalize CPU-saturating background processes; for very large context sizes there may still be room to improve performance further.
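As a side note on that experiment: the priority class of an already running runner can also be changed after launch with SetPriorityClass, which makes it easy to compare "Above Normal" against other classes without relaunching. A hedged sketch (names are illustrative, not from the PR; the PROCESS_SET_INFORMATION access-right value is the standard Win32 constant):

```go
//go:build windows

package runner

import "golang.org/x/sys/windows"

// processSetInformation is the Win32 PROCESS_SET_INFORMATION access right,
// required to change another process's priority class.
const processSetInformation = 0x0200

// setPriority switches the priority class of a running process, e.g. to
// compare windows.ABOVE_NORMAL_PRIORITY_CLASS against windows.HIGH_PRIORITY_CLASS.
func setPriority(pid uint32, class uint32) error {
	h, err := windows.OpenProcess(processSetInformation, false, pid)
	if err != nil {
		return err
	}
	defer windows.CloseHandle(h)
	return windows.SetPriorityClass(h, class)
}
```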

Reference:

  • https://learn.microsoft.com/en-us/windows/win32/procthread/scheduling-priorities#priority-class

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-05-05 06:42:02 -05:00

Reference: github-starred/ollama#74549