[GH-ISSUE #4477] Expose Max threads as an environment variable or set ollama to use all the cores/threads a CPU provides #2799

Closed
opened 2026-04-12 13:07:54 -05:00 by GiteaMirror · 7 comments

Originally created by @haydonryan on GitHub (May 16, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4477

Originally assigned to: @dhiltgen on GitHub.

After seeing https://github.com/ollama/ollama/issues/2929, I'm having the same issue. As I'm using both open-webui and Enchanted on iOS, queries are only using half of the CPU on my EPYC 7302P.

I know you can set a /parameter when using the CLI, but I want to set this as the default for serving. Alternatively, is there a reason that ollama isn't using all the available threads of the host CPU? Seems like something that could be the default.

That said, it would be awesome to expose this as an environment variable option, for those who don't want to use the whole CPU (e.g., if you're running this on your desktop while coding).
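For readers looking for the per-request workaround referenced above, here is a minimal sketch against the REST API. `num_thread` is the documented option name; the model name and prompt are placeholders, and this assumes a local `ollama serve` on the default port:

```python
import json
import urllib.request

# Per-request override: options.num_thread maps to the same knob as the
# CLI's "/set parameter num_thread". Placeholder model name below.
payload = {
    "model": "llama3",               # placeholder model name
    "prompt": "Why is the sky blue?",
    "stream": False,
    "options": {"num_thread": 16},   # number of CPU threads to use
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```

This sets the value per request rather than as a serving default, which is exactly the gap the issue is asking to close.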

GiteaMirror added the feature request label 2026-04-12 13:07:54 -05:00

@HamzaYslmn commented on GitHub (May 24, 2024):

I have the same problem with the Python SDK.
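The Python SDK accepts the same per-request option; a minimal sketch (assumes the `ollama` package, placeholder model name):

```python
import ollama

# options.num_thread is passed through to the runner, same as the REST API.
resp = ollama.generate(
    model="llama3",  # placeholder model name
    prompt="Why is the sky blue?",
    options={"num_thread": 16},
)
print(resp["response"])
```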


@AlvinNorin commented on GitHub (Oct 13, 2024):

It's been a while since May. How might I go about implementing this myself? ...


@dhiltgen commented on GitHub (Oct 23, 2024):

@haydonryan can you try the latest release? We've adjusted the algorithm for determining the default thread count to take sockets, cores (performance and efficiency), and hyperthreads into consideration, to try to arrive at the optimal default value. We don't have proper NUMA support yet, so we'll be limited to the physical cores in one socket.
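Roughly speaking, the described default is "physical cores of one socket, not SMT siblings". A Linux-only illustration of that counting (a sketch of the idea, not Ollama's actual Go implementation):

```python
# Count unique (socket, core id) pairs from /proc/cpuinfo so SMT siblings
# are not double-counted, then stay within one socket (no NUMA support yet).
def physical_cores_one_socket() -> int:
    sockets = {}   # physical id -> set of core ids
    block = {}     # fields of the current per-CPU block
    with open("/proc/cpuinfo") as f:
        for line in f:
            if ":" not in line:   # a blank line ends a per-CPU block
                block = {}
                continue
            key, _, val = line.partition(":")
            key, val = key.strip(), val.strip()
            block[key] = val
            if key == "core id":
                # "physical id" appears earlier in the same block
                sockets.setdefault(block.get("physical id", "0"), set()).add(val)
    return max((len(cores) for cores in sockets.values()), default=1)

print(physical_cores_one_socket())
```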


@haydonryan commented on GitHub (Oct 24, 2024):

Absolutely - I'm happy to say it uses all of the CPU now, for every model. Thanks for this! (Tested on the 16-core EPYC.)


@trevorboydsmith commented on GitHub (Dec 4, 2024):

For posterity and clarification: there is no environment variable to control the number of CPU cores right now. The current default behavior is to use all the physical cores. (This is my current understanding; please correct me if I'm wrong.)

I'm posting this here because the title says "Expose Max threads as an environment variable" and the issue is closed, so I had to read the issue again before noticing that it also says "or set ollama to use all the cores".
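A quick way to check what a loaded model actually got (a sketch; assumes Linux and that the runner process is named ollama_llama_server, as in the ps output further down this thread):

```python
import os
import subprocess

# Compare the machine's logical CPU count to the --threads value the
# runner was actually launched with.
print("logical CPUs:", os.cpu_count())
out = subprocess.run(
    ["pgrep", "-af", "ollama_llama_server"],
    capture_output=True, text=True,
).stdout
for line in out.splitlines():
    print(line)  # inspect the --threads flag in the command line
```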


@zhifei92 commented on GitHub (Dec 25, 2024):

I am encountering the same issue with Ollama 0.5.4 and would like to report a problem related to the max threads configuration.


@foxmulder32 commented on GitHub (Feb 13, 2025):

I have this problem right now on an "AMD Ryzen 9 7900X 12-Core Processor". I understand the logic behind the "fake" 24 cores on this chip, but I need the ability to use all 24 of them. This problem forced me to switch to llama.cpp, which uses all cores and as a result is almost 2x faster than Ollama. I just need a way to manually change the --threads count in:

```
user@jiz ~ $ ps aux | grep llama
user 1103 1077 9.9 8729856 6448988 pts/3 Sl+ 12:25 98:55 /usr/local/lib/ollama/runners/cpu_avx2/ollama_llama_server runner --model /media/4tb/home/llm/.ollama/models/blobs/sha256-b3f4985f0045995ee2882b5206aad78e66d9d594041a1376dd3f844eda5a0933 --ctx-size 8192 --batch-size 512 --threads 12 --no-mmap --parallel 4 --port 34677
```
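The per-request workaround from earlier in the thread should apply here too: asking for num_thread=24 is expected to feed the runner's --threads flag (a sketch with the Python client; the model name is a placeholder):

```python
import ollama

# Override the default of 12 physical cores and request all 24 SMT threads.
resp = ollama.chat(
    model="llama3",  # placeholder model name
    messages=[{"role": "user", "content": "quick benchmark prompt"}],
    options={"num_thread": 24},
)
print(resp["message"]["content"])
```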
