[GH-ISSUE #2496] default num_thread incorrect on some large core count system (non-hyperthreading) #47971

Closed
opened 2026-04-28 06:14:11 -05:00 by GiteaMirror · 41 comments
Owner

Originally created by @mokkin on GitHub (Feb 14, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2496

Originally assigned to: @dhiltgen on GitHub.

I have tested Ollama on several different machines, but no matter how many cores or how much RAM I have, it only uses 50% of the cores and just a few GB of RAM.
For example, right now I'm running `ollama run llama2:70b` on a 16-core server with 32 GB of RAM, but while prompting only eight cores are used and only around 1 GB of RAM.

Is there something wrong? The model descriptions always warn that you need at least 8, 16, 32, ... GB of RAM.

![Bildschirmfoto vom 2024-02-14 18-08-47](https://github.com/ollama/ollama/assets/2938748/8a47ec55-475d-4311-8110-3ca1e0a34cb8)

GiteaMirror added the bug label 2026-04-28 06:14:11 -05:00

@easp commented on GitHub (Feb 14, 2024):

That's fine & as expected.

Model data is memory mapped and shows up in the file cache numbers. Note too the VIRT, RES & SHR memory figures of the Ollama processes.

Generation is memory-bandwidth limited, not compute limited. Saturation is generally reached at roughly half the number of virtual cores. Using more threads can actually hurt speed and interferes unnecessarily with other processes.
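
If you want to sanity-check this on your own hardware, here is a rough sketch (assuming a build where `ollama run --verbose` prints timing stats; the model name is just an example):

```
# Compare the reported "eval rate" for the same prompt at different thread counts.
ollama run llama2 --verbose
>>> /set parameter num_thread 4
>>> why is the sky blue?
>>> /set parameter num_thread 8
>>> why is the sky blue?
# If the eval rate stops improving (or drops) as num_thread grows, you have hit
# the memory-bandwidth saturation point described above.
```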


@Zbrooklyn commented on GitHub (Feb 18, 2024):

What if I want it to use more CPU cores? Is there a config file or command for that?


@robertvazan commented on GitHub (Feb 23, 2024):

@Zbrooklyn Change the `num_thread` [parameter in a custom modelfile](https://github.com/ollama/ollama/blob/main/docs/modelfile.md#parameter).
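
For anyone unsure what that looks like end to end, a minimal sketch (the base model, the thread count, and the name `llama2-16t` are placeholders, not values from this thread):

```
# Write a Modelfile that overrides num_thread, then register and run it.
cat > Modelfile <<'EOF'
FROM llama2
PARAMETER num_thread 16
EOF
ollama create llama2-16t -f Modelfile   # builds the customized model
ollama run llama2-16t                   # runs with the overridden thread count
```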


@Cguanqin commented on GitHub (Mar 8, 2024):

> @Zbrooklyn Change the `num_thread` [parameter in a custom modelfile](https://github.com/ollama/ollama/blob/main/docs/modelfile.md#parameter).

Hey, bro~
It's still using 50% of the cores and 50% of the RAM after modifying the Modelfile to increase num_thread from 8 to 16.

My Modelfile is as follows:

```
FROM gemma:2b
PARAMETER num_thread 16
```


@robertvazan commented on GitHub (Mar 8, 2024):

@Cguanqin You are probably doing something wrong. Try to write down reproducible steps. That will probably reveal the mistake.


@Cguanqin commented on GitHub (Mar 8, 2024):

> @Cguanqin You are probably doing something wrong. Try to write down reproducible steps. That will probably reveal the mistake.

Oh, I don't know where the problem is. I want to run the gemma:2b model in Ollama. My machine has 16 CPUs and 16 GB of RAM. Can you provide a modelfile reference? Thank you!


@robertvazan commented on GitHub (Mar 8, 2024):

@Cguanqin Your modelfile looks alright. The problem is probably in the way you try to use it. I am using Open WebUI with Ollama, which provides a convenient UI for modelfile editing. But if I were you, I wouldn't waste effort on raising the thread count. I tried it and it does indeed worsen throughput. One thread per physical core is indeed optimal.


@AdamYLK commented on GitHub (Mar 15, 2024):

Same issue for me. I found another related issue:
[Performance e-core bug(?) - only 50% CPU utilization when using all threads - (Win11, Intel 13900k)](https://github.com/ggerganov/llama.cpp/issues/842)

I can't use `-t` to set the thread count on my Linux server.


@arbarbieri commented on GitHub (Apr 1, 2024):

> @Zbrooklyn Change the `num_thread` [parameter in a custom modelfile](https://github.com/ollama/ollama/blob/main/docs/modelfile.md#parameter).

Worked for me! Thanx!


@johnalanwoods commented on GitHub (Apr 23, 2024):

> Model data is memory mapped and shows up in the file cache numbers. Note too the VIRT, RES & SHR memory figures of the Ollama processes.

Sorry to necro the thread, but why is this done specifically? Are only random portions of the model used at runtime?


@robertvazan commented on GitHub (Apr 23, 2024):

@johnalanwoods Not a llama.cpp/ollama developer, but let me guess. One advantage is that you can use a model larger than RAM. It will be super slow, limited by SSD speed, but it will work. Another reason is that the OS is given a chance to discard the pages when the model is not in use and load them back when the model is used again. There might be other reasons.
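
If you want to see this effect yourself on Linux, a rough check (assuming a per-user install, where the model blobs live under `~/.ollama/models` by default):

```
free -h                  # note the "buff/cache" column before loading a model
ollama run llama2 "hi"   # the model file is mmap'd rather than read onto the heap
free -h                  # buff/cache grows by roughly the model size; "used" barely moves
```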


@johnalanwoods commented on GitHub (Apr 24, 2024):

Makes sense thanks!


@sekrett commented on GitHub (Apr 29, 2024):

> But if I were you, I wouldn't waste effort on raising the thread count. I tried it and it does indeed worsen throughput. One thread per physical core is indeed optimal.

This is true if you have hyper-threading enabled, but if disabled, only half of the CPU power is used.


@dhiltgen commented on GitHub (May 2, 2024):

Yes, by default we only create a thread per physical core. Trying to map inference threads to hyperthreads thrashes the CPU and results in poorer performance.
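
On Linux you can compare the logical CPU count the scheduler reports with the physical core count using standard tools, for example:

```
nproc    # logical CPUs, hyperthreads included
lscpu | grep -E 'Thread\(s\) per core|Core\(s\) per socket|Socket\(s\)'
# physical cores = Core(s) per socket x Socket(s); the default num_thread is
# intended to match that number, per the comment above.
```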


@sekrett commented on GitHub (May 3, 2024):

I don't have hyper-threading at all (Core i5 9600K) and still only 3 of 6 cores are used. How does ollama determine whether hyper-threading is enabled?


@dhiltgen commented on GitHub (May 3, 2024):

@sekrett the following might help shed some light...

```
ls /sys/devices/system/cpu/
cat /sys/devices/system/cpu/cpu*/topology/thread_siblings
```

For reference, on a 4-core (8 hyperthread) Intel CPU I see something like this:

```
% ls /sys/devices/system/cpu/
cpu0  cpu2  cpu4  cpu6  cpufreq  hotplug       isolated    microcode  offline  possible  present  uevent
cpu1  cpu3  cpu5  cpu7  cpuidle  intel_pstate  kernel_max  modalias   online   power     smt      vulnerabilities
% cat /sys/devices/system/cpu/cpu*/topology/thread_siblings
11
22
44
88
11
22
44
88
```

(Each `thread_siblings` line is a hexadecimal CPU mask; `11` has bits 0 and 4 set, meaning cpu0 and cpu4 are hyperthread siblings, so this machine has 4 physical cores.)

@wwjCMP commented on GitHub (May 3, 2024):

> @sekrett the following might help shed some light...
>
> ```
> ls /sys/devices/system/cpu/
> cat /sys/devices/system/cpu/cpu*/topology/thread_siblings
> ```
>
> For reference, on a 4-core (8 hyperthread) Intel CPU I see something like this:
>
> ```
> % ls /sys/devices/system/cpu/
> cpu0  cpu2  cpu4  cpu6  cpufreq  hotplug       isolated    microcode  offline  possible  present  uevent
> cpu1  cpu3  cpu5  cpu7  cpuidle  intel_pstate  kernel_max  modalias   online   power     smt      vulnerabilities
> % cat /sys/devices/system/cpu/cpu*/topology/thread_siblings
> 11
> 22
> 44
> 88
> 11
> 22
> 44
> 88
> ```

If my machine does not have hyper-threading enabled, how can I set it up to allow more physical cores to be used by ollama?


@dhiltgen commented on GitHub (May 3, 2024):

@wwjCMP if your CPU does not have hyperthreads, then the "thread_siblings" above are supposed to show no siblings, and we should default to one thread per core. If that's not the behavior you're seeing, can you share the output of the above from your system?


@wwjCMP commented on GitHub (May 3, 2024):

```
cpu0   cpu14  cpu2   cpu25  cpu30  cpu36  cpu41  cpu47  cpu52  cpu58  cpu63  cpu69  cpu74  cpu8   cpu85  cpu90  cpufreq                 kernel_max  power
cpu1   cpu15  cpu20  cpu26  cpu31  cpu37  cpu42  cpu48  cpu53  cpu59  cpu64  cpu7   cpu75  cpu80  cpu86  cpu91  cpuidle                 microcode   present
cpu10  cpu16  cpu21  cpu27  cpu32  cpu38  cpu43  cpu49  cpu54  cpu6   cpu65  cpu70  cpu76  cpu81  cpu87  cpu92  hotplug                 modalias    smt
cpu11  cpu17  cpu22  cpu28  cpu33  cpu39  cpu44  cpu5   cpu55  cpu60  cpu66  cpu71  cpu77  cpu82  cpu88  cpu93  intel_pstate            offline     uevent
cpu12  cpu18  cpu23  cpu29  cpu34  cpu4   cpu45  cpu50  cpu56  cpu61  cpu67  cpu72  cpu78  cpu83  cpu89  cpu94  intel_uncore_frequency  online      umwait_control
cpu13  cpu19  cpu24  cpu3   cpu35  cpu40  cpu46  cpu51  cpu57  cpu62  cpu68  cpu73  cpu79  cpu84  cpu9   cpu95  isolated                possible    vulnerabilities
```

@wwjCMP commented on GitHub (May 3, 2024):

```
00000000,00000000,00000001
00000000,00000000,00000400
00000000,00000000,00000800
00000000,00000000,00001000
00000000,00000000,00002000
00000000,00000000,00004000
00000000,00000000,00008000
00000000,00000000,00010000
00000000,00000000,00020000
00000000,00000000,00040000
00000000,00000000,00080000
00000000,00000000,00000002
00000000,00000000,00100000
00000000,00000000,00200000
00000000,00000000,00400000
00000000,00000000,00800000
00000000,00000000,01000000
00000000,00000000,02000000
00000000,00000000,04000000
00000000,00000000,08000000
00000000,00000000,10000000
00000000,00000000,20000000
00000000,00000000,00000004
00000000,00000000,40000000
00000000,00000000,80000000
00000000,00000001,00000000
00000000,00000002,00000000
00000000,00000004,00000000
00000000,00000008,00000000
00000000,00000010,00000000
00000000,00000020,00000000
00000000,00000040,00000000
00000000,00000080,00000000
00000000,00000000,00000008
00000000,00000100,00000000
00000000,00000200,00000000
00000000,00000400,00000000
00000000,00000800,00000000
00000000,00001000,00000000
00000000,00002000,00000000
00000000,00004000,00000000
00000000,00008000,00000000
00000000,00010000,00000000
00000000,00020000,00000000
00000000,00000000,00000010
00000000,00040000,00000000
00000000,00080000,00000000
00000000,00100000,00000000
00000000,00200000,00000000
00000000,00400000,00000000
00000000,00800000,00000000
00000000,01000000,00000000
00000000,02000000,00000000
00000000,04000000,00000000
00000000,08000000,00000000
00000000,00000000,00000020
00000000,10000000,00000000
00000000,20000000,00000000
00000000,40000000,00000000
00000000,80000000,00000000
00000001,00000000,00000000
00000002,00000000,00000000
00000004,00000000,00000000
00000008,00000000,00000000
00000010,00000000,00000000
00000020,00000000,00000000
00000000,00000000,00000040
00000040,00000000,00000000
00000080,00000000,00000000
00000100,00000000,00000000
00000200,00000000,00000000
00000400,00000000,00000000
00000800,00000000,00000000
00001000,00000000,00000000
00002000,00000000,00000000
00004000,00000000,00000000
00008000,00000000,00000000
00000000,00000000,00000080
00010000,00000000,00000000
00020000,00000000,00000000
00040000,00000000,00000000
00080000,00000000,00000000
00100000,00000000,00000000
00200000,00000000,00000000
00400000,00000000,00000000
00800000,00000000,00000000
01000000,00000000,00000000
02000000,00000000,00000000
00000000,00000000,00000100
04000000,00000000,00000000
08000000,00000000,00000000
10000000,00000000,00000000
20000000,00000000,00000000
40000000,00000000,00000000
80000000,00000000,00000000
00000000,00000000,00000200
```

@wwjCMP commented on GitHub (May 3, 2024):

> @wwjCMP if your CPU does not have hyperthreads, then the "thread_siblings" above are supposed to show no siblings, and we should default to one thread per core. If that's not the behavior you're seeing, can you share the output of the above from your system?

Here you are


@dhiltgen commented on GitHub (May 4, 2024):

Thanks for the output @wwjCMP. It does look like the thread_siblings are all unique so we should allocate 96 threads by default on this system. I'll try to figure out why this isn't happening.

As a workaround, you should be able to set num_thread to override our default behavior. For example:

```
% ollama run llama3
>>> /set parameter num_thread 96
Set parameter 'num_thread' to '96'
>>> why is the sky blue?
The color of the sky can vary depending on the time of day, atmospheric conditions, and location. However, under normal conditions, the sky typically appears blue
because of a phenomenon called Rayleigh scattering.
...
```
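
If your ollama build supports the interactive `/save` command (an assumption; check `/?` in the REPL), you can keep that override around instead of re-typing it every session. The model name below is hypothetical:

```
% ollama run llama3
>>> /set parameter num_thread 96
>>> /save llama3-96t     # saves the current session settings as a new model (hypothetical name)
>>> /bye
% ollama run llama3-96t  # starts with num_thread 96 already applied
```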

@oldgithubman commented on GitHub (May 21, 2024):

The current logic is completely borked. On my 13900K (24-core, 32-thread), ollama defaults to using four cores. If I set it to use 24 cores, it uses 16. If I set it to use 32, it uses 24. The default should be to use all the physical cores, which you say *is* the current default, but it isn't. If the user sets num_threads (why isn't this a global setting?), ollama should use the number of threads the user set, *regardless of performance*.


@Googlepuss commented on GitHub (Jun 5, 2024):

I have ollama set up on a VM for testing, with 12 vCPUs (4-socket x 3-core topology) and 16 GB RAM (no GPU). I am not sure where to see the global default num_thread from the CLI, but open-webui indicates "2". I came to this thread looking for a reason why RAM has almost zero utilization (maybe 2-3 GB of the available 16), while the CPU seems to be completely taxed by every query. I could throw more resources at it, but with the RAM seemingly not used, I am wondering if there is a configuration I have overlooked. Most everything is at the default.

```
me@follama:~$ ls /sys/devices/system/cpu/
cpu0 cpu10 cpu2 cpu4 cpu6 cpu8 cpufreq crash_hotplug isolated modalias offline possible present uevent
cpu1 cpu11 cpu3 cpu5 cpu7 cpu9 cpuidle hotplug kernel_max nohz_full online power smt vulnerabilities
me@follama:~$ cat /sys/devices/system/cpu/cpu*/topology/thread_siblings
00000000,00000001
00000000,00000400
00000000,00000800
00000000,00000002
00000000,00000004
00000000,00000008
00000000,00000010
00000000,00000020
00000000,00000040
00000000,00000080
00000000,00000100
00000000,00000200
```

@oldgithubman commented on GitHub (Jun 5, 2024):

> I have ollama set up on a VM for testing, with 12 vCPUs (4-socket x 3-core topology) and 16 GB RAM (no GPU). I am not sure where to see the global default num_thread from the CLI, but open-webui indicates "2". I came to this thread looking for a reason why RAM has almost zero utilization (maybe 2-3 GB of the available 16), while the CPU seems to be completely taxed by every query. I could throw more resources at it, but with the RAM seemingly not used, I am wondering if there is a configuration I have overlooked. Most everything is at the default.

If I understand correctly, the low RAM usage is an illusion due to the way Linux memory mapping works. Pull up a tool like glances and watch the I/O to your drive. As long as it's not obviously streaming from your drive, you're probably fine.


@Googlepuss commented on GitHub (Jun 6, 2024):

Thanks, @oldmanjk! I had not used glances before, and it is super useful. Attaching screenshots from running basic prompts ("sky blue", "tell a joke", "short story", etc.). Disk I/O doesn't stand out, the CPU stays pegged, and memory never exceeds 6.9% (of 16 GB vRAM). This is all llama3:latest with `/set parameter num_gpu 0` and `num_thread 12`. Not sure where to go next, I guess. The CPU is not super modern, but it should be able to handle "tell me a joke" without pegging out.

![Screenshot 2024-06-05 at 7 35 03 PM](https://github.com/ollama/ollama/assets/110204795/519bbd71-049b-49a7-b8f8-8c53589e3caa)
![Screenshot 2024-06-05 at 7 37 05 PM](https://github.com/ollama/ollama/assets/110204795/cc4ea422-1c67-4110-aa07-ebfc7e7f9559)

@oldgithubman commented on GitHub (Jun 6, 2024):

> Thanks, @oldmanjk! I had not used glances before, and it is super useful. Attaching screenshots from running basic prompts ("sky blue", "tell a joke", "short story", etc.). Disk I/O doesn't stand out, the CPU stays pegged, and memory never exceeds 6.9% (of 16 GB vRAM). This is all llama3:latest with `/set parameter num_gpu 0` and `num_thread 12`. Not sure where to go next, I guess. The CPU is not super modern, but it should be able to handle "tell me a joke" without pegging out.

Yeah, I discovered glances recently and like it so far. It has a lot of depth I haven't had the time to explore yet. I thought you were talking about system RAM, not VRAM. VRAM should show as used when in use, so something isn't offloading right. num_gpu is how many layers you want offloaded to gpu, so that explains that. Assuming you want to utilize your gpu more, you want to increase that number, or if you just want ollama to use most of your gpu, delete that parameter entirely

Edit - I see now you mean virtual RAM. I didn't catch the no-gpu thing earlier. Yeah, if you're not using gpu, your CPU has to do all the work, so you should expect full usage. It will go all-out until it's done. Full usage is actually a good thing, because it means you're not bottlenecked by IO. If you want less usage so you can do other things simultaneously, tell ollama to use fewer threads (make sure none of those vCPU's are mapped to vcores). Ideally, you want to offload to a GPU.

(I keep realizing I'm misreading what you wrote - I'm a bit off atm - so if I screwed anything further up, I apologize)


@Googlepuss commented on GitHub (Jun 6, 2024):

> Yeah, I discovered glances recently and like it so far. It has a lot of depth I haven't had the time to explore yet. I thought you were talking about system RAM, not VRAM. VRAM should show as used when in use, so something isn't offloading right. num_gpu is how many layers you want offloaded to gpu, so that explains that. Assuming you want to utilize your gpu more, you want to increase that number, or if you just want ollama to use most of your gpu, delete that parameter entirely
>
> Edit - I see now you mean virtual RAM. I didn't catch the no-gpu thing earlier. Yeah, if you're not using gpu, your CPU has to do all the work, so you should expect full usage. It will go all-out until it's done. Full usage is actually a good thing, because it means you're not bottlenecked by IO. If you want less usage so you can do other things simultaneously, tell ollama to use fewer threads (make sure none of those vCPU's are mapped to vcores). Ideally, you want to offload to a GPU.
>
> (I keep realizing I'm misreading what you wrote - I'm a bit off atm - so if I screwed anything further up, I apologize)

No worries at all, and thanks for the feedback. Yes, "virtual RAM". I am interested in your remark about hyperthreading. I have 2 [Intel E5-2680 v3](https://ark.intel.com/content/www/us/en/ark/products/81908/intel-xeon-processor-e5-2680-v3-30m-cache-2-50-ghz.html) CPUs in the host, with the VM environment managed by xcp-ng/XOA. So the hypervisor sees "48 Core 2 Socket" and "Hyperthread enabled", but in bare-metal terms it is 2 sockets, 24 cores, and 48 threads. With hyperthreading disabled on the host, this would bring the available vCPU count from 48 to 24 (I believe). Is this problematic (having hyperthreading enabled)?


@oldgithubman commented on GitHub (Jun 6, 2024):

> No worries at all, and thanks for the feedback. Yes, "virtual RAM". I am interested in your remark about hyperthreading. I have 2 [Intel E5-2680 v3](https://ark.intel.com/content/www/us/en/ark/products/81908/intel-xeon-processor-e5-2680-v3-30m-cache-2-50-ghz.html) CPUs in the host, with the VM environment managed by xcp-ng/XOA. So the hypervisor sees "48 Core 2 Socket" and "Hyperthread enabled", but in bare-metal terms it is 2 sockets, 24 cores, and 48 threads. With hyperthreading disabled on the host, this would bring the available vCPU count from 48 to 24 (I believe). Is this problematic (having hyperthreading enabled)?

It's not a problem for hyperthreading to be enabled, no. You'll just get better performance (usually - test it) if you restrict inference to physical cores. Also worth testing running inference just on the physical cores of one CPU, depending on memory layout, etc. You're probably losing performance from inter-CPU communication overhead. Limiting to one CPU might cut your available memory in half though. Basically, test all these things if you're looking for better performance, but again, you're much better off using a GPU if you can
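
A sketch of pinning the server to one socket on Linux (assumes `numactl` is installed and that node 0 is the socket whose memory you want to use):

```
# Run the ollama server bound to the cores and memory of NUMA node 0 only.
numactl --cpunodebind=0 --membind=0 ollama serve
```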


@Googlepuss commented on GitHub (Jun 6, 2024):

> It's not a problem for hyperthreading to be enabled, no. You'll just get better performance (usually - test it) if you restrict inference to physical cores. Also worth testing running inference just on the physical cores of one CPU, depending on memory layout, etc. You're probably losing performance from inter-CPU communication overhead. Limiting to one CPU might cut your available memory in half though. Basically, test all these things if you're looking for better performance, but again, you're much better off using a GPU if you can

Thanks! I will take the direction and play with the CPU mappings. Agree on the GPU, just trying to get the CPU-only config optimal before I work in GPU and likely a dedicated host. Appreciate it.


@sekrett commented on GitHub (Jun 7, 2024):

> You'll just get better performance (usually - test it) if you restrict inference to physical cores. Also worth testing running inference just on the physical cores of one CPU, depending on memory layout, etc. You're probably losing performance from inter-CPU communication overhead.

Completely agree. I tested on different desktops and Xeons; without HT you get faster code compilation by setting `make -j8`, for example.


@jasonwang178 commented on GitHub (Jul 2, 2024):

Hi, how can I set the `num_thread` for `ollama serve` instead of `ollama run`?


@jasonwang178 commented on GitHub (Jul 2, 2024):

> Hi, how can I set the `num_thread` for `ollama serve` instead of `ollama run`?

Alright, I found the solution for `ollama serve`. Simply add the `num_thread` parameter when making the API request.

![image](https://github.com/ollama/ollama/assets/222802/deb4ef32-ba60-4ad9-a978-8f0375ca72af)

Reference: https://github.com/ollama/ollama/blob/main/docs/api.md#request-6
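
For readers who can't view the screenshot, a request along these lines is what that reference describes (model name and value are illustrative):

```
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "options": { "num_thread": 16 }
}'
```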


@GregChiang0201 commented on GitHub (Jul 29, 2024):

Hi, how can I set the num_thread for ollama serve instead of ollama run?

Alright, I found the solution for ollama serve. Simply add the num_thread parameter when making the API request.

image

Reference: https://github.com/ollama/ollama/blob/main/docs/api.md#request-6

Excuse me, is there any option to change num_thread permanently? At the moment I have to run the instructions above every time to set a custom thread count, or the system will only use half of my cores.

<!-- gh-comment-id:2255308844 --> @GregChiang0201 commented on GitHub (Jul 29, 2024): > > Hi, how can I set the `num_thread` for `ollama serve` instead of `ollama run`? > > Alright, I found the solution for `ollama serve`. Simply add the `num_thread` parameter when making the API request. > > ![image](https://private-user-images.githubusercontent.com/222802/344922090-deb4ef32-ba60-4ad9-a978-8f0375ca72af.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjIyNDEwODYsIm5iZiI6MTcyMjI0MDc4NiwicGF0aCI6Ii8yMjI4MDIvMzQ0OTIyMDkwLWRlYjRlZjMyLWJhNjAtNGFkOS1hOTc4LThmMDM3NWNhNzJhZi5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzI5JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDcyOVQwODEzMDZaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT0wNmE1ZDVhYjM3ODc0ODU3MTE2NDIzMGFkMGVkYTQ3ZTc0MmY0Y2Y3Y2FiMDQyNTI0YTZkMTk1MTNkMzQ0MzM2JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.SSDlNwvpSpk7INZhgFg5H3jsPml_Z0r9bCQyARBndvI) > > Reference: https://github.com/ollama/ollama/blob/main/docs/api.md#request-6 Excuse me, is there any option can change num_thread permanently? Since I can only run the instructions above every time to set the custom threads, or the system will use half of my cores to run it.
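One way to make the setting stick, sketched below, is the derived-model route that also comes up later in the thread: bake the parameter into a new tag with a two-line Modelfile. The model name and thread count here are placeholders, and since `ollama create` layers parameters on top of the existing blobs it should not duplicate the weights on disk.

```bash
# Modelfile: inherit everything from the base model, override only num_thread
cat > Modelfile <<'EOF'
FROM llama3
PARAMETER num_thread 16
EOF

ollama create llama3-16t -f Modelfile   # new tag with the parameter baked in
ollama run llama3-16t                   # now runs with 16 threads by default
```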
Author
Owner

@dbustosrc commented on GitHub (Jul 30, 2024):

#5554

<!-- gh-comment-id:2259105031 --> @dbustosrc commented on GitHub (Jul 30, 2024): #5554
Author
Owner

@tigran123 commented on GitHub (Oct 1, 2025):

Oh dear, how confident are we in forcing a policy where only a mechanism should be provided? We seem to have forgotten the wise words of Linus Torvalds from the early 1990s: an OS (and software in general) should never force a policy -- stupid, perhaps, to a user who knows better than the developer -- down all users' throats; it should only provide mechanisms and leave it up to the user to decide which policy to adopt.

Now, in this case, I would say it is very naive to assume that everyone's CPU will behave exactly the same way as the particular CPU the developer tested on. The whole point of virtual threads appearing as "separate CPUs" (e.g. in /proc/cpuinfo) is that they really ought to be used as separate CPUs in the sense of the MP specification. That was the case 20 years ago and it is true today. Yes, in some (in many, I agree) scenarios it would be inefficient to use more than the number of physical cores, i.e. more than half of the virtual threads, but in others it would be more efficient. Otherwise there would be no such thing as HT support.

So, please provide a way to set num_thread, other than passing "num_thread" via the API (unless it is possible to set it via Open WebUI -- is it?), and without having to create a new model via a Modelfile specifically for this purpose.

I tried setting the OLLAMA_NUM_THREADS environment variable to 12 and it made a difference only at the initial load (which became almost twice as fast, btw!), but during the actual inference it reverted back to 6. I have an Intel Core i7-6800K CPU, which has 6 cores and 12 threads (HT is of course enabled, as it generally makes everything TWICE as fast -- that is the whole point of HT).

I found the PR for OLLAMA_NUM_PARALLEL, but it is not merged yet:

https://github.com/ollama/ollama/pull/9546

So, are there any workarounds, please?

UPDATE: Could it be that I simply misspelled OLLAMA_NUM_THREADS and it should be OLLAMA_NUM_THREAD instead? I will try both and see.

UPDATE: No, unfortunately, I still have 50% CPU utilisation, i.e. num_thread is internally set to 6, ignoring both OLLAMA_NUM_THREAD and OLLAMA_NUM_THREADS environment variables.

<!-- gh-comment-id:3355162375 --> @tigran123 commented on GitHub (Oct 1, 2025): Oh dear, how confident are we in forcing a policy, where only a mechanism should be provided, forgetting the wise words of Linus Torvalds back from early 1990s, explaining that OS (and software in general) should never force any stupid (i.e. stupid to a user who knows perhaps better than the developer) _policy_ down all users' throat and only providing _mechanisms_, leaving it up to the user to decide, which policy he ought to adopt. Now, in this case, I would say it is very naive to assume that everyone's CPU would behave in exactly the same way that a particular CPU that the developer has tested on does. The whole point of virtual threads appearing as "separate CPUs" (e.g. in /proc/cpuinfo) is that they ought to really be used as separate CPUs in the sense of MP specification. That was the case 20 years ago and it is true today. Yes, in some (in many, I agree) scenarios it would be inefficient to use more than the number of physical cores, i.e. more than half of the virtual threads, but in others it would be more efficient. Otherwise there would be no such thing as HT support. So, please provide a way (other than passing "num_thread" via API, unless it is possible to set it via Open WebUI, is it?) to set the num_thread somehow. Again, without having to create a new model via Modelfile specifically for this purpose. I tried setting OLLAMA_NUM_THREADS environment variable to 12 and it did make a difference only at the initial load (which became almost twice as fast, btw!) but during the actual inference it reverted back to 6. I have Intel Core i7-6800K CPU which has 6 cores and 12 threads (of course HT is enabled, as it generally makes everything TWICE as fast -- that is the whole point of HT). I found the PR for OLLAMA_NUM_PARALLEL, but it is not merged yet: https://github.com/ollama/ollama/pull/9546 So, are there any workarounds, please? UPDATE: Could it be that I simply misspelled OLLAMA_NUM_THREADS and it should be OLLAMA_NUM_THREAD instead? I will try both and see. UPDATE: No, unfortunately, I still have 50% CPU utilisation, i.e. num_thread is internally set to 6, ignoring both OLLAMA_NUM_THREAD and OLLAMA_NUM_THREADS environment variables.
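A rough way to see what thread count actually took effect is to count the OS threads of the running processes. The sketch below assumes the runner shows up under a process name matching `ollama` (names differ between versions), and NLWP includes housekeeping threads on top of the compute threads, so treat the number as an approximation:

```bash
# Print PID, (truncated) command line and OS thread count for every ollama-related process
for pid in $(pgrep -f ollama); do
  printf '%-8s %-40.40s %s threads\n' "$pid" "$(ps -p "$pid" -o cmd=)" "$(ps -p "$pid" -o nlwp=)"
done
```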
Author
Owner

@tigran123 commented on GitHub (Oct 1, 2025):

UPDATE: I was reluctant to do ollama create model -f modelfile because I thought that it was going to duplicate the whole 65GB file :)

Now I have created a num_thread 12 version of gpt-oss:120b and compared its performance to the default one -- indeed, it is twice as slow -- 40 seconds vs 21 seconds for inference on the same prompt (from fresh state of ollama serve).

Thank you for your patience :)

<!-- gh-comment-id:3358068114 --> @tigran123 commented on GitHub (Oct 1, 2025): UPDATE: I was reluctant to do `ollama create model -f modelfile` because I thought that it was going to duplicate the whole 65GB file :) Now I have created a `num_thread 12` version of gpt-oss:120b and compared its performance to the default one -- indeed, it is twice as slow -- 40 seconds vs 21 seconds for inference on the same prompt (from fresh state of `ollama serve`). Thank you for your patience :)
Author
Owner

@tigran123 commented on GitHub (Oct 5, 2025):

Another update: I don't know how I got those "40 vs 21 seconds" numbers before, but now I consistently get the opposite, namely 17-19 seconds for the 12-thread version (100% CPU utilisation) and 28-31 seconds for the 6-thread version (50% CPU utilisation). Here are the screenshots, and I can easily reproduce it. So the default num_thread is set wrongly after all. It should be set to the logically obvious value -- the number of CPUs (i.e. virtual threads) -- and not the number of physical cores as it is currently. At least on my Intel Core i7-6800K system with 128GB RAM.

Image Image
<!-- gh-comment-id:3369325526 --> @tigran123 commented on GitHub (Oct 5, 2025): Another update: I don't know how did I get those "40 vs 21 seconds" numbers before, but now I _consistently_ get the opposite, namely: 17-19 seconds for the 12 threads version (100% cpu utilisation) and 28-31 seconds for the 6 threads version (50% CPU utilisation). Here are the screenshots and I can easily reproduce it. So, the default num_thread is set wrongly, after all. It should be set to the logically obvious value -- the number of CPUs (i.e. virtual threads) and not the number of physical cores as it does currently. At least on my Intel Core i7-6800K system with 128GB RAM. <img width="1719" height="360" alt="Image" src="https://github.com/user-attachments/assets/f585cfeb-9417-4370-b1d4-760edde88dfb" /> <img width="1719" height="360" alt="Image" src="https://github.com/user-attachments/assets/dd1a1b42-41c7-4536-9b6a-758b1b84c78d" />
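For anyone reproducing this kind of comparison, `ollama run --verbose` prints prompt and eval timings (including tokens per second) after each response, which is less noisy than a stopwatch. A rough sketch, assuming two tags that differ only in num_thread (the names are placeholders for clones made with the Modelfile approach):

```bash
ollama run --verbose gpt-oss-6t  "Explain NUMA in two sentences."
ollama run --verbose gpt-oss-12t "Explain NUMA in two sentences."
```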
Author
Owner

@minyor commented on GitHub (Oct 10, 2025):

Hello, I have the same problem on our E5-2699 v3 server with 70+ cores.
Small models on ollama version 0.1.38 load and respond fast, under 4 seconds per request.
But starting from ollama version 0.1.39 and up to the latest version, loading and responses take up to 15 minutes...
Setting OLLAMA_NUM_THREADS, OLLAMA_NUM_THREAD, or num_thread to 70 doesn't help.
Please advise, are we really stuck with this old ollama?

<!-- gh-comment-id:3389251779 --> @minyor commented on GitHub (Oct 10, 2025): Hello, I have same problem on our E5-2699 v3 server with 70+ cores. Small models on ollama version is 0.1.38 loads and works fast under 4 seconds per request But starting from ollama version 0.1.39 and up to latest version the loading and resposes take up to 15 minutes... Setting OLLAMA_NUM_THREADS or OLLAMA_NUM_THREAD nor num_thread to 70 wont help.. Please adwise, are we really stack with this old ollama?
Author
Owner

@tigran123 commented on GitHub (Oct 10, 2025):

Why not do what I did -- have a trivial Modelfile like this:

```
FROM modelname
PARAMETER num_thread <N>
```

then it works perfectly, provided num_thread matches the number of real CPUs in the SMP sense (not some secondary notion of the so-called "physical cores", which is not useful and that is why it is not normally exported via the API -- though one can manually count the core id values in /proc/cpuinfo, of course).

It is sad that the Ollama developers are stubbornly refusing to change the defaults, presumably because some exotic tests done by someone suggested that using only half of the available CPUs is somehow beneficial. But this is not critical -- just create a proper model clone for each of your models (i.e. with num_thread matching the number of CPUs) and everything works fine, and much faster than with this unfortunate default.

<!-- gh-comment-id:3389283066 --> @tigran123 commented on GitHub (Oct 10, 2025): Why not do what I did -- have a trivial Modelfile like this: ``` FROM modelname PARAMETER num_thread <N> ``` then it works perfectly if `num_thread` is matching the number of real CPUs in the SMP sense (not some secondary notion of the so-called "physical cores" which is not useful and that is why it is not normally exported via API -- though one can manually count the `core id` values in `/proc/cpuinfo`, of course). It is sad that the Ollama developers are stubbornly refusing to change the defaults, presumably because some exotic tests done by someone suggested that using only half of the available CPUs is somehow beneficial. But this is not critical -- just create a proper model clone for each of your models (i.e. with num_thread matching the number of CPUs) and everything works fine, and much faster than with this unfortunate default.
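For completeness, counting physical cores from /proc/cpuinfo needs the (physical id, core id) pair, because core ids repeat across sockets; `lscpu` gives the same numbers more directly. A small sketch:

```bash
# Physical cores = number of distinct (socket, core) pairs
awk -F: '/physical id/{sock=$2} /core id/{print sock ":" $2}' /proc/cpuinfo | sort -u | wc -l

# Or read it straight from lscpu: Sockets x Cores per socket
lscpu | grep -E '^(Socket\(s\)|Core\(s\) per socket)'

# Logical CPUs, i.e. what you would pass to num_thread to use every HT sibling
nproc --all
```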
Author
Owner

@minyor commented on GitHub (Oct 10, 2025):

But I tried this; I modified the existing model's parameter like so:

```
ollama run tinyllama:1.1b
>>> /set parameter num_thread 72
>>> /save tinyllama:1.1b
>>> (Ctrl+D to exit)
service ollama restart
```

But this only made my model run slowly, even on ollama 0.1.38, not just on >=0.1.39.
Only deleting the model and redownloading it made it run fast again on 0.1.38.
And I do not mean a little slow, but about 90 times slower...
I surely must have done something incorrect.

Edit: Also, when I run requests, all 72 cores are at 100% load for the whole of these 15 slow minutes...
If this is a problem with the number of threads, then shouldn't only one or a few cores be loaded instead of all of them?

Edit: If I edit the model again but this time set num_thread 0, like so:

```
ollama run tinyllama:1.1b
>>> /set parameter num_thread 0
>>> /save tinyllama:1.1b
>>> (Ctrl+D to exit)
service ollama restart
```

then it again starts to work fast, but only in ollama <=0.1.38.

<!-- gh-comment-id:3389333660 --> @minyor commented on GitHub (Oct 10, 2025): But I tried this, I modified existing model parameter like so: >ollama run tinyllama:1.1b >> /set parameter num_thread 72 >> /save tinyllama:1.1b >> Ctrl+D >service ollama restart But this only made my model run slow even on an 0.1.38 version of ollama, not only on >=0.1.39 Only deleting a model and redownloading it helped to make it run fast again on 0.1.38 And I do not mean a little slow, but about 90 times slower... I surelly must've did something incorrect Edit: Also, when I run requests, all the 72 cores are at 100% of load, whole these 15 slow minutes... If this is a problem with number of threads then shouldn't only one or some cores be loaded instead of all of them? Edit: If I edit the model again but this time set num_threads 0 likeso: >ollama run tinyllama:1.1b >> /set parameter num_thread 0 >> /save tinyllama:1.1b >> Ctrl+D >service ollama restart Then It again starts to work fast but only in ollama <=0.1.38
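Without more detail it is hard to say what regressed between 0.1.38 and 0.1.39, but a dual-socket E5-2699 v3 (18 physical cores per socket, 72 threads total) is exactly the kind of machine where oversubscribing threads and crossing the socket boundary can collapse throughput. Two cheap experiments, assuming `numactl` is installed and NUMA node 0 maps to one socket: bind the server to a single node, and cap num_thread at that socket's physical core count.

```bash
# Keep the server's CPUs and memory on one socket
numactl --cpunodebind=0 --membind=0 ollama serve

# In another shell, send a request with num_thread capped at 18 (one socket's physical cores)
curl http://localhost:11434/api/generate -d '{
  "model": "tinyllama:1.1b",
  "prompt": "hello",
  "stream": false,
  "options": { "num_thread": 18 }
}'
```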
Reference: github-starred/ollama#47971