[GH-ISSUE #2496] default num_thread incorrect on some large core count system (non-hyperthreading) #63497

New Issue

GiteaMirror · 2026-05-03T13:49:59-05:00

GiteaMirror commented

2026-05-03 13:49:59 -05:00

Originally created by @mokkin on GitHub (Feb 14, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2496

Originally assigned to: @dhiltgen on GitHub.

I have tested Ollama on different machines yet, but no matter how many cores or RAM I have, it's only using 50% of the cores and just a very few GB of RAM.
For example now I'm running ollama rum llama2:70b on 16 core server with 32 GB of RAM, but while prompting only eight cores are used and just around 1 GB of RAM.

Is there something wrong? In the models descriptions are aleways warning you neet at least 8,16,32,... GB of RAM.

Originally created by @mokkin on GitHub (Feb 14, 2024). Original GitHub issue: https://github.com/ollama/ollama/issues/2496 Originally assigned to: @dhiltgen on GitHub. I have tested Ollama on different machines yet, but no matter how many cores or RAM I have, it's only using 50% of the cores and just a very few GB of RAM. For example now I'm running `ollama rum llama2:70b` on 16 core server with 32 GB of RAM, but while prompting only eight cores are used and just around 1 GB of RAM. Is there something wrong? In the models descriptions are aleways warning you neet at least 8,16,32,... GB of RAM. ![Bildschirmfoto vom 2024-02-14 18-08-47](https://github.com/ollama/ollama/assets/2938748/8a47ec55-475d-4311-8110-3ca1e0a34cb8)

GiteaMirror added the bug label 2026-05-03 13:49:59 -05:00

GiteaMirror closed this issue

2026-05-03 13:50:02 -05:00

GiteaMirror commented

2026-05-03 13:50:05 -05:00

@easp commented on GitHub (Feb 14, 2024):

That's fine & as expected.

Model data is memory mapped and shows up in file cache #. Note too, VIRT, RES & SHR memory # of the Ollama processes.

Generation is memory bandwidth limited, not compute limited. Saturation is generally achieved ~1/2 the number of virtual cores. Using more can actually hurt speeds and interferes unnecessarily with other processes.

@easp commented on GitHub (Feb 14, 2024): That's fine & as expected. Model data is memory mapped and shows up in file cache #. Note too, VIRT, RES & SHR memory # of the Ollama processes. Generation is memory bandwidth limited, not compute limited. Saturation is generally achieved ~1/2 the number of virtual cores. Using more can actually hurt speeds and interferes unnecessarily with other processes.

GiteaMirror commented

2026-05-03 13:50:08 -05:00

@Zbrooklyn commented on GitHub (Feb 18, 2024):

What if I want it to use more CPU cores is there a config file or command for that?

@Zbrooklyn commented on GitHub (Feb 18, 2024): What if I want it to use more CPU cores is there a config file or command for that?

GiteaMirror commented

2026-05-03 13:50:09 -05:00

@robertvazan commented on GitHub (Feb 23, 2024):

@Zbrooklyn Change 'num_thread' parameter in custom modelfile.

@robertvazan commented on GitHub (Feb 23, 2024): @Zbrooklyn Change 'num_thread' [parameter in custom modelfile](https://github.com/ollama/ollama/blob/main/docs/modelfile.md#parameter).

GiteaMirror commented

2026-05-03 13:50:11 -05:00

@Cguanqin commented on GitHub (Mar 8, 2024):

@Zbrooklyn更改自定义模型文件中的“num_thread”参数。

hei,bro~
it's still using 50% of the cores and 50% of RAM after modifying Modelfile that increased num_thread.from 8 to 16 。

my Modelfile is as follows:
FROM gemma:2b
PARAMETER num_thread 16

@Cguanqin commented on GitHub (Mar 8, 2024): > @Zbrooklyn更改自定义模型文件中的“num_thread”[参数](https://github.com/ollama/ollama/blob/main/docs/modelfile.md#parameter)。 hei,bro~ it's still using 50% of the cores and 50% of RAM after modifying Modelfile that increased num_thread.from 8 to 16 。 my Modelfile is as follows: FROM gemma:2b PARAMETER num_thread 16

GiteaMirror commented

2026-05-03 13:50:16 -05:00

@robertvazan commented on GitHub (Mar 8, 2024):

@Cguanqin You are probably doing something wrong. Try to write down reproducible steps. That will probably reveal the mistake.

@robertvazan commented on GitHub (Mar 8, 2024): @Cguanqin You are probably doing something wrong. Try to write down reproducible steps. That will probably reveal the mistake.

GiteaMirror commented

2026-05-03 13:50:20 -05:00

@Cguanqin commented on GitHub (Mar 8, 2024):

@Cguanqin You are probably doing something wrong. Try to write down reproducible steps. That will probably reveal the mistake.

oh，I don't know where the problem is. I want to run the Gemma: 2b model in Ollama. My machine is configured with 16CPU and 16RAM. Can you provide a modelfile reference? Thank you！

@Cguanqin commented on GitHub (Mar 8, 2024): > @Cguanqin You are probably doing something wrong. Try to write down reproducible steps. That will probably reveal the mistake. oh，I don't know where the problem is. I want to run the Gemma: 2b model in Ollama. My machine is configured with 16CPU and 16RAM. Can you provide a modelfile reference? Thank you！

GiteaMirror commented

2026-05-03 13:50:22 -05:00

@robertvazan commented on GitHub (Mar 8, 2024):

@Cguanqin Your modelfile looks alright. The problem is probably in the way you try to use it. I am using Open WebUI with Ollama, which provides convenient UI for modelfile editing. But if I were you, I wouldn't waste effort on raising thread count. I tried it and it does indeed worsen throughput. One thread per physical core is indeed optimal.

@robertvazan commented on GitHub (Mar 8, 2024): @Cguanqin Your modelfile looks alright. The problem is probably in the way you try to use it. I am using Open WebUI with Ollama, which provides convenient UI for modelfile editing. But if I were you, I wouldn't waste effort on raising thread count. I tried it and it does indeed worsen throughput. One thread per physical core is indeed optimal.

GiteaMirror commented

2026-05-03 13:50:23 -05:00

@AdamYLK commented on GitHub (Mar 15, 2024):

same issue for me，i find another issue
Performance e-core bug(?) - only 50% CPU utilization when using all threads - (Win11, Intel 13900k)

i can't use -t to set thread on linux server

@AdamYLK commented on GitHub (Mar 15, 2024): same issue for me，i find another issue [Performance e-core bug(?) - only 50% CPU utilization when using all threads - (Win11, Intel 13900k)](https://github.com/ggerganov/llama.cpp/issues/842) i can't use -t to set thread on linux server

GiteaMirror commented

2026-05-03 13:50:24 -05:00

@arbarbieri commented on GitHub (Apr 1, 2024):

@Zbrooklyn Change 'num_thread' parameter in custom modelfile.

Worked for me! Thanx!

@arbarbieri commented on GitHub (Apr 1, 2024): > @Zbrooklyn Change 'num_thread' [parameter in custom modelfile](https://github.com/ollama/ollama/blob/main/docs/modelfile.md#parameter). Worked for me! Thanx!

GiteaMirror commented

2026-05-03 13:50:26 -05:00

@johnalanwoods commented on GitHub (Apr 23, 2024):

Model data is memory mapped and shows up in file cache #. Note too, VIRT, RES & SHR memory # of the Ollama processes.

Sorry to necro the thread, but why is this done specifically? Are only random portions of the model used at runtime?

@johnalanwoods commented on GitHub (Apr 23, 2024): > Model data is memory mapped and shows up in file cache #. Note too, VIRT, RES & SHR memory # of the Ollama processes. Sorry to necro the thread, but why is this done specifically? Are only random portions of the model used at runtime?

GiteaMirror commented

2026-05-03 13:50:27 -05:00

@robertvazan commented on GitHub (Apr 23, 2024):

@johnalanwoods Not a llama.cpp/ollama developer, but let me guess. One of the advantages is that you can have model larger than RAM. It will be super slow, limited by SSD speed, but it will work. Another reason is that OS is given a chance to discard the pages when the model is not in use and load them back when model is used again. There might be other reasons.

@robertvazan commented on GitHub (Apr 23, 2024): @johnalanwoods Not a llama.cpp/ollama developer, but let me guess. One of the advantages is that you can have model larger than RAM. It will be super slow, limited by SSD speed, but it will work. Another reason is that OS is given a chance to discard the pages when the model is not in use and load them back when model is used again. There might be other reasons.

GiteaMirror commented

2026-05-03 13:50:30 -05:00

@johnalanwoods commented on GitHub (Apr 24, 2024):

Makes sense thanks!

@johnalanwoods commented on GitHub (Apr 24, 2024): Makes sense thanks!

GiteaMirror commented

2026-05-03 13:50:31 -05:00

@sekrett commented on GitHub (Apr 29, 2024):

But if I were you, I wouldn't waste effort on raising thread count. I tried it and it does indeed worsen throughput. One thread per physical core is indeed optimal.

This is true if you have hyper-threading enabled, but if disabled, only half of the CPU power is used.

@sekrett commented on GitHub (Apr 29, 2024): > But if I were you, I wouldn't waste effort on raising thread count. I tried it and it does indeed worsen throughput. One thread per physical core is indeed optimal. This is true if you have hyper-threading enabled, but if disabled, only half of the CPU power is used.

GiteaMirror commented

2026-05-03 13:50:33 -05:00

@dhiltgen commented on GitHub (May 2, 2024):

Yes, by default we only create a thread per physical core. Trying to map inference threads to hyperthreads thrashes the CPU and results in poorer performance.

@dhiltgen commented on GitHub (May 2, 2024): Yes, by default we only create a thread per physical core. Trying to map inference threads to hyperthreads thrashes the CPU and results in poorer performance.

GiteaMirror commented

2026-05-03 13:50:34 -05:00

@sekrett commented on GitHub (May 3, 2024):

I don't have hyper-treading at all (Core i5 9600K) and still 3 of 6 cores are used. How can ollama distinguish if hyper-threading is enabled?

@sekrett commented on GitHub (May 3, 2024): I don't have hyper-treading at all (Core i5 9600K) and still 3 of 6 cores are used. How can ollama distinguish if hyper-threading is enabled?

GiteaMirror commented

2026-05-03 13:50:35 -05:00

@dhiltgen commented on GitHub (May 3, 2024):

@sekrett the following might help shed some light...

ls /sys/devices/system/cpu/
cat /sys/devices/system/cpu/cpu*/topology/thread_siblings

For reference, on a 4-core (8 hyperthread) Intel CPU I see something like this:

% ls /sys/devices/system/cpu/
cpu0  cpu2  cpu4  cpu6  cpufreq  hotplug       isolated    microcode  offline  possible  present  uevent
cpu1  cpu3  cpu5  cpu7  cpuidle  intel_pstate  kernel_max  modalias   online   power     smt      vulnerabilities
% cat /sys/devices/system/cpu/cpu*/topology/thread_siblings
11
22
44
88
11
22
44
88

@dhiltgen commented on GitHub (May 3, 2024): @sekrett the following might help shed some light... ``` ls /sys/devices/system/cpu/ cat /sys/devices/system/cpu/cpu*/topology/thread_siblings ``` For reference, on a 4-core (8 hyperthread) Intel CPU I see something like this: ``` % ls /sys/devices/system/cpu/ cpu0 cpu2 cpu4 cpu6 cpufreq hotplug isolated microcode offline possible present uevent cpu1 cpu3 cpu5 cpu7 cpuidle intel_pstate kernel_max modalias online power smt vulnerabilities % cat /sys/devices/system/cpu/cpu*/topology/thread_siblings 11 22 44 88 11 22 44 88 ```

GiteaMirror commented

2026-05-03 13:50:36 -05:00

@wwjCMP commented on GitHub (May 3, 2024):

@sekrett the following might help shed some light...

ls /sys/devices/system/cpu/
cat /sys/devices/system/cpu/cpu*/topology/thread_siblings

For reference, on a 4-core (8 hyperthread) Intel CPU I see something like this:

% ls /sys/devices/system/cpu/
cpu0  cpu2  cpu4  cpu6  cpufreq  hotplug       isolated    microcode  offline  possible  present  uevent
cpu1  cpu3  cpu5  cpu7  cpuidle  intel_pstate  kernel_max  modalias   online   power     smt      vulnerabilities
% cat /sys/devices/system/cpu/cpu*/topology/thread_siblings
11
22
44
88
11
22
44
88

If my machine does not have hyper-threading enabled, how can I set it up to allow more physical cores to be used by ollama?

@wwjCMP commented on GitHub (May 3, 2024): > @sekrett the following might help shed some light... > > ``` > ls /sys/devices/system/cpu/ > cat /sys/devices/system/cpu/cpu*/topology/thread_siblings > ``` > > For reference, on a 4-core (8 hyperthread) Intel CPU I see something like this: > > ``` > % ls /sys/devices/system/cpu/ > cpu0 cpu2 cpu4 cpu6 cpufreq hotplug isolated microcode offline possible present uevent > cpu1 cpu3 cpu5 cpu7 cpuidle intel_pstate kernel_max modalias online power smt vulnerabilities > % cat /sys/devices/system/cpu/cpu*/topology/thread_siblings > 11 > 22 > 44 > 88 > 11 > 22 > 44 > 88 > ``` If my machine does not have hyper-threading enabled, how can I set it up to allow more physical cores to be used by ollama?

GiteaMirror commented

2026-05-03 13:50:37 -05:00

@dhiltgen commented on GitHub (May 3, 2024):

@wwjCMP if your CPU does not have hyperthreads, then the "thread_siblings" above are supposed to show no siblings, and we should default to one thread per core. If that's not the behavior you're seeing, can you share the output of the above from your system?

@dhiltgen commented on GitHub (May 3, 2024): @wwjCMP if your CPU does not have hyperthreads, then the "thread_siblings" above are supposed to show no siblings, and we should default to one thread per core. If that's not the behavior you're seeing, can you share the output of the above from your system?

GiteaMirror commented

2026-05-03 13:50:37 -05:00

@wwjCMP commented on GitHub (May 3, 2024):

cpu0   cpu14  cpu2   cpu25  cpu30  cpu36  cpu41  cpu47  cpu52  cpu58  cpu63  cpu69  cpu74  cpu8   cpu85  cpu90  cpufreq                 kernel_max  power
cpu1   cpu15  cpu20  cpu26  cpu31  cpu37  cpu42  cpu48  cpu53  cpu59  cpu64  cpu7   cpu75  cpu80  cpu86  cpu91  cpuidle                 microcode   present
cpu10  cpu16  cpu21  cpu27  cpu32  cpu38  cpu43  cpu49  cpu54  cpu6   cpu65  cpu70  cpu76  cpu81  cpu87  cpu92  hotplug                 modalias    smt
cpu11  cpu17  cpu22  cpu28  cpu33  cpu39  cpu44  cpu5   cpu55  cpu60  cpu66  cpu71  cpu77  cpu82  cpu88  cpu93  intel_pstate            offline     uevent
cpu12  cpu18  cpu23  cpu29  cpu34  cpu4   cpu45  cpu50  cpu56  cpu61  cpu67  cpu72  cpu78  cpu83  cpu89  cpu94  intel_uncore_frequency  online      umwait_control
cpu13  cpu19  cpu24  cpu3   cpu35  cpu40  cpu46  cpu51  cpu57  cpu62  cpu68  cpu73  cpu79  cpu84  cpu9   cpu95  isolated                possible    vulnerabilities

@wwjCMP commented on GitHub (May 3, 2024): ``` cpu0 cpu14 cpu2 cpu25 cpu30 cpu36 cpu41 cpu47 cpu52 cpu58 cpu63 cpu69 cpu74 cpu8 cpu85 cpu90 cpufreq kernel_max power cpu1 cpu15 cpu20 cpu26 cpu31 cpu37 cpu42 cpu48 cpu53 cpu59 cpu64 cpu7 cpu75 cpu80 cpu86 cpu91 cpuidle microcode present cpu10 cpu16 cpu21 cpu27 cpu32 cpu38 cpu43 cpu49 cpu54 cpu6 cpu65 cpu70 cpu76 cpu81 cpu87 cpu92 hotplug modalias smt cpu11 cpu17 cpu22 cpu28 cpu33 cpu39 cpu44 cpu5 cpu55 cpu60 cpu66 cpu71 cpu77 cpu82 cpu88 cpu93 intel_pstate offline uevent cpu12 cpu18 cpu23 cpu29 cpu34 cpu4 cpu45 cpu50 cpu56 cpu61 cpu67 cpu72 cpu78 cpu83 cpu89 cpu94 intel_uncore_frequency online umwait_control cpu13 cpu19 cpu24 cpu3 cpu35 cpu40 cpu46 cpu51 cpu57 cpu62 cpu68 cpu73 cpu79 cpu84 cpu9 cpu95 isolated possible vulnerabilities ```

GiteaMirror commented

2026-05-03 13:50:38 -05:00

@wwjCMP commented on GitHub (May 3, 2024):

00000000,00000000,00000001
00000000,00000000,00000400
00000000,00000000,00000800
00000000,00000000,00001000
00000000,00000000,00002000
00000000,00000000,00004000
00000000,00000000,00008000
00000000,00000000,00010000
00000000,00000000,00020000
00000000,00000000,00040000
00000000,00000000,00080000
00000000,00000000,00000002
00000000,00000000,00100000
00000000,00000000,00200000
00000000,00000000,00400000
00000000,00000000,00800000
00000000,00000000,01000000
00000000,00000000,02000000
00000000,00000000,04000000
00000000,00000000,08000000
00000000,00000000,10000000
00000000,00000000,20000000
00000000,00000000,00000004
00000000,00000000,40000000
00000000,00000000,80000000
00000000,00000001,00000000
00000000,00000002,00000000
00000000,00000004,00000000
00000000,00000008,00000000
00000000,00000010,00000000
00000000,00000020,00000000
00000000,00000040,00000000
00000000,00000080,00000000
00000000,00000000,00000008
00000000,00000100,00000000
00000000,00000200,00000000
00000000,00000400,00000000
00000000,00000800,00000000
00000000,00001000,00000000
00000000,00002000,00000000
00000000,00004000,00000000
00000000,00008000,00000000
00000000,00010000,00000000
00000000,00020000,00000000
00000000,00000000,00000010
00000000,00040000,00000000
00000000,00080000,00000000
00000000,00100000,00000000
00000000,00200000,00000000
00000000,00400000,00000000
00000000,00800000,00000000
00000000,01000000,00000000
00000000,02000000,00000000
00000000,04000000,00000000
00000000,08000000,00000000
00000000,00000000,00000020
00000000,10000000,00000000
00000000,20000000,00000000
00000000,40000000,00000000
00000000,80000000,00000000
00000001,00000000,00000000
00000002,00000000,00000000
00000004,00000000,00000000
00000008,00000000,00000000
00000010,00000000,00000000
00000020,00000000,00000000
00000000,00000000,00000040
00000040,00000000,00000000
00000080,00000000,00000000
00000100,00000000,00000000
00000200,00000000,00000000
00000400,00000000,00000000
00000800,00000000,00000000
00001000,00000000,00000000
00002000,00000000,00000000
00004000,00000000,00000000
00008000,00000000,00000000
00000000,00000000,00000080
00010000,00000000,00000000
00020000,00000000,00000000
00040000,00000000,00000000
00080000,00000000,00000000
00100000,00000000,00000000
00200000,00000000,00000000
00400000,00000000,00000000
00800000,00000000,00000000
01000000,00000000,00000000
02000000,00000000,00000000
00000000,00000000,00000100
04000000,00000000,00000000
08000000,00000000,00000000
10000000,00000000,00000000
20000000,00000000,00000000
40000000,00000000,00000000
80000000,00000000,00000000
00000000,00000000,00000200

@wwjCMP commented on GitHub (May 3, 2024): ``` 00000000,00000000,00000001 00000000,00000000,00000400 00000000,00000000,00000800 00000000,00000000,00001000 00000000,00000000,00002000 00000000,00000000,00004000 00000000,00000000,00008000 00000000,00000000,00010000 00000000,00000000,00020000 00000000,00000000,00040000 00000000,00000000,00080000 00000000,00000000,00000002 00000000,00000000,00100000 00000000,00000000,00200000 00000000,00000000,00400000 00000000,00000000,00800000 00000000,00000000,01000000 00000000,00000000,02000000 00000000,00000000,04000000 00000000,00000000,08000000 00000000,00000000,10000000 00000000,00000000,20000000 00000000,00000000,00000004 00000000,00000000,40000000 00000000,00000000,80000000 00000000,00000001,00000000 00000000,00000002,00000000 00000000,00000004,00000000 00000000,00000008,00000000 00000000,00000010,00000000 00000000,00000020,00000000 00000000,00000040,00000000 00000000,00000080,00000000 00000000,00000000,00000008 00000000,00000100,00000000 00000000,00000200,00000000 00000000,00000400,00000000 00000000,00000800,00000000 00000000,00001000,00000000 00000000,00002000,00000000 00000000,00004000,00000000 00000000,00008000,00000000 00000000,00010000,00000000 00000000,00020000,00000000 00000000,00000000,00000010 00000000,00040000,00000000 00000000,00080000,00000000 00000000,00100000,00000000 00000000,00200000,00000000 00000000,00400000,00000000 00000000,00800000,00000000 00000000,01000000,00000000 00000000,02000000,00000000 00000000,04000000,00000000 00000000,08000000,00000000 00000000,00000000,00000020 00000000,10000000,00000000 00000000,20000000,00000000 00000000,40000000,00000000 00000000,80000000,00000000 00000001,00000000,00000000 00000002,00000000,00000000 00000004,00000000,00000000 00000008,00000000,00000000 00000010,00000000,00000000 00000020,00000000,00000000 00000000,00000000,00000040 00000040,00000000,00000000 00000080,00000000,00000000 00000100,00000000,00000000 00000200,00000000,00000000 00000400,00000000,00000000 00000800,00000000,00000000 00001000,00000000,00000000 00002000,00000000,00000000 00004000,00000000,00000000 00008000,00000000,00000000 00000000,00000000,00000080 00010000,00000000,00000000 00020000,00000000,00000000 00040000,00000000,00000000 00080000,00000000,00000000 00100000,00000000,00000000 00200000,00000000,00000000 00400000,00000000,00000000 00800000,00000000,00000000 01000000,00000000,00000000 02000000,00000000,00000000 00000000,00000000,00000100 04000000,00000000,00000000 08000000,00000000,00000000 10000000,00000000,00000000 20000000,00000000,00000000 40000000,00000000,00000000 80000000,00000000,00000000 00000000,00000000,00000200 ```

GiteaMirror commented

2026-05-03 13:50:38 -05:00

@wwjCMP commented on GitHub (May 3, 2024):

@wwjCMP if your CPU does not have hyperthreads, then the "thread_siblings" above are supposed to show no siblings, and we should default to one thread per core. If that's not the behavior you're seeing, can you share the output of the above from your system?

Here you are

@wwjCMP commented on GitHub (May 3, 2024): > @wwjCMP if your CPU does not have hyperthreads, then the "thread_siblings" above are supposed to show no siblings, and we should default to one thread per core. If that's not the behavior you're seeing, can you share the output of the above from your system? Here you are

GiteaMirror commented

2026-05-03 13:50:39 -05:00

@dhiltgen commented on GitHub (May 4, 2024):

Thanks for the output @wwjCMP. It does look like the thread_siblings are all unique so we should allocate 96 threads by default on this system. I'll try to figure out why this isn't happening.

As a workaround, you should be able to set num_thread to override our default behavior. For example:

% ollama run llama3
>>> /set parameter num_thread 96
Set parameter 'num_thread' to '96'
>>> why is the sky blue?
The color of the sky can vary depending on the time of day, atmospheric conditions, and location. However, under normal conditions, the sky typically appears blue
because of a phenomenon called Rayleigh scattering.
...

@dhiltgen commented on GitHub (May 4, 2024): Thanks for the output @wwjCMP. It does look like the thread_siblings are all unique so we **should** allocate 96 threads by default on this system. I'll try to figure out why this isn't happening. As a workaround, you should be able to set num_thread to override our default behavior. For example: ``` % ollama run llama3 >>> /set parameter num_thread 96 Set parameter 'num_thread' to '96' >>> why is the sky blue? The color of the sky can vary depending on the time of day, atmospheric conditions, and location. However, under normal conditions, the sky typically appears blue because of a phenomenon called Rayleigh scattering. ... ```

GiteaMirror commented

2026-05-03 13:50:40 -05:00

@oldgithubman commented on GitHub (May 21, 2024):

The current logic is completely borked. On my 13900K (24-core, 32-thread), ollama defaults to using four cores. If I set it to use 24 cores, it uses 16. If I set it to use 32, it uses 24. The default should be to use all the physical cores, which you say is the current default, but it isn't. If the user sets num_threads (why isn't this a global setting?), ollama should use the number of threads the user set, regardless of performance

@oldgithubman commented on GitHub (May 21, 2024): The current logic is completely borked. On my 13900K (24-core, 32-thread), ollama defaults to using four cores. If I set it to use 24 cores, it uses 16. If I set it to use 32, it uses 24. The default should be to use all the physical cores, which you say *is* the current default, but it isn't. If the user sets num_threads (why isn't this a global setting?), ollama should use the number of threads the user set, *regardless of performance*

GiteaMirror commented

2026-05-03 13:50:41 -05:00

@Googlepuss commented on GitHub (Jun 5, 2024):

I have ollama set up on VM for testing, with 12 vCPU (4 socket & 3 core topology) and 16GB RAM (no GPU). I am not sure where to see the global default num_thread from CLI, but open-webui indicates "2". I came to this thread looking for a reason why RAM has almost zero utilization (maybe 2-3GB of available 16), while the CPU seems to be completely taxed by every query. I could throw more resources at it, but with the RAM seemingly not used, I am wondering if there is a configuration I have overlooked. most everything is the default.

me@follama:$ ls /sys/devices/system/cpu/
cpu0 cpu10 cpu2 cpu4 cpu6 cpu8 cpufreq crash_hotplug isolated modalias offline possible present uevent
cpu1 cpu11 cpu3 cpu5 cpu7 cpu9 cpuidle hotplug kernel_max nohz_full online power smt vulnerabilities
me@follama:$ cat /sys/devices/system/cpu/cpu*/topology/thread_siblings
00000000,00000001
00000000,00000400
00000000,00000800
00000000,00000002
00000000,00000004
00000000,00000008
00000000,00000010
00000000,00000020
00000000,00000040
00000000,00000080
00000000,00000100
00000000,00000200

@Googlepuss commented on GitHub (Jun 5, 2024): I have ollama set up on VM for testing, with 12 vCPU (4 socket & 3 core topology) and 16GB RAM (no GPU). I am not sure where to see the global default num_thread from CLI, but open-webui indicates "2". I came to this thread looking for a reason why RAM has almost zero utilization (maybe 2-3GB of available 16), while the CPU seems to be completely taxed by every query. I could throw more resources at it, but with the RAM seemingly not used, I am wondering if there is a configuration I have overlooked. most everything is the default. me@follama:~$ ls /sys/devices/system/cpu/ cpu0 cpu10 cpu2 cpu4 cpu6 cpu8 cpufreq crash_hotplug isolated modalias offline possible present uevent cpu1 cpu11 cpu3 cpu5 cpu7 cpu9 cpuidle hotplug kernel_max nohz_full online power smt vulnerabilities me@follama:~$ cat /sys/devices/system/cpu/cpu*/topology/thread_siblings 00000000,00000001 00000000,00000400 00000000,00000800 00000000,00000002 00000000,00000004 00000000,00000008 00000000,00000010 00000000,00000020 00000000,00000040 00000000,00000080 00000000,00000100 00000000,00000200

GiteaMirror commented

2026-05-03 13:50:42 -05:00

@oldgithubman commented on GitHub (Jun 5, 2024):

I have ollama set up on VM for testing, with 12 vCPU (4 socket & 3 core topology) and 16GB RAM (no GPU). I am not sure where to see the global default num_thread from CLI, but open-webui indicates "2". I came to this thread looking for a reason why RAM has almost zero utilization (maybe 2-3GB of available 16), while the CPU seems to be completely taxed by every query. I could throw more resources at it, but with the RAM seemingly not used, I am wondering if there is a configuration I have overlooked. most everything is the default.

me@follama:~ ls /sys/devices/system/cpu/ cpu0 cpu10 cpu2 cpu4 cpu6 cpu8 cpufreq crash_hotplug isolated modalias offline possible present uevent cpu1 cpu11 cpu3 cpu5 cpu7 cpu9 cpuidle hotplug kernel_max nohz_full online power smt vulnerabilities me@follama:~ cat /sys/devices/system/cpu/cpu*/topology/thread_siblings 00000000,00000001 00000000,00000400 00000000,00000800 00000000,00000002 00000000,00000004 00000000,00000008 00000000,00000010 00000000,00000020 00000000,00000040 00000000,00000080 00000000,00000100 00000000,00000200

If I understand correctly, the low RAM usage is an illusion due to the way linux memory mapping works. Pull up a tool like glances and watch the IO to your drive. As long as it's not obviously streaming from your drive, you're probably fine

@oldgithubman commented on GitHub (Jun 5, 2024): > I have ollama set up on VM for testing, with 12 vCPU (4 socket & 3 core topology) and 16GB RAM (no GPU). I am not sure where to see the global default num_thread from CLI, but open-webui indicates "2". I came to this thread looking for a reason why RAM has almost zero utilization (maybe 2-3GB of available 16), while the CPU seems to be completely taxed by every query. I could throw more resources at it, but with the RAM seemingly not used, I am wondering if there is a configuration I have overlooked. most everything is the default. > > me@follama:~$ ls /sys/devices/system/cpu/ cpu0 cpu10 cpu2 cpu4 cpu6 cpu8 cpufreq crash_hotplug isolated modalias offline possible present uevent cpu1 cpu11 cpu3 cpu5 cpu7 cpu9 cpuidle hotplug kernel_max nohz_full online power smt vulnerabilities me@follama:~$ cat /sys/devices/system/cpu/cpu*/topology/thread_siblings 00000000,00000001 00000000,00000400 00000000,00000800 00000000,00000002 00000000,00000004 00000000,00000008 00000000,00000010 00000000,00000020 00000000,00000040 00000000,00000080 00000000,00000100 00000000,00000200 If I understand correctly, the low RAM usage is an illusion due to the way linux memory mapping works. Pull up a tool like glances and watch the IO to your drive. As long as it's not obviously streaming from your drive, you're probably fine

GiteaMirror commented

2026-05-03 13:50:42 -05:00

@Googlepuss commented on GitHub (Jun 6, 2024):

Thanks, @oldmanjk! had not used glances prior, and is super useful. Attaching screenshots when running basic questions "sky blue", "tell a joke", "short story", etc. Disck i/o doesn't stand out, CPU stays pegged, mem never exceeds 6.9% (of 16GB vRAM). this is all llama3:latest and /set parameter num_gpu 0 num_thread 12. not sure where to go next I guess. CPU is not super modern but should be able to handle tell me a joke without pegging out.

@Googlepuss commented on GitHub (Jun 6, 2024): Thanks, @oldmanjk! had not used glances prior, and is super useful. Attaching screenshots when running basic questions "sky blue", "tell a joke", "short story", etc. Disck i/o doesn't stand out, CPU stays pegged, mem never exceeds 6.9% (of 16GB vRAM). this is all llama3:latest and /set parameter num_gpu 0 num_thread 12. not sure where to go next I guess. CPU is not super modern but should be able to handle tell me a joke without pegging out. <img width="1680" alt="Screenshot 2024-06-05 at 7 35 03 PM" src="https://github.com/ollama/ollama/assets/110204795/519bbd71-049b-49a7-b8f8-8c53589e3caa"> <img width="270" alt="Screenshot 2024-06-05 at 7 37 05 PM" src="https://github.com/ollama/ollama/assets/110204795/cc4ea422-1c67-4110-aa07-ebfc7e7f9559">

GiteaMirror commented

2026-05-03 13:50:43 -05:00

@oldgithubman commented on GitHub (Jun 6, 2024):

Thanks, @oldmanjk! had not used glances prior, and is super useful. Attaching screenshots when running basic questions "sky blue", "tell a joke", "short story", etc. Disck i/o doesn't stand out, CPU stays pegged, mem never exceeds 6.9% (of 16GB vRAM). this is all llama3:latest and /set parameter num_gpu 0 num_thread 12. not sure where to go next I guess. CPU is not super modern but should be able to handle tell me a joke without pegging out.

Yeah, I discovered glances recently and like it so far. It has a lot of depth I haven't had the time to explore yet. I thought you were talking about system RAM, not VRAM. VRAM should show as used when in use, so something isn't offloading right. num_gpu is how many layers you want offloaded to gpu, so that explains that. Assuming you want to utilize your gpu more, you want to increase that number, or if you just want ollama to use most of your gpu, delete that parameter entirely

Edit - I see now you mean virtual RAM. I didn't catch the no-gpu thing earlier. Yeah, if you're not using gpu, your CPU has to do all the work, so you should expect full usage. It will go all-out until it's done. Full usage is actually a good thing, because it means you're not bottlenecked by IO. If you want less usage so you can do other things simultaneously, tell ollama to use fewer threads (make sure none of those vCPU's are mapped to vcores). Ideally, you want to offload to a GPU.

(I keep realizing I'm misreading what you wrote - I'm a bit off atm - so if I screwed anything further up, I apologize)

@oldgithubman commented on GitHub (Jun 6, 2024): > Thanks, @oldmanjk! had not used glances prior, and is super useful. Attaching screenshots when running basic questions "sky blue", "tell a joke", "short story", etc. Disck i/o doesn't stand out, CPU stays pegged, mem never exceeds 6.9% (of 16GB vRAM). this is all llama3:latest and /set parameter num_gpu 0 num_thread 12. not sure where to go next I guess. CPU is not super modern but should be able to handle tell me a joke without pegging out. > > <img width="1680" alt="Screenshot 2024-06-05 at 7 35 03 PM" src="https://github.com/ollama/ollama/assets/110204795/519bbd71-049b-49a7-b8f8-8c53589e3caa"> > <img width="270" alt="Screenshot 2024-06-05 at 7 37 05 PM" src="https://github.com/ollama/ollama/assets/110204795/cc4ea422-1c67-4110-aa07-ebfc7e7f9559"> > > Yeah, I discovered glances recently and like it so far. It has a lot of depth I haven't had the time to explore yet. I thought you were talking about system RAM, not VRAM. VRAM *should* show as used when in use, so something isn't offloading right. num_gpu is how many layers you want offloaded to gpu, so that explains that. Assuming you want to utilize your gpu more, you want to increase that number, or if you just want ollama to use most of your gpu, delete that parameter entirely Edit - I see now you mean virtual RAM. I didn't catch the no-gpu thing earlier. Yeah, if you're not using gpu, your CPU has to do all the work, so you should expect full usage. It will go all-out until it's done. Full usage is actually a good thing, because it means you're not bottlenecked by IO. If you want less usage so you can do other things simultaneously, tell ollama to use fewer threads (make sure none of those vCPU's are mapped to vcores). Ideally, you want to offload to a GPU. (I keep realizing I'm misreading what you wrote - I'm a bit off atm - so if I screwed anything further up, I apologize)

GiteaMirror commented

2026-05-03 13:50:44 -05:00

@Googlepuss commented on GitHub (Jun 6, 2024):

Thanks, @oldmanjk! had not used glances prior, and is super useful. Attaching screenshots when running basic questions "sky blue", "tell a joke", "short story", etc. Disck i/o doesn't stand out, CPU stays pegged, mem never exceeds 6.9% (of 16GB vRAM). this is all llama3:latest and /set parameter num_gpu 0 num_thread 12. not sure where to go next I guess. CPU is not super modern but should be able to handle tell me a joke without pegging out.

Yeah, I discovered glances recently and like it so far. It has a lot of depth I haven't had the time to explore yet. I thought you were talking about system RAM, not VRAM. VRAM should show as used when in use, so something isn't offloading right. num_gpu is how many layers you want offloaded to gpu, so that explains that. Assuming you want to utilize your gpu more, you want to increase that number, or if you just want ollama to use most of your gpu, delete that parameter entirely

Edit - I see now you mean virtual RAM. I didn't catch the no-gpu thing earlier. Yeah, if you're not using gpu, your CPU has to do all the work, so you should expect full usage. It will go all-out until it's done. Full usage is actually a good thing, because it means you're not bottlenecked by IO. If you want less usage so you can do other things simultaneously, tell ollama to use fewer threads (make sure none of those vCPU's are mapped to vcores). Ideally, you want to offload to a GPU.

(I keep realizing I'm misreading what you wrote - I'm a bit off atm - so if I screwed anything further up, I apologize)

No worries at all. thanks for the feedback. yes "virtual RAM". I am interested in your remark about hyperthreading, I have 2 intel E5-2680 v3 on host and VM env managed by xcp-ng/XOA. So, the hypervisor sees "48 Core 2 Socket" and "Hyperthread enabled". But in bare metal terms, it is 2 Socket, 24 Core, and 48 Thread. With hyperthread disabled on the host, this would bring the available vCPU count from 48 to 24 (believe). is this problematic (having hyperthreading enabled)?

@Googlepuss commented on GitHub (Jun 6, 2024): > > Thanks, @oldmanjk! had not used glances prior, and is super useful. Attaching screenshots when running basic questions "sky blue", "tell a joke", "short story", etc. Disck i/o doesn't stand out, CPU stays pegged, mem never exceeds 6.9% (of 16GB vRAM). this is all llama3:latest and /set parameter num_gpu 0 num_thread 12. not sure where to go next I guess. CPU is not super modern but should be able to handle tell me a joke without pegging out. > > <img alt="Screenshot 2024-06-05 at 7 35 03 PM" width="1680" src="https://private-user-images.githubusercontent.com/110204795/337067932-519bbd71-049b-49a7-b8f8-8c53589e3caa.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTc2NDAwMjksIm5iZiI6MTcxNzYzOTcyOSwicGF0aCI6Ii8xMTAyMDQ3OTUvMzM3MDY3OTMyLTUxOWJiZDcxLTA0OWItNDlhNy1iOGY4LThjNTM1ODllM2NhYS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNjA2JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDYwNlQwMjA4NDlaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT05M2MxMjNmZDVjMzYyNDMyODBhOGYyZDhkODFkZWE1ZDkwNmNjYWUxOTI1MWRhYzJjNjczNDI2ZDlkODkwYjE2JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.-YlO8Fbi46wt4XP0BHqsXp7CNUtaMXn2sfqYrqJzZz4"> > > <img alt="Screenshot 2024-06-05 at 7 37 05 PM" width="270" src="https://private-user-images.githubusercontent.com/110204795/337067936-cc4ea422-1c67-4110-aa07-ebfc7e7f9559.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTc2NDAwMjksIm5iZiI6MTcxNzYzOTcyOSwicGF0aCI6Ii8xMTAyMDQ3OTUvMzM3MDY3OTM2LWNjNGVhNDIyLTFjNjctNDExMC1hYTA3LWViZmM3ZTdmOTU1OS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNjA2JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDYwNlQwMjA4NDlaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1lZmRhOWRjNTU4MTYyNDQzZGViZmUwODEzMjk0ZjdkZTIxMzVlMDdjMTZkMGYxOWQwYzVjMTEyNTBmYjhmNzA2JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.kpDbsKwgqgheRtl81CB9_VhQTA2tm5rWdJqV_7SJ2Kc"> > > Yeah, I discovered glances recently and like it so far. It has a lot of depth I haven't had the time to explore yet. I thought you were talking about system RAM, not VRAM. VRAM _should_ show as used when in use, so something isn't offloading right. num_gpu is how many layers you want offloaded to gpu, so that explains that. Assuming you want to utilize your gpu more, you want to increase that number, or if you just want ollama to use most of your gpu, delete that parameter entirely > > Edit - I see now you mean virtual RAM. I didn't catch the no-gpu thing earlier. Yeah, if you're not using gpu, your CPU has to do all the work, so you should expect full usage. It will go all-out until it's done. Full usage is actually a good thing, because it means you're not bottlenecked by IO. If you want less usage so you can do other things simultaneously, tell ollama to use fewer threads (make sure none of those vCPU's are mapped to vcores). Ideally, you want to offload to a GPU. > > (I keep realizing I'm misreading what you wrote - I'm a bit off atm - so if I screwed anything further up, I apologize) No worries at all. thanks for the feedback. yes "virtual RAM". I am interested in your remark about hyperthreading, I have 2 [intel E5-2680 v3](https://ark.intel.com/content/www/us/en/ark/products/81908/intel-xeon-processor-e5-2680-v3-30m-cache-2-50-ghz.html) on host and VM env managed by xcp-ng/XOA. So, the hypervisor sees "48 Core 2 Socket" and "Hyperthread enabled". But in bare metal terms, it is 2 Socket, 24 Core, and 48 Thread. With hyperthread disabled on the host, this would bring the available vCPU count from 48 to 24 (believe). is this problematic (having hyperthreading enabled)?

GiteaMirror commented

2026-05-03 13:50:44 -05:00

@oldgithubman commented on GitHub (Jun 6, 2024):

Thanks, @oldmanjk! had not used glances prior, and is super useful. Attaching screenshots when running basic questions "sky blue", "tell a joke", "short story", etc. Disck i/o doesn't stand out, CPU stays pegged, mem never exceeds 6.9% (of 16GB vRAM). this is all llama3:latest and /set parameter num_gpu 0 num_thread 12. not sure where to go next I guess. CPU is not super modern but should be able to handle tell me a joke without pegging out.

Yeah, I discovered glances recently and like it so far. It has a lot of depth I haven't had the time to explore yet. I thought you were talking about system RAM, not VRAM. VRAM should show as used when in use, so something isn't offloading right. num_gpu is how many layers you want offloaded to gpu, so that explains that. Assuming you want to utilize your gpu more, you want to increase that number, or if you just want ollama to use most of your gpu, delete that parameter entirely

Edit - I see now you mean virtual RAM. I didn't catch the no-gpu thing earlier. Yeah, if you're not using gpu, your CPU has to do all the work, so you should expect full usage. It will go all-out until it's done. Full usage is actually a good thing, because it means you're not bottlenecked by IO. If you want less usage so you can do other things simultaneously, tell ollama to use fewer threads (make sure none of those vCPU's are mapped to vcores). Ideally, you want to offload to a GPU.

(I keep realizing I'm misreading what you wrote - I'm a bit off atm - so if I screwed anything further up, I apologize)

No worries at all. thanks for the feedback. yes "virtual RAM". I am interested in your remark about hyperthreading, I have 2 intel E5-2680 v3 on host and VM env managed by xcp-ng/XOA. So, the hypervisor sees "48 Core 2 Socket" and "Hyperthread enabled". But in bare metal terms, it is 2 Socket, 24 Core, and 48 Thread. With hyperthread disabled on the host, this would bring the available vCPU count from 48 to 24 (believe). is this problematic (having hyperthreading enabled)?

It's not a problem for hyperthreading to be enabled, no. You'll just get better performance (usually - test it) if you restrict inference to physical cores. Also worth testing running inference just on the physical cores of one CPU, depending on memory layout, etc. You're probably losing performance from inter-CPU communication overhead. Limiting to one CPU might cut your available memory in half though. Basically, test all these things if you're looking for better performance, but again, you're much better off using a GPU if you can

@oldgithubman commented on GitHub (Jun 6, 2024): > > > Thanks, @oldmanjk! had not used glances prior, and is super useful. Attaching screenshots when running basic questions "sky blue", "tell a joke", "short story", etc. Disck i/o doesn't stand out, CPU stays pegged, mem never exceeds 6.9% (of 16GB vRAM). this is all llama3:latest and /set parameter num_gpu 0 num_thread 12. not sure where to go next I guess. CPU is not super modern but should be able to handle tell me a joke without pegging out. > > > <img alt="Screenshot 2024-06-05 at 7 35 03 PM" width="1680" src="https://private-user-images.githubusercontent.com/110204795/337067932-519bbd71-049b-49a7-b8f8-8c53589e3caa.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTc2NDAwMjksIm5iZiI6MTcxNzYzOTcyOSwicGF0aCI6Ii8xMTAyMDQ3OTUvMzM3MDY3OTMyLTUxOWJiZDcxLTA0OWItNDlhNy1iOGY4LThjNTM1ODllM2NhYS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNjA2JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDYwNlQwMjA4NDlaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT05M2MxMjNmZDVjMzYyNDMyODBhOGYyZDhkODFkZWE1ZDkwNmNjYWUxOTI1MWRhYzJjNjczNDI2ZDlkODkwYjE2JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.-YlO8Fbi46wt4XP0BHqsXp7CNUtaMXn2sfqYrqJzZz4"> > > > <img alt="Screenshot 2024-06-05 at 7 37 05 PM" width="270" src="https://private-user-images.githubusercontent.com/110204795/337067936-cc4ea422-1c67-4110-aa07-ebfc7e7f9559.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTc2NDAwMjksIm5iZiI6MTcxNzYzOTcyOSwicGF0aCI6Ii8xMTAyMDQ3OTUvMzM3MDY3OTM2LWNjNGVhNDIyLTFjNjctNDExMC1hYTA3LWViZmM3ZTdmOTU1OS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNjA2JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDYwNlQwMjA4NDlaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1lZmRhOWRjNTU4MTYyNDQzZGViZmUwODEzMjk0ZjdkZTIxMzVlMDdjMTZkMGYxOWQwYzVjMTEyNTBmYjhmNzA2JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.kpDbsKwgqgheRtl81CB9_VhQTA2tm5rWdJqV_7SJ2Kc"> > > > > Yeah, I discovered glances recently and like it so far. It has a lot of depth I haven't had the time to explore yet. I thought you were talking about system RAM, not VRAM. VRAM _should_ show as used when in use, so something isn't offloading right. num_gpu is how many layers you want offloaded to gpu, so that explains that. Assuming you want to utilize your gpu more, you want to increase that number, or if you just want ollama to use most of your gpu, delete that parameter entirely > > > > Edit - I see now you mean virtual RAM. I didn't catch the no-gpu thing earlier. Yeah, if you're not using gpu, your CPU has to do all the work, so you should expect full usage. It will go all-out until it's done. Full usage is actually a good thing, because it means you're not bottlenecked by IO. If you want less usage so you can do other things simultaneously, tell ollama to use fewer threads (make sure none of those vCPU's are mapped to vcores). Ideally, you want to offload to a GPU. > > > > (I keep realizing I'm misreading what you wrote - I'm a bit off atm - so if I screwed anything further up, I apologize) > > No worries at all. thanks for the feedback. yes "virtual RAM". I am interested in your remark about hyperthreading, I have 2 [intel E5-2680 v3](https://ark.intel.com/content/www/us/en/ark/products/81908/intel-xeon-processor-e5-2680-v3-30m-cache-2-50-ghz.html) on host and VM env managed by xcp-ng/XOA. So, the hypervisor sees "48 Core 2 Socket" and "Hyperthread enabled". But in bare metal terms, it is 2 Socket, 24 Core, and 48 Thread. With hyperthread disabled on the host, this would bring the available vCPU count from 48 to 24 (believe). is this problematic (having hyperthreading enabled)? It's not a problem for hyperthreading to be enabled, no. You'll just get better performance (usually - test it) if you restrict inference to physical cores. Also worth testing running inference just on the physical cores of one CPU, depending on memory layout, etc. You're probably losing performance from inter-CPU communication overhead. Limiting to one CPU might cut your available memory in half though. Basically, test all these things if you're looking for better performance, but again, you're much better off using a GPU if you can

GiteaMirror commented

2026-05-03 13:50:45 -05:00

@Googlepuss commented on GitHub (Jun 6, 2024):

Thanks, @oldmanjk! had not used glances prior, and is super useful. Attaching screenshots when running basic questions "sky blue", "tell a joke", "short story", etc. Disck i/o doesn't stand out, CPU stays pegged, mem never exceeds 6.9% (of 16GB vRAM). this is all llama3:latest and /set parameter num_gpu 0 num_thread 12. not sure where to go next I guess. CPU is not super modern but should be able to handle tell me a joke without pegging out.

Yeah, I discovered glances recently and like it so far. It has a lot of depth I haven't had the time to explore yet. I thought you were talking about system RAM, not VRAM. VRAM should show as used when in use, so something isn't offloading right. num_gpu is how many layers you want offloaded to gpu, so that explains that. Assuming you want to utilize your gpu more, you want to increase that number, or if you just want ollama to use most of your gpu, delete that parameter entirely
Edit - I see now you mean virtual RAM. I didn't catch the no-gpu thing earlier. Yeah, if you're not using gpu, your CPU has to do all the work, so you should expect full usage. It will go all-out until it's done. Full usage is actually a good thing, because it means you're not bottlenecked by IO. If you want less usage so you can do other things simultaneously, tell ollama to use fewer threads (make sure none of those vCPU's are mapped to vcores). Ideally, you want to offload to a GPU.
(I keep realizing I'm misreading what you wrote - I'm a bit off atm - so if I screwed anything further up, I apologize)

No worries at all. thanks for the feedback. yes "virtual RAM". I am interested in your remark about hyperthreading, I have 2 intel E5-2680 v3 on host and VM env managed by xcp-ng/XOA. So, the hypervisor sees "48 Core 2 Socket" and "Hyperthread enabled". But in bare metal terms, it is 2 Socket, 24 Core, and 48 Thread. With hyperthread disabled on the host, this would bring the available vCPU count from 48 to 24 (believe). is this problematic (having hyperthreading enabled)?

It's not a problem for hyperthreading to be enabled, no. You'll just get better performance (usually - test it) if you restrict inference to physical cores. Also worth testing running inference just on the physical cores of one CPU, depending on memory layout, etc. You're probably losing performance from inter-CPU communication overhead. Limiting to one CPU might cut your available memory in half though. Basically, test all these things if you're looking for better performance, but again, you're much better off using a GPU if you can

Thanks! I will take the direction and play with the CPU mappings. Agree on the GPU, just trying to get the CPU-only config optimal before I work in GPU and likely a dedicated host. Appreciate it.

@Googlepuss commented on GitHub (Jun 6, 2024): > > > > Thanks, @oldmanjk! had not used glances prior, and is super useful. Attaching screenshots when running basic questions "sky blue", "tell a joke", "short story", etc. Disck i/o doesn't stand out, CPU stays pegged, mem never exceeds 6.9% (of 16GB vRAM). this is all llama3:latest and /set parameter num_gpu 0 num_thread 12. not sure where to go next I guess. CPU is not super modern but should be able to handle tell me a joke without pegging out. > > > > <img alt="Screenshot 2024-06-05 at 7 35 03 PM" width="1680" src="https://private-user-images.githubusercontent.com/110204795/337067932-519bbd71-049b-49a7-b8f8-8c53589e3caa.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTc2NDAwMjksIm5iZiI6MTcxNzYzOTcyOSwicGF0aCI6Ii8xMTAyMDQ3OTUvMzM3MDY3OTMyLTUxOWJiZDcxLTA0OWItNDlhNy1iOGY4LThjNTM1ODllM2NhYS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNjA2JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDYwNlQwMjA4NDlaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT05M2MxMjNmZDVjMzYyNDMyODBhOGYyZDhkODFkZWE1ZDkwNmNjYWUxOTI1MWRhYzJjNjczNDI2ZDlkODkwYjE2JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.-YlO8Fbi46wt4XP0BHqsXp7CNUtaMXn2sfqYrqJzZz4"> > > > > <img alt="Screenshot 2024-06-05 at 7 37 05 PM" width="270" src="https://private-user-images.githubusercontent.com/110204795/337067936-cc4ea422-1c67-4110-aa07-ebfc7e7f9559.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTc2NDAwMjksIm5iZiI6MTcxNzYzOTcyOSwicGF0aCI6Ii8xMTAyMDQ3OTUvMzM3MDY3OTM2LWNjNGVhNDIyLTFjNjctNDExMC1hYTA3LWViZmM3ZTdmOTU1OS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNjA2JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDYwNlQwMjA4NDlaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1lZmRhOWRjNTU4MTYyNDQzZGViZmUwODEzMjk0ZjdkZTIxMzVlMDdjMTZkMGYxOWQwYzVjMTEyNTBmYjhmNzA2JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.kpDbsKwgqgheRtl81CB9_VhQTA2tm5rWdJqV_7SJ2Kc"> > > > > > > > > > Yeah, I discovered glances recently and like it so far. It has a lot of depth I haven't had the time to explore yet. I thought you were talking about system RAM, not VRAM. VRAM _should_ show as used when in use, so something isn't offloading right. num_gpu is how many layers you want offloaded to gpu, so that explains that. Assuming you want to utilize your gpu more, you want to increase that number, or if you just want ollama to use most of your gpu, delete that parameter entirely > > > Edit - I see now you mean virtual RAM. I didn't catch the no-gpu thing earlier. Yeah, if you're not using gpu, your CPU has to do all the work, so you should expect full usage. It will go all-out until it's done. Full usage is actually a good thing, because it means you're not bottlenecked by IO. If you want less usage so you can do other things simultaneously, tell ollama to use fewer threads (make sure none of those vCPU's are mapped to vcores). Ideally, you want to offload to a GPU. > > > (I keep realizing I'm misreading what you wrote - I'm a bit off atm - so if I screwed anything further up, I apologize) > > > > > > No worries at all. thanks for the feedback. yes "virtual RAM". I am interested in your remark about hyperthreading, I have 2 [intel E5-2680 v3](https://ark.intel.com/content/www/us/en/ark/products/81908/intel-xeon-processor-e5-2680-v3-30m-cache-2-50-ghz.html) on host and VM env managed by xcp-ng/XOA. So, the hypervisor sees "48 Core 2 Socket" and "Hyperthread enabled". But in bare metal terms, it is 2 Socket, 24 Core, and 48 Thread. With hyperthread disabled on the host, this would bring the available vCPU count from 48 to 24 (believe). is this problematic (having hyperthreading enabled)? > > It's not a problem for hyperthreading to be enabled, no. You'll just get better performance (usually - test it) if you restrict inference to physical cores. Also worth testing running inference just on the physical cores of one CPU, depending on memory layout, etc. You're probably losing performance from inter-CPU communication overhead. Limiting to one CPU might cut your available memory in half though. Basically, test all these things if you're looking for better performance, but again, you're much better off using a GPU if you can Thanks! I will take the direction and play with the CPU mappings. Agree on the GPU, just trying to get the CPU-only config optimal before I work in GPU and likely a dedicated host. Appreciate it.

GiteaMirror commented

2026-05-03 13:50:46 -05:00

@sekrett commented on GitHub (Jun 7, 2024):

You'll just get better performance (usually - test it) if you restrict inference to physical cores. Also worth testing running inference just on the physical cores of one CPU, depending on memory layout, etc. You're probably losing performance from inter-CPU communication overhead.

Completely agree, I tested on different desktops and Xeons, without HT you get code compilation faster by setting make -j8 for example.

@sekrett commented on GitHub (Jun 7, 2024): > You'll just get better performance (usually - test it) if you restrict inference to physical cores. Also worth testing running inference just on the physical cores of one CPU, depending on memory layout, etc. You're probably losing performance from inter-CPU communication overhead. Completely agree, I tested on different desktops and Xeons, without HT you get code compilation faster by setting `make -j8` for example.

GiteaMirror commented

2026-05-03 13:50:48 -05:00

@jasonwang178 commented on GitHub (Jul 2, 2024):

Hi, how can I set the num_thread for ollama serve instead of ollama run?

@jasonwang178 commented on GitHub (Jul 2, 2024): Hi, how can I set the `num_thread` for `ollama serve` instead of `ollama run`?

GiteaMirror commented

2026-05-03 13:50:49 -05:00

@jasonwang178 commented on GitHub (Jul 2, 2024):

Hi, how can I set the num_thread for ollama serve instead of ollama run?

Alright, I found the solution for ollama serve. Simply add the num_thread parameter when making the API request.

Reference: https://github.com/ollama/ollama/blob/main/docs/api.md#request-6

@jasonwang178 commented on GitHub (Jul 2, 2024): > Hi, how can I set the `num_thread` for `ollama serve` instead of `ollama run`? Alright, I found the solution for `ollama serve`. Simply add the `num_thread` parameter when making the API request. ![image](https://github.com/ollama/ollama/assets/222802/deb4ef32-ba60-4ad9-a978-8f0375ca72af) Reference: https://github.com/ollama/ollama/blob/main/docs/api.md#request-6

GiteaMirror commented

2026-05-03 13:50:50 -05:00

@GregChiang0201 commented on GitHub (Jul 29, 2024):

Hi, how can I set the num_thread for ollama serve instead of ollama run?

Alright, I found the solution for ollama serve. Simply add the num_thread parameter when making the API request.

Reference: https://github.com/ollama/ollama/blob/main/docs/api.md#request-6

Excuse me, is there any option can change num_thread permanently? Since I can only run the instructions above every time to set the custom threads, or the system will use half of my cores to run it.

@GregChiang0201 commented on GitHub (Jul 29, 2024): > > Hi, how can I set the `num_thread` for `ollama serve` instead of `ollama run`? > > Alright, I found the solution for `ollama serve`. Simply add the `num_thread` parameter when making the API request. > > ![image](https://private-user-images.githubusercontent.com/222802/344922090-deb4ef32-ba60-4ad9-a978-8f0375ca72af.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjIyNDEwODYsIm5iZiI6MTcyMjI0MDc4NiwicGF0aCI6Ii8yMjI4MDIvMzQ0OTIyMDkwLWRlYjRlZjMyLWJhNjAtNGFkOS1hOTc4LThmMDM3NWNhNzJhZi5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzI5JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDcyOVQwODEzMDZaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT0wNmE1ZDVhYjM3ODc0ODU3MTE2NDIzMGFkMGVkYTQ3ZTc0MmY0Y2Y3Y2FiMDQyNTI0YTZkMTk1MTNkMzQ0MzM2JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.SSDlNwvpSpk7INZhgFg5H3jsPml_Z0r9bCQyARBndvI) > > Reference: https://github.com/ollama/ollama/blob/main/docs/api.md#request-6 Excuse me, is there any option can change num_thread permanently? Since I can only run the instructions above every time to set the custom threads, or the system will use half of my cores to run it.

GiteaMirror commented

2026-05-03 13:50:51 -05:00

@dbustosrc commented on GitHub (Jul 30, 2024):

#5554

@dbustosrc commented on GitHub (Jul 30, 2024): #5554

GiteaMirror commented

2026-05-03 13:50:52 -05:00

@tigran123 commented on GitHub (Oct 1, 2025):

Oh dear, how confident are we in forcing a policy, where only a mechanism should be provided, forgetting the wise words of Linus Torvalds back from early 1990s, explaining that OS (and software in general) should never force any stupid (i.e. stupid to a user who knows perhaps better than the developer) policy down all users' throat and only providing mechanisms, leaving it up to the user to decide, which policy he ought to adopt.

Now, in this case, I would say it is very naive to assume that everyone's CPU would behave in exactly the same way that a particular CPU that the developer has tested on does. The whole point of virtual threads appearing as "separate CPUs" (e.g. in /proc/cpuinfo) is that they ought to really be used as separate CPUs in the sense of MP specification. That was the case 20 years ago and it is true today. Yes, in some (in many, I agree) scenarios it would be inefficient to use more than the number of physical cores, i.e. more than half of the virtual threads, but in others it would be more efficient. Otherwise there would be no such thing as HT support.

So, please provide a way (other than passing "num_thread" via API, unless it is possible to set it via Open WebUI, is it?) to set the num_thread somehow. Again, without having to create a new model via Modelfile specifically for this purpose.

I tried setting OLLAMA_NUM_THREADS environment variable to 12 and it did make a difference only at the initial load (which became almost twice as fast, btw!) but during the actual inference it reverted back to 6. I have Intel Core i7-6800K CPU which has 6 cores and 12 threads (of course HT is enabled, as it generally makes everything TWICE as fast -- that is the whole point of HT).

I found the PR for OLLAMA_NUM_PARALLEL, but it is not merged yet:

https://github.com/ollama/ollama/pull/9546

So, are there any workarounds, please?

UPDATE: Could it be that I simply misspelled OLLAMA_NUM_THREADS and it should be OLLAMA_NUM_THREAD instead? I will try both and see.

UPDATE: No, unfortunately, I still have 50% CPU utilisation, i.e. num_thread is internally set to 6, ignoring both OLLAMA_NUM_THREAD and OLLAMA_NUM_THREADS environment variables.

@tigran123 commented on GitHub (Oct 1, 2025): Oh dear, how confident are we in forcing a policy, where only a mechanism should be provided, forgetting the wise words of Linus Torvalds back from early 1990s, explaining that OS (and software in general) should never force any stupid (i.e. stupid to a user who knows perhaps better than the developer) _policy_ down all users' throat and only providing _mechanisms_, leaving it up to the user to decide, which policy he ought to adopt. Now, in this case, I would say it is very naive to assume that everyone's CPU would behave in exactly the same way that a particular CPU that the developer has tested on does. The whole point of virtual threads appearing as "separate CPUs" (e.g. in /proc/cpuinfo) is that they ought to really be used as separate CPUs in the sense of MP specification. That was the case 20 years ago and it is true today. Yes, in some (in many, I agree) scenarios it would be inefficient to use more than the number of physical cores, i.e. more than half of the virtual threads, but in others it would be more efficient. Otherwise there would be no such thing as HT support. So, please provide a way (other than passing "num_thread" via API, unless it is possible to set it via Open WebUI, is it?) to set the num_thread somehow. Again, without having to create a new model via Modelfile specifically for this purpose. I tried setting OLLAMA_NUM_THREADS environment variable to 12 and it did make a difference only at the initial load (which became almost twice as fast, btw!) but during the actual inference it reverted back to 6. I have Intel Core i7-6800K CPU which has 6 cores and 12 threads (of course HT is enabled, as it generally makes everything TWICE as fast -- that is the whole point of HT). I found the PR for OLLAMA_NUM_PARALLEL, but it is not merged yet: https://github.com/ollama/ollama/pull/9546 So, are there any workarounds, please? UPDATE: Could it be that I simply misspelled OLLAMA_NUM_THREADS and it should be OLLAMA_NUM_THREAD instead? I will try both and see. UPDATE: No, unfortunately, I still have 50% CPU utilisation, i.e. num_thread is internally set to 6, ignoring both OLLAMA_NUM_THREAD and OLLAMA_NUM_THREADS environment variables.

GiteaMirror commented

2026-05-03 13:50:52 -05:00

@tigran123 commented on GitHub (Oct 1, 2025):

UPDATE: I was reluctant to do ollama create model -f modelfile because I thought that it was going to duplicate the whole 65GB file :)

Now I have created a num_thread 12 version of gpt-oss:120b and compared its performance to the default one -- indeed, it is twice as slow -- 40 seconds vs 21 seconds for inference on the same prompt (from fresh state of ollama serve).

Thank you for your patience :)

@tigran123 commented on GitHub (Oct 1, 2025): UPDATE: I was reluctant to do `ollama create model -f modelfile` because I thought that it was going to duplicate the whole 65GB file :) Now I have created a `num_thread 12` version of gpt-oss:120b and compared its performance to the default one -- indeed, it is twice as slow -- 40 seconds vs 21 seconds for inference on the same prompt (from fresh state of `ollama serve`). Thank you for your patience :)

GiteaMirror commented

2026-05-03 13:50:53 -05:00

@tigran123 commented on GitHub (Oct 5, 2025):

Another update: I don't know how did I get those "40 vs 21 seconds" numbers before, but now I consistently get the opposite, namely: 17-19 seconds for the 12 threads version (100% cpu utilisation) and 28-31 seconds for the 6 threads version (50% CPU utilisation). Here are the screenshots and I can easily reproduce it. So, the default num_thread is set wrongly, after all. It should be set to the logically obvious value -- the number of CPUs (i.e. virtual threads) and not the number of physical cores as it does currently. At least on my Intel Core i7-6800K system with 128GB RAM.

@tigran123 commented on GitHub (Oct 5, 2025): Another update: I don't know how did I get those "40 vs 21 seconds" numbers before, but now I _consistently_ get the opposite, namely: 17-19 seconds for the 12 threads version (100% cpu utilisation) and 28-31 seconds for the 6 threads version (50% CPU utilisation). Here are the screenshots and I can easily reproduce it. So, the default num_thread is set wrongly, after all. It should be set to the logically obvious value -- the number of CPUs (i.e. virtual threads) and not the number of physical cores as it does currently. At least on my Intel Core i7-6800K system with 128GB RAM. <img width="1719" height="360" alt="Image" src="https://github.com/user-attachments/assets/f585cfeb-9417-4370-b1d4-760edde88dfb" /> <img width="1719" height="360" alt="Image" src="https://github.com/user-attachments/assets/dd1a1b42-41c7-4536-9b6a-758b1b84c78d" />

GiteaMirror commented

2026-05-03 13:50:54 -05:00

@minyor commented on GitHub (Oct 10, 2025):

Hello, I have same problem on our E5-2699 v3 server with 70+ cores.
Small models on ollama version is 0.1.38 loads and works fast under 4 seconds per request
But starting from ollama version 0.1.39 and up to latest version the loading and resposes take up to 15 minutes...
Setting OLLAMA_NUM_THREADS or OLLAMA_NUM_THREAD nor num_thread to 70 wont help..
Please adwise, are we really stack with this old ollama?

@minyor commented on GitHub (Oct 10, 2025): Hello, I have same problem on our E5-2699 v3 server with 70+ cores. Small models on ollama version is 0.1.38 loads and works fast under 4 seconds per request But starting from ollama version 0.1.39 and up to latest version the loading and resposes take up to 15 minutes... Setting OLLAMA_NUM_THREADS or OLLAMA_NUM_THREAD nor num_thread to 70 wont help.. Please adwise, are we really stack with this old ollama?

GiteaMirror commented

2026-05-03 13:50:56 -05:00

@tigran123 commented on GitHub (Oct 10, 2025):

Why not do what I did -- have a trivial Modelfile like this:

FROM modelname
PARAMETER num_thread <N>

then it works perfectly if num_thread is matching the number of real CPUs in the SMP sense (not some secondary notion of the so-called "physical cores" which is not useful and that is why it is not normally exported via API -- though one can manually count the core id values in /proc/cpuinfo, of course).

It is sad that the Ollama developers are stubbornly refusing to change the defaults, presumably because some exotic tests done by someone suggested that using only half of the available CPUs is somehow beneficial. But this is not critical -- just create a proper model clone for each of your models (i.e. with num_thread matching the number of CPUs) and everything works fine, and much faster than with this unfortunate default.

@tigran123 commented on GitHub (Oct 10, 2025): Why not do what I did -- have a trivial Modelfile like this: ``` FROM modelname PARAMETER num_thread <N> ``` then it works perfectly if `num_thread` is matching the number of real CPUs in the SMP sense (not some secondary notion of the so-called "physical cores" which is not useful and that is why it is not normally exported via API -- though one can manually count the `core id` values in `/proc/cpuinfo`, of course). It is sad that the Ollama developers are stubbornly refusing to change the defaults, presumably because some exotic tests done by someone suggested that using only half of the available CPUs is somehow beneficial. But this is not critical -- just create a proper model clone for each of your models (i.e. with num_thread matching the number of CPUs) and everything works fine, and much faster than with this unfortunate default.

GiteaMirror commented

2026-05-03 13:50:56 -05:00

@minyor commented on GitHub (Oct 10, 2025):

But I tried this, I modified existing model parameter like so:

ollama run tinyllama:1.1b
 /set parameter num_thread 72
 /save tinyllama:1.1b
 Ctrl+D
service ollama restart

But this only made my model run slow even on an 0.1.38 version of ollama, not only on >=0.1.39
Only deleting a model and redownloading it helped to make it run fast again on 0.1.38
And I do not mean a little slow, but about 90 times slower...
I surelly must've did something incorrect

Edit: Also, when I run requests, all the 72 cores are at 100% of load, whole these 15 slow minutes...
If this is a problem with number of threads then shouldn't only one or some cores be loaded instead of all of them?

Edit: If I edit the model again but this time set num_threads 0 likeso:

ollama run tinyllama:1.1b
 /set parameter num_thread 0
 /save tinyllama:1.1b
 Ctrl+D
service ollama restart

Then It again starts to work fast but only in ollama <=0.1.38

@minyor commented on GitHub (Oct 10, 2025): But I tried this, I modified existing model parameter like so: >ollama run tinyllama:1.1b >> /set parameter num_thread 72 >> /save tinyllama:1.1b >> Ctrl+D >service ollama restart But this only made my model run slow even on an 0.1.38 version of ollama, not only on >=0.1.39 Only deleting a model and redownloading it helped to make it run fast again on 0.1.38 And I do not mean a little slow, but about 90 times slower... I surelly must've did something incorrect Edit: Also, when I run requests, all the 72 cores are at 100% of load, whole these 15 slow minutes... If this is a problem with number of threads then shouldn't only one or some cores be loaded instead of all of them? Edit: If I edit the model again but this time set num_threads 0 likeso: >ollama run tinyllama:1.1b >> /set parameter num_thread 0 >> /save tinyllama:1.1b >> Ctrl+D >service ollama restart Then It again starts to work fast but only in ollama <=0.1.38

Sign in to join this conversation.

Branches Tags

main

parth-update-hermes-launch

parth-agent-system-prompt-cwd

hoyyeva/vscode-extension-docs-update

parth-gemma4-chat-template-renderer

parth-fix-claude-model-picker

parth-api-status-context-length

docs/vscode-extension-setup

hoyyeva/wire-up-context-length

hoyyeva/claude-code-context-doc

jmorganca/investigate-issue-17046

hoyyeva/hermes-docs

jmorganca/agent-loop-style

hoyyeva/openclaw

parth-agent-loop

hoyyeva/ollama-vscode-extension

brucemacd/cache-metrics

brucemacd/hermes-desktop

hoyyeva/docs-vscode

parth-input-style-experiment

brucemacd/docs-glm52

hoyyeva/poc-docs

Parth/mlx-launch-recommendations

parth-first-time-app-cli-experience

test/darwin-xcode-pin

improve-cloud-model-recommendations

hoyyeva/goose-docs

jmorganca/context-limit-fixes

hoyyeva/qwen-doc

hoyyeva/vscode-docs

jmorganca/remove-mlx-imagegen-code

parth-copilot-token-length-defaults

hoyyeva/poolside-windows

laguna-support

jmorganca/harden-markdown-rendering

laguna-renderer-parser

laguna-llamacpp

codex/make-integration-hidden-and-lunchable

brucemacd/omp-docs

pdevine/gguf-mtp-oldstyle

hoyyeva/migrate-pi

hoyyeva/anthropic-local-image-path

parth-launch-codex-app

hoyyeva/anthropic-reference-images-path

parth-anthropic-reference-images-path

brucemacd/download-before-remove

hoyyeva/editor-config-repair

parth-mlx-decode-checkpoints

parth/hide-claude-desktop-till-release

parth-add-claude-code-autoinstall

release_v0.22.0

pdevine/manifest-list

codex/fix-codex-model-metadata-warning

pdevine/addressable-manifest

brucemacd/launch-fetch-reccomended

jmorganca/llama-compat

launch-copilot-cli

release_v0.20.7

parth-auto-save-backup

parth-test

jmorganca/gemma4-audio-replacements

fix-manifest-digest-on-pull

hoyyeva/vscode-improve

brucemacd/install-server-wait

parth/update-claude-docs

brucemac/start-ap-install

pdevine/mlx-update

pdevine/qwen35_vision

drifkin/api-show-fallback

mintlify/image-generation-1773352582

hoyyeva/server-context-length-local-config

jmorganca/faster-reptition-penalties

jmorganca/convert-nemotron

parth-pi-thinking

pdevine/sampling-penalties

jmorganca/fix-create-quantization-memory

dongchen/resumable_transfer_fix

pdevine/sampling-cache-error

jessegross/mlx-usage

hoyyeva/openclaw-config

hoyyeva/app-html

pdevine/qwen3next

brucemacd/sign-sh-install

brucemacd/tui-update

brucemacd/usage-api

jmorganca/launch-empty

fix-app-dist-embed

mxyng/mlx-compile

mxyng/mlx-quant

mxyng/mlx-glm4.7

mxyng/mlx

brucemacd/simplify-model-picker

jmorganca/qwen3-concurrent

fix-glm-4.7-flash-mla-config

drifkin/qwen3-coder-opening-tag

brucemacd/usage-cli

fix-cuda12-fattn-shmem

ollama-imagegen-docs

parth/fix-multiline-inputs

brucemacd/config-docs

mxyng/model-files

mxyng/simple-execute

fix-imagegen-ollama-models

mxyng/async-upload

jmorganca/lazy-no-dtype-changes

imagegen-auto-detect-create

parth/decrease-concurrent-download-hf

fix-mlx-quantize-init

jmorganca/x-cleanup

usage

imagegen-readme

jmorganca/glm-image

mlx-gpu-cd

jmorganca/imagegen-modelfile

parth/agent-skills

parth/agent-allowlist

parth/signed-in-offline

parth/agents

parth/fix-context-chopping

improve-cloud-flow

parth/add-models-websearch

parth/prompt-renderer-mcp

jmorganca/native-settings

jmorganca/download-stream-hash

jmorganca/client2-rebased

brucemacd/oai-chat-req-multipart

jessegross/multi_chunk_reserve

grace/additional-omit-empty

grace/mistral-3-large

mxyng/tokenizer2

mxyng/tokenizer

jessegross/flash

hoyyeva/windows-nacked-app

mxyng/cleanup-attention

grace/deepseek-parser

hoyyeva/remember-unsent-prompt

parth/add-lfs-pointer-error-conversion

parth/olmo2-test2

hoyyeva/ollama-launchagent-plist

nicole/olmo-model

parth/olmo-test

mxyng/remove-embedded

parth/render-template

jmorganca/intellect-3

parth/remove-prealloc-linter

jmorganca/cmd-eval

nicole/nomic-embed-text-fix

mxyng/lint-2

hoyyeva/add-gemini-3-pro-preview

hoyyeva/load-model-list

mxyng/expand-path

mxyng/environ-2

hoyyeva/deeplink-json-encoding

parth/improve-tool-calling-tests

hoyyeva/conversation

hoyyeva/assistant-edit-response

hoyyeva/thinking

origin/brucemacd/invalid-char-i-err

parth/improve-tool-calling

jmorganca/required-omitempty

grace/qwen3-vl-tests

mxyng/iter-client

parth/docs-readme

nicole/embed-test

pdevine/integration-benchstat

parth/remove-generate-cmd

parth/add-toolcall-id

mxyng/server-tests

jmorganca/glm-4.6

jmorganca/gin-h-compat

drifkin/stable-tool-args

pdevine/qwen3-more-thinking

parth/add-websearch-client

nicole/websearch_local

jmorganca/qwen3-coder-updates

grace/deepseek-v3-migration-tests

mxyng/fix-create

jmorganca/cloud-errors

pdevine/parser-tidy

revert-12233-parth/simplify-entrypoints-runner

parth/enable-so-gpt-oss

brucemacd/qwen3vl

jmorganca/readme-simplify

parth/gpt-oss-structured-outputs

revert-12039-jmorganca/tools-braces

mxyng/embeddings

mxyng/gguf

mxyng/benchmark

mxyng/types-null

parth/move-parsing

mxyng/gemma2

jmorganca/docs

mxyng/16-bit

mxyng/create-stdin

pdevine/authorizedkeys

mxyng/quant

parth/opt-in-error-context-window

brucemacd/cache-models

brucemacd/runner-completion

jmorganca/llama-update-6

brucemacd/benchmark-list

brucemacd/partial-read-caps

parth/deepseek-r1-tools

mxyng/omit-array

parth/tool-prefix-temp

brucemacd/runner-test

jmorganca/qwen25vl

brucemacd/model-forward-test-ext

parth/python-function-parsing

jmorganca/cuda-compression-none

drifkin/num-parallel

drifkin/chat-truncation-fix

jmorganca/sync

parth/python-tools-calling

drifkin/array-head-count

brucemacd/create-no-loop

parth/server-enable-content-stream-with-tools

qwen25omni

mxyng/v3

brucemacd/ropeconfig

jmorganca/silence-tokenizer

parth/sample-so-test

parth/sampling-structured-outputs

brucemacd/doc-go-engine

parth/constrained-sampling-json

jmorganca/mistral-wip

brucemacd/mistral-small-convert

parth/sample-unmarshal-json-for-params

brucemacd/jomorganca/mistral

pdevine/bfloat16

jmorganca/mistral

brucemacd/mistral

pdevine/logging

parth/sample-correctness-fix

parth/sample-fix-sorting

jmorgan/sample-fix-sorting-extras

jmorganca/temp-0-images

brucemacd/parallel-embed-models

brucemacd/shim-grammar

jmorganca/fix-gguf-error

bmizerany/nameswork

jmorganca/faster-releases

bmizerany/validatenames

brucemacd/err-no-vocab

brucemacd/rope-config

brucemacd/err-hint

brucemacd/qwen2_5

brucemacd/logprobs

brucemacd/new_runner_graph_bench

progress-flicker

brucemacd/forward-test

brucemacd/go_qwen2

pdevine/gemma2

jmorganca/add-missing-symlink-eval

mxyng/next-debug

parth/set-context-size-openai

brucemacd/next-bpe-bench

brucemacd/next-bpe-test

brucemacd/new_runner_e2e

brucemacd/new_runner_qwen2

pdevine/convert-cohere2

brucemacd/convert-cli

parth/log-probs

mxyng/next-mlx

mxyng/cmd-history

parth/templating

parth/tokenize-detokenize

brucemacd/check-key-register

bmizerany/grammar

jmorganca/vendor-081b29bd

mxyng/func-checks

jmorganca/fix-null-format

parth/fix-default-to-warn-json

jmorganca/qwen2vl

jmorganca/no-concat

parth/cmd-cleanup-SO

brucemacd/check-key-register-structured-err

parth/openai-stream-usage

parth/fix-referencing-so

stream-tools-stop

jmorganca/degin-1

brucemacd/install-path-clean

brucemacd/push-name-validation

brucemacd/browser-key-register

jmorganca/openai-fix-first-message

jmorganca/fix-proxy

jessegross/sample

parth/disallow-streaming-tools

dhiltgen/remove_submodule

jmorganca/ga

jmorganca/mllama

pdevine/newlines

pdevine/geems-2b

jmorganca/llama-bump

mxyng/modelname-7

mxyng/gin-slog

mxyng/modelname-6

jyan/convert-prog

jyan/quant5

paligemma-support

pdevine/import-docs

jmorganca/openai-context

jyan/paligemma

jyan/p2

jyan/palitest

bmizerany/embedspeedup

jmorganca/llama-vit

brucemacd/allow-ollama

royh/ep-methods

royh/whisper

mxyng/api-models

mxyng/fix-memory

jyan/q4_4/8

jyan/ollama-v

royh/stream-tools

roy-embed-parallel

bmizerany/hrm

revert-5963-revert-5924-mxyng/llama3.1-rope

royh/embed-viz

jyan/local2

jyan/auth

jyan/local

jyan/parse-temp

jmorganca/template-mistral

jyan/reord-g

royh-openai-suffixdocs

royh-imgembed

royh-embed-parallel

jyan/quant4

royh-precision

jyan/progress

pdevine/fix-template

jyan/quant3

pdevine/ggla

mxyng/update-registry-domain

jmorganca/ggml-static

mxyng/create-context

jyan/v0.146

mxyng/layers-from-files

build_dist

bmizerany/noseek

royh-ls

royh-name

timeout

mxyng/server-timestamp

bmizerany/nosillyggufslurps

royh-params

jmorganca/llama-cpp-7c26775

royh-openai-delete

royh-show-rigid

jmorganca/enable-fa

jmorganca/no-error-template

jyan/format

royh-testdelete

bmizerany/fastverify

language_support

pdevine/ps-glitches

brucemacd/tokenize

bruce/iq-quants

bmizerany/filepathwithcoloninhost

mxyng/split-bin

bmizerany/client-registry

jmorganca/if-none-match

native

jmorganca/native

jmorganca/batch-embeddings

jmorganca/initcmake

jmorganca/mm

pdevine/showggmlinfo

modenameenforcealphanum

bmizerany/modenameenforcealphanum

jmorganca/done-reason

jmorganca/llama-cpp-8960fe8

ollama.com

bmizerany/filepathnobuild

bmizerany/types/model/defaultfix

rmdisplaylong

nogogen

bmizerany/x

modelfile-readme

bmizerany/replacecolon

jmorganca/limit

jmorganca/execstack

jmorganca/replace-assets

mxyng/tune-concurrency

jmorganca/testing

whitespace-detection

jmorganca/options

upgrade-all

scratch

cuda-search

mattw/airenamer

mattw/allmodelsonhuggingface

mattw/quantcontext

mattw/whatneedstorun

brucemacd/llama-mem-calc

mattw/faq-context

mattw/communitylinks

mattw/noprune

mattw/python-functioncalling

rename

mxyng/install

pulse

remove-first

editor

mattw/selfqueryingretrieval

cgo

mattw/howtoquant

api

matt/streamingapi

format-config

mxyng/extra-args

shell

update-nous-hermes

cp-model

upload-progress

fix-unknown-model

fix-model-names

delete-fix

insecure-registry

ls

deletemodels

progressbar

readme-updates

license-layers

skip-list

list-models

modelpath

matt/examplemodelfiles

distribution

go-opts

1 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: github-starred/ollama#63497