[GH-ISSUE #6876] Why don't models use full CPU power? #66383

Closed
opened 2026-05-04 03:23:24 -05:00 by GiteaMirror · 8 comments

Originally created by @iladshyan on GitHub (Sep 19, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6876

I have noticed that in CPU-only use cases the models are not using the CPU to its full potential. Is there any way to make them utilize the full power?


@wrapss commented on GitHub (Sep 19, 2024):

Try increasing the `num_thread` option: https://github.com/ollama/ollama/blob/main/docs/api.md#request-7

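For readers landing here from search, a minimal sketch of such a request: `num_thread` goes in the `options` object of the request body described in the linked API docs (the model name is the one used later in this thread, and the thread count of 8 is only an illustration):

```console
$ curl -s localhost:11434/api/generate -d '{
    "model": "llama3.1:8b",
    "prompt": "hello",
    "stream": false,
    "options": { "num_thread": 8 }
  }'
```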

@iladshyan commented on GitHub (Sep 19, 2024):

> Try increasing the `num_thread` option: https://github.com/ollama/ollama/blob/main/docs/api.md#request-7

Is there a way to do that in the CLI?


@wrapss commented on GitHub (Sep 19, 2024):

> Is there a way to do that in the CLI?

`/set parameter num_thread x`


@iladshyan commented on GitHub (Sep 19, 2024):

> > Is there a way to do that in the CLI?
>
> `/set parameter num_thread x`

Thank you so much for the fast reply. I'll check it out.


@iladshyan commented on GitHub (Sep 20, 2024):

> > Is there a way to do that in the CLI?
>
> `/set parameter num_thread x`

Well, it is giving me an error:

```console
'/set' is not recognized as an internal or external command,
operable program or batch file.
```

`set num_thread = 4` is not giving me an error, but it isn't working either.

I'm on Windows, btw (the command doesn't work in WSL either).


@iladshyan commented on GitHub (Sep 20, 2024):

I found this in the logs, if it helps:

```console
INFO [wmain] system info | n_threads=2 n_threads_batch=2 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="2296" timestamp=1726806830 total_threads=4
```

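For context on the log line above: `n_threads=2` with `total_threads=4` is consistent with a default of one thread per physical core on a 2-core, 4-thread CPU. Assuming a default Windows install (log location per ollama's troubleshooting docs; whether the runner command line appears there depends on the ollama version), the thread count the server actually starts with can be checked with something like:

```console
C:\> findstr /C:"--threads" "%LOCALAPPDATA%\Ollama\server.log"
```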

@rick-github commented on GitHub (Sep 20, 2024):

You run the `/set` command inside the ollama CLI:

```console
C:\> ollama run llama3.1:8b
>>> /set parameter num_thread 16
Set parameter 'num_thread' to '16'
>>> hello
Hello! How can I assist you today?

>>> /bye
```

Or you can set it in the API:

```console
$ curl -s localhost:11434/api/generate -d '{"model":"llama3.1:8b","options":{"num_thread":16},"prompt":"hello","stream":false}' | jq .response
"Hello! How are you today? Is there something I can help you with or would you like to chat?"
```

Or you can create a new model with a different default thread count (Linux commands, but the same principle applies on Windows):

```console
$ echo FROM llama3.1:8b > Modelfile
$ echo PARAMETER num_thread 16 >> Modelfile
$ ollama create llama3.1:8b-16t
$ ollama run llama3.1:8b-16t hello
Hello! How are you today? Is there something I can help you with, or would you like to chat?

$ docker logs ollama 2>&1 | grep -e --threads
time=2024-09-20T11:38:32.793Z level=INFO source=server.go:388 msg="starting llama server" cmd="/usr/lib/ollama/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --verbose --threads 16 --parallel 1 --port 38907"
```
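The same `options` field is accepted by the chat endpoint as well; a minimal sketch assuming the `/api/chat` request shape from the same API docs (the thread count of 16 just mirrors the examples above):

```console
$ curl -s localhost:11434/api/chat -d '{
    "model": "llama3.1:8b",
    "messages": [{ "role": "user", "content": "hello" }],
    "stream": false,
    "options": { "num_thread": 16 }
  }'
```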

@iladshyan commented on GitHub (Sep 20, 2024):

> You run the `/set` command inside the ollama CLI:
>
> ```console
> C:\> ollama run llama3.1:8b
> >>> /set parameter num_thread 16
> Set parameter 'num_thread' to '16'
> >>> hello
> Hello! How can I assist you today?
>
> >>> /bye
> ```
>
> Or you can set it in the API:
>
> ```console
> $ curl -s localhost:11434/api/generate -d '{"model":"llama3.1:8b","options":{"num_thread":16},"prompt":"hello","stream":false}' | jq .response
> "Hello! How are you today? Is there something I can help you with or would you like to chat?"
> ```
>
> Or you can create a new model with a different default thread count (Linux commands, but the same principle applies on Windows):
>
> ```console
> $ cat > Modelfile <<EOF
> FROM llama3.1:8b
> PARAMETER num_thread 16
> EOF
> $ ollama create llama3.1:8b-16t
> $ ollama run llama3.1:8b-16t hello
> Hello! How are you today? Is there something I can help you with, or would you like to chat?
>
> $ docker logs ollama 2>&1 | grep -e --threads
> time=2024-09-20T11:38:32.793Z level=INFO source=server.go:388 msg="starting llama server" cmd="/usr/lib/ollama/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --verbose --threads 16 --parallel 1 --port 38907"
> ```

This works, thanks!

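For Windows users following the Modelfile route from the answer above, a rough cmd equivalent (the tag `llama3.1:8b-16t` is just the one used above, and `-f Modelfile` makes the file path explicit):

```console
C:\> (echo FROM llama3.1:8b & echo PARAMETER num_thread 16) > Modelfile
C:\> ollama create llama3.1:8b-16t -f Modelfile
C:\> ollama run llama3.1:8b-16t hello
```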