[GH-ISSUE #1101] [Question]: Use all CPU resource from Docker CPU image #78226

Closed
opened 2026-05-08 22:19:37 -05:00 by GiteaMirror · 2 comments

Originally created by @LWJerri on GitHub (Nov 12, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1101

Hi. I have a dedicated server with an Intel® Core™ i5-13500 processor ([more info here](https://www.hetzner.com/dedicated-rootserver/ex44)), but Ollama uses only about 50% of the available CPU. What do I need to do to use all CPU resources? I'm using Docker to run Ollama; here is my `docker-compose.yaml`:

```yaml
version: "3.7"

services:
  api-ollama:
    restart: always
    image: ollama/ollama:latest
    networks:
      - caddy
    volumes:
      - ollama:/root/.ollama
    labels:
      caddy: api.ollama.main.lwjerri.dev
      caddy.reverse_proxy: "{{upstreams 11434}}"

volumes:
  ollama:

networks:
  caddy:
    external: true
```

![image](https://github.com/jmorganca/ollama/assets/50290430/5941f603-bdd4-49a4-9483-5ab590c0e502)

Thanks for any help <3


@easp commented on GitHub (Nov 17, 2023):

First, I think that by default Ollama limits CPU threads to a roughly optimal value. This can be changed in the Modelfile with the `num_thread` parameter. However, you may want to decrease it rather than increase it.
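For reference, the thread count is set with a `PARAMETER num_thread` line in a Modelfile. Below is a minimal Python sketch that renders such a file. The helper name `render_modelfile`, the base model `llama2`, and the value 6 are illustrative assumptions, not settings taken from this thread's server.

```python
def render_modelfile(base_model: str, num_thread: int) -> str:
    """Return Modelfile text that pins the CPU thread count for a model."""
    if num_thread < 1:
        raise ValueError("num_thread must be at least 1")
    return f"FROM {base_model}\nPARAMETER num_thread {num_thread}\n"


if __name__ == "__main__":
    # Hypothetical example: a 6-thread variant of an assumed base model.
    print(render_modelfile("llama2", 6), end="")
```

Saving the output as `Modelfile` and running `ollama create NAME -f Modelfile` should produce a model variant that uses the chosen thread count.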

That CPU has 6 performance cores, each with two virtual cores (hyper-threads), plus 8 efficiency cores. Virtual cores are most useful when running workloads that have multiple threads or processes that are waiting for cache misses to be fulfilled from higher-level caches or main memory. As I understand it, the memory access pattern of generation/inference in LLMs is a series of long sequential reads of memory, and this is limited primarily by bandwidth to main memory. Having multiple virtual cores will just cause contention and reduce efficiency. The same is true for the efficiency cores: if the performance cores can saturate the available memory bandwidth (as they likely can), then the efficiency cores will just cause contention and reduce efficiency. Finally, there can be issues scheduling multiple threads across dissimilar cores that cause the performance cores to wait for the efficiency cores; I'm not sure whether this is a significant issue for Ollama/llama.cpp.
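To make the bandwidth argument concrete, here is a rough back-of-the-envelope sketch. It assumes that generating one token streams all model weights from main memory once, so the token rate cannot exceed bandwidth divided by model size. The 4 GB model size and 50 GB/s bandwidth are hypothetical figures, not measurements from this server.

```python
def max_tokens_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Rough upper bound, assuming each token reads every weight from RAM once."""
    return bandwidth_bytes_per_s / model_bytes


if __name__ == "__main__":
    # Hypothetical figures for illustration only.
    model_bytes = 4e9   # about 4 GB of quantized weights
    bandwidth = 50e9    # about 50 GB/s of memory bandwidth
    print(f"{max_tokens_per_second(model_bytes, bandwidth):.1f} tokens/s at most")
```

Under that assumption, once a few cores are enough to keep the memory bus busy, extra threads cannot raise the ceiling; they can only add contention.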

If cores are waiting for memory access I believe they will show as being 100% utilized in most high-level performance dashboards, so that's not going to be a good indication of whether or not you are maximizing your use of the available resources. You'd need something that looks at more granular performance counters.

Ollama tries to pick a thread count that will give optimal performance. If you think you can do better, what I'd do is use `ollama run MODEL --verbose` to get a measure of the "eval rate" across multiple runs. Then I'd adjust `num_thread` and see whether the eval rate increases or decreases. Actually, what I'd probably do is leave it alone, trust that the setting is good enough, and spend my time on other things.
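The same comparison can be scripted against Ollama's HTTP API instead of the CLI. The sketch below is one possible approach, not an official benchmark: it sends a non-streaming request to `/api/generate` with a `num_thread` option and computes tokens per second from the `eval_count` and `eval_duration` (nanoseconds) fields of the response. The URL, model name, prompt, and thread counts are assumptions to adapt to your own setup.

```python
import json
import urllib.request

# Assumed address; adjust if the container is published on another host or port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model: str, prompt: str, num_thread: int) -> dict:
    """Build a non-streaming generate request that overrides the thread count."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_thread": num_thread},
    }


def eval_rate(response: dict) -> float:
    """Tokens per second from eval_count and eval_duration (nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def measure(model: str, prompt: str, num_thread: int, runs: int = 3) -> float:
    """Average the eval rate over several runs for one thread count."""
    rates = []
    for _ in range(runs):
        body = json.dumps(build_payload(model, prompt, num_thread)).encode()
        request = urllib.request.Request(
            OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(request) as resp:
            rates.append(eval_rate(json.load(resp)))
    return sum(rates) / len(rates)


if __name__ == "__main__":
    # Hypothetical model, prompt, and thread counts.
    for threads in (4, 6, 8, 12):
        rate = measure("llama2", "Why is the sky blue?", threads)
        print(f"num_thread={threads}: {rate:.2f} tokens/s")
```

Averaging over several runs matters because single runs can vary; the first request for a model also includes load time, which is reported separately and does not affect the eval rate.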


@m0wer commented on GitHub (Dec 20, 2023):

Right. Thanks for the explanation!

With a 6C/12T CPU, the default number of threads is 6. With partial GPU offloading (but still CPU-bottlenecked), I get 15 t/s. With `num_thread 12` in the model, it drops to 3 t/s.
