[GH-ISSUE #5659] Using both CPU + GPU for Parallel Models #3529

Open
opened 2026-04-12 14:13:59 -05:00 by GiteaMirror · 3 comments

Originally created by @owenzhao on GitHub (Jul 13, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5659

Originally assigned to: @dhiltgen on GitHub.

Starting with 0.2, Ollama supports running models in parallel, which makes memory more valuable than before. Moreover, on operating systems like Windows, if we use a GPU, its memory is fixed unless we purchase another one.

Fortunately, system memory is much cheaper than a replacement GPU, and Ollama can run on the CPU alone at a lower speed. So if we could use the CPU and GPU at the same time, we could add more system memory and run more models in parallel.

The approach could be:

  1. Use the GPU only, but use system memory as a model cache. That would make switching models faster.
  2. Use the GPU first; once GPU memory is full, fall back to the CPU.

That is all I have thought of so far. More ideas are appreciated.
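
The second idea (GPU first, CPU once GPU memory is full) could be sketched on the client side roughly as below. This is only an illustration of the proposal, not how Ollama schedules models: `plan_placements`, the `Placement` type, and the whole-model "fits or does not fit" rule are hypothetical, and model sizes and free GPU memory are assumed to be known in advance. The only piece taken from this thread is the `num_gpu: 0` option that forces CPU mode.

```python
from dataclasses import dataclass, field


@dataclass
class Placement:
    model: str
    # Extra request options; {"num_gpu": 0} forces CPU mode, {} keeps the default.
    options: dict = field(default_factory=dict)


def plan_placements(models, gpu_free_bytes):
    """Greedy GPU-first policy (hypothetical).

    `models` is a list of (name, size_in_bytes) pairs in load order.
    A model stays on the default (GPU) path while it still fits in the
    remaining GPU memory; otherwise it is pinned to the CPU.
    """
    placements = []
    remaining = gpu_free_bytes
    for name, size_bytes in models:
        if size_bytes <= remaining:
            remaining -= size_bytes
            placements.append(Placement(name))
        else:
            placements.append(Placement(name, {"num_gpu": 0}))
    return placements
```

The resulting `options` dictionary would then be passed along with each request, so that only the models that do not fit are sent to the CPU.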

GiteaMirror added the feature request label 2026-04-12 14:13:59 -05:00

@dhiltgen commented on GitHub (Jul 23, 2024):

We don't have a streamlined UX for this; however, if you set num_gpu to 0, this forces CPU mode for the model and allows you to load it as a secondary model.

For example, on a system with an 8 GB GPU:

% ollama run llama2:13b hello
...
% curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "hello?",
  "stream": false, "options": {"num_gpu": 0 }
}'
...
% ollama ps
NAME         	ID          	SIZE  	PROCESSOR      	UNTIL
llama3:latest	365c0bd3c000	5.2 GB	100% CPU       	4 minutes from now
llama2:13b   	d475bf4c50bc	9.9 GB	19%/81% CPU/GPU	4 minutes from now
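
The same request can be issued from a script. The following is a minimal sketch using only the Python standard library; the endpoint, model name, and `num_gpu` option come from the curl example above, while the helper names `build_generate_request` and `send` are made up for illustration.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # address used in the curl example


def build_generate_request(model, prompt, force_cpu=False, base_url=OLLAMA_URL):
    """Build a non-streaming /api/generate request.

    When force_cpu is True, the request carries options.num_gpu = 0,
    which forces CPU mode for the model as described above.
    """
    payload = {"model": model, "prompt": prompt, "stream": False}
    if force_cpu:
        payload["options"] = {"num_gpu": 0}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def send(request):
    """Send the request to a running Ollama server and decode the JSON reply."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

With a server running, `send(build_generate_request("llama3", "hello?", force_cpu=True))` should be equivalent to the curl call, after which `ollama ps` can be used to confirm where the model was loaded.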

@owenzhao commented on GitHub (Jul 24, 2024):

@dhiltgen Thanks for your reply. However, the method you provided requires deciding in advance which model uses the CPU. My suggestion is to use the GPU and CPU dynamically.


@inspire22 commented on GitHub (Nov 13, 2024):

I'm trying to get this working to help with parallel processing, but when I run the second command, ollama ps shows that the other (CPU vs. GPU) instance was closed:

curl http://localhost:11434/api/embed -d '{
  "model": "all-minilm", "options": {"num_gpu": 0 },
  "input": ["Why is the sky blue?","my second string here y ou go"]
}'
ollama ps shows:
NAME                 ID              SIZE     PROCESSOR    UNTIL
all-minilm:latest    1b226e2802db    25 MB    100% CPU     4 minutes from now

curl http://localhost:11434/api/embed -d '{
  "model": "all-minilm", "options": {"num_gpu": 1 },
  "input": ["Why is the sky blue?","my second string here y ou go"]
}'

ollama ps shows:
NAME                 ID              SIZE      PROCESSOR         UNTIL
all-minilm:latest    1b226e2802db    504 MB    4%/96% CPU/GPU    4 minutes from now

So there doesn't seem to be any improved throughput, since it closes one instance before opening the next.
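
To check this from a script rather than by reading the ollama ps output by hand, something like the sketch below could be used. It assumes the server's `/api/ps` endpoint returns a `models` list whose entries carry `name`, `size`, and `size_vram` fields, and that the PROCESSOR column is derived from the ratio of the two sizes; the helper names are hypothetical, and the rounding may not match the CLI exactly.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # address used in the commands above


def processor_split(size, size_vram):
    """Describe the CPU/GPU split of a loaded model from its byte sizes."""
    if size <= 0 or size_vram > size:
        return "Unknown"
    if size_vram == 0:
        return "100% CPU"
    if size_vram == size:
        return "100% GPU"
    cpu_percent = round((size - size_vram) / size * 100)
    return f"{cpu_percent}%/{100 - cpu_percent}% CPU/GPU"


def summarize_loaded(ps_response):
    """Turn a decoded /api/ps response into (name, split) pairs."""
    return [
        (m["name"], processor_split(m["size"], m.get("size_vram", 0)))
        for m in ps_response.get("models", [])
    ]


def fetch_loaded(base_url=OLLAMA_URL):
    """Query a running server for its currently loaded models."""
    with urllib.request.urlopen(f"{base_url}/api/ps") as resp:
        return summarize_loaded(json.load(resp))
```

Calling `fetch_loaded()` after each embed request would show whether the list ever contains two entries for all-minilm; in the runs above, only one row was present each time.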

Reference: github-starred/ollama#3529