[GH-ISSUE #6879] Phi3:mini is using only cpu, llama3:8b is using cpu and gpu, want to enforce only gpu usage. #50860

Closed
opened 2026-04-28 17:16:22 -05:00 by GiteaMirror · 4 comments

Originally created by @KaloyanGeorgiev99 on GitHub (Sep 19, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6879

What is the issue?

Below are my specs, a comparison between two Ollama models, and the script I use to run Ollama locally. **_My questions are:_** Why is phi3:mini using only the CPU while llama3:8b uses both the CPU and GPU, and how can I enforce GPU-only usage for both models? I have already tried the suggestion of adding the directory containing the CUDA DLLs to the PATH variable to force GPU usage.

OLLAMA_SCRIPT:
![ollamascript](https://github.com/user-attachments/assets/1b49ccd5-a1d3-42be-abb4-c8aeb860cb6b)

CUDA:
![NVIDIA_cuda](https://github.com/user-attachments/assets/96d361a6-00c7-4990-aee7-f6ccc6721dd9)

GPU:
![nvidia](https://github.com/user-attachments/assets/7a5c2fbb-f701-4160-93c7-e797079383ef)

PROCESSOR:
12th Gen Intel(R) Core(TM) i7-12850HX, 2100 MHz, 16 Cores, 24 Logical Processors

LLAMA3:8B:
![llama3](https://github.com/user-attachments/assets/fff3fd9d-3ae9-4812-b73e-e489d28ea97d)

PHI3:MINI:
![phi3-mini](https://github.com/user-attachments/assets/0642e51d-3c08-4a8b-bb71-a48bf8b850c0)

PHI3:MINI-OLLAMA OUTPUT:
![ollamaoutput](https://github.com/user-attachments/assets/e65b2c6c-ec7d-4193-9514-2448de255a17)
![ollamaoutput1](https://github.com/user-attachments/assets/d743ef98-f29d-4476-ac41-43c88804518c)

LLAMA3:8B-OLLAMA OUTPUT:
![ollamaoutput2](https://github.com/user-attachments/assets/f8080260-cbca-4606-985d-4af7772aff0b)
![ollamaoutput3](https://github.com/user-attachments/assets/74e4f43b-03ef-44fb-91ed-4b303ec868e3)

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.3.11

GiteaMirror added the bug label 2026-04-28 17:16:23 -05:00

@rick-github commented on GitHub (Sep 20, 2024):

There is no way you are going to get these models to run GPU-only on this hardware. You have 12 GB of VRAM on your GPU, llama3 wants 31 GB and phi3 wants 52 GB. This is driven by the context size you have chosen: a 128K context consumes all the VRAM, leaving little to none for the model weights. You will be able to fit these models on the GPU if you reduce the context size and also terminate the other programs (ms-teams, SnagitEditor, TextInputHost) that are using the GPU.
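
A quick way to check the split and experiment with a smaller context (a sketch only; `ollama ps` and the interactive `/set parameter` command exist in the Ollama version reported here, and the model name is just the one from this issue):

```
# Show loaded models and how each is split between CPU and GPU
# (the PROCESSOR column reads e.g. "100% GPU" or "43%/57% CPU/GPU")
ollama ps

# Lower the context window for the current interactive session,
# then send a prompt and re-run `ollama ps` from another terminal
ollama run phi3:mini
>>> /set parameter num_ctx 2048
```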


@xiaohan815 commented on GitHub (Sep 21, 2024):

> There is no way you are going to get these models to run GPU-only on this hardware. You have 12 GB of VRAM on your GPU, llama3 wants 31 GB and phi3 wants 52 GB. This is driven by the context size you have chosen: a 128K context consumes all the VRAM, leaving little to none for the model weights. You will be able to fit these models on the GPU if you reduce the context size and also terminate the other programs (ms-teams, SnagitEditor, TextInputHost) that are using the GPU.

Can you tell me how to reduce the context size?


@rick-github commented on GitHub (Sep 21, 2024):

The default context size is 2048. If the logs indicate it's more than that, it's either because the model has been configured with a different context size, or because a client has set `num_ctx` in the API call.

If it's a model with a different context size, you can create a new model with a smaller one:

```
$ cat > Modelfile <<EOF
FROM model-with-large-context
PARAMETER num_ctx 2048
EOF
$ ollama create model-with-2k-context
```

If it's a client setting the size, you need to configure the client to use a smaller context size. If you supply more information about your client, we can offer more concrete advice on how to adjust it.
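
For example, if the client talks to the REST API directly, the context size is set per request in the `options` field; a minimal sketch against a default local install (model and prompt here are placeholders):

```
curl http://localhost:11434/api/generate -d '{
  "model": "llama3:8b",
  "prompt": "Why is the sky blue?",
  "options": { "num_ctx": 2048 }
}'
```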


@KaloyanGeorgiev99 commented on GitHub (Sep 24, 2024):

The question was answered; closing the issue.
