[GH-ISSUE #10124] How to allocate more to the GPU? #68700

Closed
opened 2026-05-04 14:53:39 -05:00 by GiteaMirror · 5 comments

Originally created by @khteh on GitHub (Apr 4, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10124

```
$ ollama ps
NAME               ID              SIZE     PROCESSOR         UNTIL
llama3.3:latest    a6eb4748fd29    49 GB    93%/7% CPU/GPU    About a minute from now
```
```
[Service]
Environment="OLLAMA_SCHED_SPREAD=true"
Environment="OLLAMA_CONTEXT_LENGTH=8192"
Environment="OLLAMA_DEBUG=true"
```

`nvtop`:

![Image](https://github.com/user-attachments/assets/4d343fcf-dcbd-4506-ac74-36333cbad661)


@sieveLau commented on GitHub (Apr 4, 2025):

Please refer to this: https://github.com/ollama/ollama/issues/9818#issuecomment-2741331465


@khteh commented on GitHub (Apr 4, 2025):

I still don't understand what I need to set / change.


@sieveLau commented on GitHub (Apr 4, 2025):

I assume you are using an Nvidia GPU on Linux.

  1. Run the model and use `nvidia-smi` to see how much VRAM it eats and how much is free.
  2. Use `ollama show --modelfile llama3.3:latest > llama3.3-big` to create a modelfile.
  3. Modify `FROM` according to the comment in the file, and add one line, `PARAMETER num_gpu 99`, at the end of the file. You may adjust this number according to your VRAM.
  4. Run `ollama create -f llama3.3-big llama3.3-big`; you now have a model named `llama3.3-big` that tries to load all layers onto the GPU.

I see you set the context length to 8192 for all models; if you use this `num_gpu` method to force layers onto the GPU, you may hit a CUDA OOM. Adjust accordingly.

I realize llama3.3 is a 70B model, so this is just an example of how to allocate more to the GPU. You will have to experiment with the number of layers (a command-line sketch of these steps follows below).
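
A minimal sketch of the steps above (the model names and `num_gpu 99` come from the instructions here; appending with `echo >>` is just one way to edit the file, and you still need to adjust `FROM` as the exported file's comment says):

```
# 1. With the model loaded, check how much VRAM is used and how much is free.
nvidia-smi

# 2. Export the current modelfile.
ollama show --modelfile llama3.3:latest > llama3.3-big

# 3. Edit FROM as the comment at the top of the exported file instructs,
#    then append the layer-offload parameter (lower 99 if you hit CUDA OOM).
echo "PARAMETER num_gpu 99" >> llama3.3-big

# 4. Build the new model and run it.
ollama create -f llama3.3-big llama3.3-big
ollama run llama3.3-big
```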


@khteh commented on GitHub (Apr 4, 2025):

How do I check the current `num_gpu`? From the picture, it seems there are only 2 GPUs available. Isn't that the right number to set `num_gpu` to?


@sieveLau commented on GitHub (Apr 4, 2025):

> How do I check the current `num_gpu`? From the picture, it seems there are only 2 GPUs available. Isn't that the right number to set `num_gpu` to?

This parameter does not mean the number of physical GPUs; it is the number of LAYERS of the model that you want loaded onto the GPU. So start from a small number and increase it as long as there is no OOM.
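
For example, a rough sketch of that trial-and-error, assuming the default API endpoint at `localhost:11434` (40 is just an arbitrary starting value):

```
# Show the parameters currently set in the modelfile
# (num_gpu will not appear unless you added it yourself).
ollama show --parameters llama3.3:latest

# Request 40 GPU layers for a single call via the API, then watch
# `ollama ps` / nvidia-smi and raise the number until you approach OOM.
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.3:latest",
  "prompt": "why is the sky blue?",
  "options": { "num_gpu": 40 }
}'
```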

Reference: github-starred/ollama#68700