[GH-ISSUE #9041] How can I set Shared GPU Memory to less than 1G when creating a model using GGUF model in Ollama? #5884

Closed
opened 2026-04-12 17:13:11 -05:00 by GiteaMirror · 5 comments

Originally created by @shaken154 on GitHub (Feb 12, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9041

Computer configuration: CPU: Intel i5-12400; RAM: 192 GB; GPU: RTX 2080 Ti 22 GB. Model: DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf.
When I load the model in LM Studio with these settings: context length 8000, GPU offload 39 layers, CPU thread pool size 10, evaluation batch size 512, temperature 0.7, top-k sampling 40, top-p sampling 0.95, repeat penalty 1.1, then after the model is loaded, Task Manager ("Performance" - "GPU") shows "Dedicated GPU memory" at 21.6 GB and "Shared GPU memory" at 0.3 GB.
However, when I create the same GGUF model with Ollama, using the following Modelfile:
FROM ./DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf
PARAMETER num_gpu 39
PARAMETER num_ctx 4096
PARAMETER temperature 0.7
PARAMETER num_predict 512
PARAMETER top_k 40
PARAMETER top_p 0.95
PARAMETER repeat_penalty 1.1
TEMPLATE "<|User|>{{ .System }} {{ .Prompt }}<|Assistant|>"
The model works, but "Dedicated GPU memory" is 21.3 GB and "Shared GPU memory" is 20.6 GB.
With the same GGUF model, can Ollama also be made to run with "Shared GPU memory" below 1 GB?

By contrast, with the merged model DeepSeek-R1-UD-IQ1_M.gguf, when num_gpu is set to 6, "Shared GPU memory" stays at 0.3 GB and the rest of the model is loaded into system memory.

These model parameters are configured in different ways. What are the categories, and how is each configured? I searched the internet for settings and environment variables to try, and they didn't work. Could you dedicate a page in the docs to creating models and how to configure them? What other environment variables can I set, and how should they be combined? That would let the average user with an 8 GB or 16 GB graphics card and plenty of system memory (up to 192 GB) run larger models as far as possible.
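A quick way to see how ollama actually split a loaded model between VRAM and system RAM is ollama ps (assuming a reasonably recent release; the output below is illustrative only, not taken from this machine):

    ollama ps

Illustrative output:

    NAME               ID            SIZE     PROCESSOR          UNTIL
    deepseek-r1-70b    xxxxxxxxxxxx  45 GB    52%/48% CPU/GPU    4 minutes from now

The PROCESSOR column shows what fraction of the model ended up in system RAM versus on the GPU.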


@rick-github commented on GitHub (Feb 12, 2025):

Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will show why ollama is using more RAM. At a guess, you haven't set OLLAMA_NUM_PARALLEL=1, so ollama is using a default of 4, so it's using quadruple the amount of context memory that LM Studio is using.

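For scale, a rough back-of-the-envelope estimate, assuming DeepSeek-R1-Distill-Llama-70B keeps the Llama-3 70B geometry (80 layers, 8 KV heads, head dimension 128): an f16 KV cache costs about 2 × 80 × 8 × 128 × 2 bytes, roughly 320 KB per token, so num_ctx 4096 needs about 1.3 GB of context memory, and the default of 4 parallel slots about 5 GB, before counting any weights.

A minimal sketch of pinning the server to a single slot on a Windows install (OLLAMA_NUM_PARALLEL is a documented variable; setx from a command prompt is just one way to set it, the System Properties dialog works equally well):

    setx OLLAMA_NUM_PARALLEL 1

After that, fully quit Ollama from the tray icon and start it again, since setx only affects newly started processes.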

@shaken154 commented on GitHub (Feb 12, 2025):

> Server logs will show why ollama is using more RAM. At a guess, you haven't set OLLAMA_NUM_PARALLEL=1, so ollama is using a default of 4, so it's using quadruple the amount of context memory that LM Studio is using.

In the System Properties environment variables I added: OLLAMA_KEEP_ALIVE=-1, OLLAMA_NUM_PARALLEL=1, OLLAMA_SHAREDMEM=1, OLLAMA_GPUMEM=21, OLLAMA_GPU_OVERHEAD=1073741824, OLLAMA_GPU_LAYER=auto_split.
Restarting Ollama after that made no difference.
Where is the official tutorial for configuring environment variables?

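A quick way to check which variables the installed build actually recognizes (a sketch, assuming a reasonably recent release; the help text for the serve command lists the environment variables the server reads):

    ollama serve --help

Names that do not appear there, such as OLLAMA_SHAREDMEM, OLLAMA_GPUMEM and OLLAMA_GPU_LAYER, are simply ignored. The FAQ ("How do I configure Ollama server?") also covers setting these variables on Windows: https://github.com/ollama/ollama/blob/main/docs/faq.md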

@rick-github commented on GitHub (Feb 12, 2025):

Where are OLLAMA_SHAREDMEM, OLLAMA_GPUMEM and OLLAMA_GPU_LAYER from? Why have you set OLLAMA_GPU_OVERHEAD (https://github.com/ollama/ollama/blob/49df03da9af6b0050ebbf50676f7db569a2b54d9/envconfig/config.go#L239)?


@shaken154 commented on GitHub (Feb 12, 2025):

> Where are OLLAMA_SHAREDMEM, OLLAMA_GPUMEM and OLLAMA_GPU_LAYER from? Why have you set OLLAMA_GPU_OVERHEAD?

Now only OLLAMA_KEEP_ALIVE=-1 is left in the environment variables.
How can Ollama be made to use less than 1 GB of "Shared GPU memory"? It would be best to implement this at model creation time, since not every model needs to be set up with "shared GPU memory".
Why can LM Studio simply find the right number of GPU offload layers when loading the model, so that it can be used directly?
If it weren't for the fact that some AI tools only support Ollama, I would only use LM Studio.
The reason I want to solve this is that DeepSeek-R1-70B feels too slow when it uses a lot of "shared GPU memory", while DeepSeek-R1-671B-1.73bit runs at about the same speed for me; given how large the size difference between the two models is, the 70B must not be configured well.


@rick-github commented on GitHub (Feb 12, 2025):

> It would be best to implement this at model creation time, since not every model needs to be set up with "shared GPU memory".

It can't be set at model creation time; it depends on the GPU the model is being loaded into. "Shared GPU memory" is a feature of Nvidia cards that allows the GPU to use system RAM if VRAM is full.

> Why can LM Studio simply find the right number of GPU offload layers when loading the model, so that it can be used directly?

ollama can also find the right number of GPU layers to offload, but you overrode that when you put PARAMETER num_gpu 39 in the Modelfile.

> The reason I want to solve this is that DeepSeek-R1-70B feels too slow when it uses a lot of "shared GPU memory"

If we could see what ollama was doing, we could help you fix your problem. It's too bad that there isn't some sort of log that contains details of how ollama is calculating memory and assigning layers.
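Concretely, a hedged sketch of the same Modelfile with the num_gpu override removed, so that ollama's own estimator decides how many layers fit in the 22 GB of VRAM (combined with OLLAMA_NUM_PARALLEL=1, this is the closest equivalent of the LM Studio setup described in the issue):

    FROM ./DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf
    # no "PARAMETER num_gpu" line: let ollama choose the number of offloaded layers
    PARAMETER num_ctx 4096
    PARAMETER temperature 0.7
    PARAMETER num_predict 512
    PARAMETER top_k 40
    PARAMETER top_p 0.95
    PARAMETER repeat_penalty 1.1
    TEMPLATE "<|User|>{{ .System }} {{ .Prompt }}<|Assistant|>"

The server log linked above should then show how much memory ollama estimated and how many layers it decided to offload.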

Reference: github-starred/ollama#5884