[GH-ISSUE #6407] vulkan: Please add an easy way to automatically load only layers that can fit into the GPU #50536

Closed
opened 2026-04-28 16:15:32 -05:00 by GiteaMirror · 7 comments

Originally created by @yurivict on GitHub (Aug 18, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6407

Originally assigned to: @dhiltgen on GitHub.

There are models that can only fit partially into the GPU.

Currently, such models are difficult to use because the user must know upfront how many layers will fit on the GPU.

I use this script to add the `num_gpu` parameter:

```
#!/bin/sh
# Create a model variant with an explicit num_gpu parameter.

MODEL=$1
NUM_GPU=$2

if [ -z "$MODEL" ] || [ -z "$NUM_GPU" ]; then
        echo "Usage: $0 <model> <num_gpu>"
        exit 1
fi

ollama show --modelfile "$MODEL" > Modelfile &&
echo "PARAMETER num_gpu $NUM_GPU" >> Modelfile &&
ollama create "$MODEL-num_gpu$NUM_GPU" -f Modelfile &&
echo "model variant $MODEL-num_gpu$NUM_GPU was created"
```
Even with this script, the user has to run it many times to find, by trial and error, the maximum `num_gpu` value that fits on the GPU.

Could you please add a run-time option (for example `OLLAMA_NUM_GPU=auto`) that automatically determines how many layers, or which combination of layers, fit on the GPU, and offloads as many layers as possible?
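
As a stopgap, the search itself could be automated: a binary search over `num_gpu` needs only a logarithmic number of load attempts. A rough sketch, assuming a failed load makes `ollama run` exit non-zero (which may not hold on every backend):

```
#!/bin/sh
# Hypothetical auto-fit wrapper: binary-search the largest num_gpu
# that loads. ASSUMPTION: a failed load makes `ollama run` return a
# non-zero exit status.

MODEL=$1
HI=$2        # upper bound, e.g. the model's layer count from the logs
LO=0

while [ "$LO" -lt "$HI" ]; do
        MID=$(( (LO + HI + 1) / 2 ))
        ollama show --modelfile "$MODEL" > Modelfile
        echo "PARAMETER num_gpu $MID" >> Modelfile
        ollama create "$MODEL-try" -f Modelfile
        if ollama run "$MODEL-try" "hi" > /dev/null 2>&1; then
                LO=$MID               # loaded: try more layers
        else
                HI=$(( MID - 1 ))     # failed: try fewer layers
        fi
done
echo "largest working num_gpu for $MODEL: $LO"
```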

GiteaMirror added the feature request label 2026-04-28 16:15:32 -05:00

@rick-github commented on GitHub (Aug 18, 2024):

This is how ollama is supposed to work: it calculates how many layers will fit on the GPU and passes that information to the runner. Search for lines in your log that contain `msg="offload to cuda"`.

If you find that this is not working automatically, [server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will aid in debugging the issue.
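
For example, something like this (the log path is a placeholder; on a Linux systemd install, `journalctl -u ollama` shows the same output):

```
# Find the offload decision in the server log.
grep 'msg="offload' /path/to/ollama-server.log
```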


@yurivict commented on GitHub (Aug 18, 2024):

[Here](https://freebsd.org/~yuri/ollama-server-fails-on-Vulkan-GPU.log) is the ollama server log.

It first says:

```
time=2024-08-18T16:14:19.518-07:00 level=INFO source=memory.go:309 msg="offload to vulkan" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[6.2 GiB]" memory.required.full="6.0 GiB" memory.required.partial="6.0 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.0 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.6 GiB" memory.weights.nonrepeating="105.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="585.0 MiB"
```

And then it fails to allocate memory:

```
ggml_vulkan: Device memory allocation of size 1073741824 failed.
ggml_vulkan: vk::Device::allocateMemory: ErrorOutOfDeviceMemory
```

Why does it fail?


@rick-github commented on GitHub (Aug 18, 2024):

That would be a question for the maintainer of the vulkan driver.

It fails allocating the KV area, which is used for the context window. You are currently using the default context size of 2048 tokens, but because you haven't set `OLLAMA_NUM_PARALLEL`, ollama uses a default of 4, which quadruples the size of the context window that llama.cpp allocates (see `--ctx-size` in the logs). You may be able to mitigate some of the memory pressure by explicitly setting `OLLAMA_NUM_PARALLEL=1` in the server environment.
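
With the default of 4 parallel slots, the effective context is 4 × 2048 = 8192 tokens, and the KV cache scales with it. For a server started in the foreground, the setting can be tested like this (a systemd install would use an `Environment=` line in an override file instead):

```
# Run the server with a single parallel slot, shrinking the KV cache
# back to a 2048-token context.
OLLAMA_NUM_PARALLEL=1 ollama serve
```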

Note that some work is being done on optimizing the memory calculations; a few recently filed tickets indicate that ollama is overly aggressive in offloading layers.


@rick-github commented on GitHub (Aug 19, 2024):

I loaded the same model on my test machine (RTX 4070) and the requirements are close to those on your machine:

```
ollama  | time=2024-08-18T23:43:21.391Z level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[11.5 GiB]" memory.required.full="5.9 GiB" memory.required.partial="5.9 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[5.9 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.6 GiB" memory.weights.nonrepeating="105.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="585.0 MiB"
```

I'm not familiar with the vulkan code; perhaps it's over-reporting the available memory, or the actual memory allocation inside the driver has some unaccounted overhead.


@yurivict commented on GitHub (Aug 19, 2024):

The mistral model (4.1 GB) works fine with `OLLAMA_NUM_PARALLEL=1`: 100% in the GPU.

But the larger everythinglm model (7.4 GB) still fails: [here](https://freebsd.org/~yuri/ollama-server-out-of-memory-everythinglm.log) is the log.


@rick-github commented on GitHub (Aug 19, 2024):

```
time=2024-08-18T18:03:37.294-07:00 level=DEBUG source=server.go:101 msg="system memory" total="24.0 GiB" free="937.6 MiB" free_swap="64.0 GiB"
time=2024-08-18T18:03:37.295-07:00 level=INFO source=memory.go:309 msg="offload to vulkan" layers.requested=-1 layers.model=41 layers.offload=25 layers.split="" memory.available="[6.2 GiB]" memory.required.full="9.3 GiB" memory.required.partial="6.1 GiB" memory.required.kv="1.6 GiB" memory.required.allocations="[6.1 GiB]" memory.weights.total="8.2 GiB" memory.weights.repeating="8.1 GiB" memory.weights.nonrepeating="128.2 MiB" memory.graph.full="204.0 MiB" memory.graph.partial="244.1 MiB"
ggml_vulkan: Failed to allocate pinned memory.
ggml_vulkan: vk::Device::allocateMemory: ErrorOutOfDeviceMemory
ggml_vulkan: Device memory allocation of size 1048576000 failed.
ggml_vulkan: vk::Device::allocateMemory: ErrorOutOfDeviceMemory
```

You have multiple memory problems here. The first, `Failed to allocate pinned memory`, means the vulkan driver is unable to allocate memory on the host, likely because the system has only 937.6 MiB free.

Second, the vulkan driver reports 6.2 GiB available, ollama allocates 6.1 GiB, and vulkan still runs out of memory. This may be the situation I mentioned where ollama is too aggressive in offloading layers. Until the calculations are adjusted, you can use `PARAMETER num_gpu 24` to relieve the memory pressure.
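
Reusing the script pattern from the issue description, that would look something like this (the variant name is arbitrary):

```
# Cap offload at 24 layers instead of the computed 25.
ollama show --modelfile everythinglm > Modelfile
echo "PARAMETER num_gpu 24" >> Modelfile
ollama create everythinglm-num_gpu24 -f Modelfile
```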


@dhiltgen commented on GitHub (Sep 5, 2024):

Ollama currently does not support Vulkan as a backend for llama.cpp. Part of enabling that support would involve updating our memory prediction logic to ensure we load the optimal number of layers for the available VRAM on the GPU.

Adding vulkan support is tracked via #2033


Reference: github-starred/ollama#50536