[GH-ISSUE #12010] Feature: (Re)introduce functionality for manually overriding layer splitting and GPU offload decisions #33735

Closed
opened 2026-04-22 16:41:42 -05:00 by GiteaMirror · 22 comments

Originally created by @gordan-bobic on GitHub (Aug 21, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12010

Removal of --tensor-split in 0.11.5 is a MASSIVE leap backward. The allocation split heuristic is beyond terrible. With multiple GPUs and a large model, the heuristic split can leave the context size achievable with full GPU offload as low as 50% of what can otherwise be achieved.

Example:
Model: llama3.2-vision:90b, 101 layers
VRAM: 4x 22GB
layer-split (with 0.11.4): 26,25,25,25
Maximum num_ctx length before 1 layer is moved to CPU: 14094

This is vastly sub-optimal. Overriding the layer split to 24,27,27,23, we can achieve a fully populated context with num_ctx ~29700, without CPU offload and without OOM.
This is not a small difference, it is more than 2x, and with the default split 15-16GB of VRAM remains unused.

Without the layer-split being passed, this is difficult to override.

If you are going to insist on removing the --layer-split parameter, then at the very least make it overridable or configurable in some way.

The difference between 14,000 and 29,000 tokens of usable context on the same hardware is not a small difference, it is the difference between unusable for most tasks and comfortably usable for most tasks.

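For anyone trying to reproduce the numbers above, the largest workable num_ctx can be probed through the standard API, since num_ctx is accepted per-request via the options field. The sketch below is illustrative only: the host, model tag, and search bounds are assumptions, and whether a layer actually spilled to the CPU still has to be confirmed in the server log.

```python
# Minimal sketch: bisect for the largest num_ctx the server will load cleanly.
# Assumes a local ollama server at http://localhost:11434 and the model tag
# below; a request failure (e.g. OOM) is treated as "too big". Whether layers
# were silently moved to CPU must still be checked in the server log.
import json
import urllib.request

HOST = "http://localhost:11434"
MODEL = "llama3.2-vision:90b"  # assumption: the model discussed in this issue

def loads_ok(num_ctx: int) -> bool:
    body = json.dumps({
        "model": MODEL,
        "prompt": "ping",
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }).encode()
    req = urllib.request.Request(f"{HOST}/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=600) as resp:
            resp.read()
        return True
    except Exception as exc:
        print(f"num_ctx={num_ctx}: failed ({exc})")
        return False

lo, hi = 8192, 40960  # assumed search bounds
while hi - lo > 512:
    mid = (lo + hi) // 2
    lo, hi = (mid, hi) if loads_ok(mid) else (lo, mid)
print(f"largest num_ctx that loaded cleanly: ~{lo}")
```
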
GiteaMirror added the feature request and memory labels 2026-04-22 16:41:42 -05:00

@rick-github commented on GitHub (Aug 21, 2025):

0.11.5 [changes](https://github.com/ollama/ollama/pull/10678) how a model is scheduled across multiple GPUs, with a view to decreasing memory overhead and power consumption. [Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging.

@gordan-bobic commented on GitHub (Aug 21, 2025):

The log shows it splitting the exact same way as 0.11.4 for this model: 26,25,25,25. 0.11.5 behaves no differently in terms of the split in this case, except that it takes away my ability to override it using an injected redneck script that gets it to optimal settings.

Is there some other way/place to override it?

@rick-github commented on GitHub (Aug 21, 2025):

> Is there some other way/place to override it?

Unfortunately not currently.

The preferred solution would be to have ollama find the optimal layer allocation so that redneck scripts are not required. Server logs (preferably with OLLAMA_DEBUG=2 to see layer assignments; note that this would also include prompts that might need redacting) would help in debugging.

@gordan-bobic commented on GitHub (Aug 21, 2025):

> The preferred solution would be to have ollama find the optimal layer allocation so that redneck scripts are not required.

I have seen years of pain arise from such assumptions, only for the problem eventually to be fixed by capitulating and providing a proper override method.

> Server logs (preferably with OLLAMA_DEBUG=2 to see layer assignments; note that this would also include prompts that might need redacting) would help in debugging.

Here is what it looks like when left to its own devices. No prompt needed, just set num_ctx to more than about 14100.

```
Aug 21 15:25:54 ai ollama[46387]: time=2025-08-21T15:25:54.881+03:00 level=INFO source=server.go:531 msg=offload library=cuda layers.requested=-1 layers.model=101 layers.offload=100 layers.split="[25 25 25 25]" memory.available="[20.0 GiB 21.3 GiB 21.3 GiB 21.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="82.7 GiB" memory.required.partial="77.4 GiB" memory.required.kv="10.0 GiB" memory.required.allocations="[19.3 GiB 19.3 GiB 19.4 GiB 19.4 GiB]" memory.weights.total="48.5 GiB" memory.weights.repeating="47.7 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="3.8 GiB" memory.graph.partial="3.8 GiB" projector.weights="1.7 GiB" projector.graph="2.8 GiB"
```

But force it with an override to offload all 101 layers and set the split to 24,27,27,23, and it will consistently handle up to a little over num_ctx=29000, even if you actually fill it up with 29000 tokens of lorem ipsum (just ask the model to generate as much of it as possible repeatedly, if you don't want to paste in a few large chunks of it). Without any context payload, it will allocate up to about 33000 num_ctx in VRAM without OOM, but as the context gets full it starts to need enough memory to make CUDA OOM. 29000 (actually up to about 29700) works even with the context full to the brim.

I'm probably going to look into getting a feature implemented at some point for a config file with overrides. If/when that happens, I'm happy to submit a PR if there is interest in incorporating it. Expecting the heuristic to always get it sufficiently perfectly right to never need a user override mechanism is, IMHO, a recipe for long-term disappointment.

@jessegross commented on GitHub (Aug 21, 2025):

I would recommend setting OLLAMA_NEW_ESTIMATES=1. The changes that you are seeing are the result of an overhaul of the memory allocation code, which significantly improves layouts, especially on multi-GPU systems. If it doesn't work well, please post the logs here. It will become the default behavior in the near future.

The communication between the Ollama server and runner is an internal API. It's not something that we can guarantee compatibility on between versions. And to be completely honest and to save you some disappointment, the PR that you are describing with a config file is not likely to be accepted.

@gordan-bobic commented on GitHub (Aug 21, 2025):

With OLLAMA_NEW_ESTIMATES=1 it doesn't actually seem to report how it split the layers, but I can confirm that it offloaded all 101 layers to the GPU. Looking at the memory usage on the GPUs, it is not using the optimal split, because what works with my explicit 24,27,27,23 split without OOM-ing actually OOMs with OLLAMA_NEW_ESTIMATES=1.

That means that OLLAMA_NEW_ESTIMATES=1 introduces instability due to OOM in addition to choosing a sub-optimal layer split. If I had to guess by looking at the memory usage, it decided on 25,27,27,22 which resulted in OOM on device 0 because device 0 in this case also runs the Xorg console, which seems to eat an extra GB of VRAM on that GPU, which ollama probably doesn't account for.
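
For what it's worth, the amount that Xorg actually takes off device 0 can be checked directly and compared against ollama's memory.available figures. A small sketch, assuming the NVIDIA driver tools (nvidia-smi) are installed and on PATH:

```python
# Sketch: report per-GPU used/free VRAM so external consumers such as the Xorg
# console are visible when comparing against ollama's memory.available values.
import subprocess

fields = "index,name,memory.total,memory.used,memory.free"
out = subprocess.run(
    ["nvidia-smi", f"--query-gpu={fields}", "--format=csv,noheader,nounits"],
    check=True, capture_output=True, text=True,
).stdout

for line in out.strip().splitlines():
    idx, name, total, used, free = [f.strip() for f in line.split(",")]
    print(f"GPU {idx} ({name}): {used} MiB used, {free} MiB free of {total} MiB")
```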

@jessegross commented on GitHub (Aug 21, 2025):

Please post the logs, ideally with OLLAMA_DEBUG=1.

@gordan-bobic commented on GitHub (Aug 21, 2025):

Here is the OLLAMA_DEBUG=2 log with OLLAMA_NEW_ESTIMATES=1 set and the resulting OOM. When it OOM-ed on GPU 0, there was memory to spare on the other GPUs.

Input is 166 paragraphs of lorem ipsum generated using this: https://www.lipsum.com/, which totals a little over 30,000 tokens with llama3.2-vision:90b

[OLLAMA_NEW_ESTIMATES.log.gz](https://github.com/user-attachments/files/21924619/OLLAMA_NEW_ESTIMATES.log.gz)

@gordan-bobic commented on GitHub (Aug 21, 2025):

For reference, here is the log from my custom optimised layer split override that works fine without OOM-ing on 0.11.4. I use a wrapper script to override the --n-gpu-layers and --tensor-split parameters when ollama runner is invoked (and that no longer works with 0.11.5).

[ollama-custom.log.gz](https://github.com/user-attachments/files/21925830/ollama-custom.log.gz)
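
For illustration, the wrapper approach mentioned above might look roughly like the shim below: a script placed in front of the real runner binary that rewrites the --n-gpu-layers and --tensor-split arguments before exec-ing it. This is a hypothetical sketch (written in Python purely for illustration; it is not the actual wrapper), the path and forced values are placeholders, and it relies on the internal server-to-runner command line, which is not a stable interface and no longer works this way as of 0.11.5.

```python
#!/usr/bin/env python3
# Hypothetical shim illustrating the wrapper approach: intercept the internal
# `ollama runner` invocation and force --n-gpu-layers / --tensor-split before
# exec-ing the real runner. The runner command line is an internal, unstable
# interface; this reflects the pre-0.11.5 behaviour described in this thread.
import os
import sys

REAL_RUNNER = "/usr/local/bin/ollama-runner.real"  # placeholder: renamed real binary
FORCED_SPLIT = "24,27,27,23"                       # the split that worked here
FORCED_LAYERS = "101"

def rewrite(args):
    out, i = [], 0
    while i < len(args):
        if args[i] == "--tensor-split" and i + 1 < len(args):
            out += ["--tensor-split", FORCED_SPLIT]
            i += 2
        elif args[i] == "--n-gpu-layers" and i + 1 < len(args):
            out += ["--n-gpu-layers", FORCED_LAYERS]
            i += 2
        else:
            out.append(args[i])
            i += 1
    return out

os.execv(REAL_RUNNER, [REAL_RUNNER] + rewrite(sys.argv[1:]))
```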

@jessegross commented on GitHub (Aug 21, 2025):

Thank you for the logs.

OLLAMA_NEW_ESTIMATES logs the layer allocations in a different format, but if we map them back to the traditional way, I see that it is calculating 21,27,26,27. As you point out, X is using extra memory but that is being taken into account. In fact, it is actually more conservative in this regard compared to your manual settings while still offloading all of the layers.

Note that in your custom version, the context length is 4096 whereas on the new estimates version it is 29696. My guess is that if you used the same context length on your version and also filled it with text, you would see the same OOM.

This OOM has become a more prominent issue as we have fixed many of the existing issues using the new memory estimates (see #11753). As a workaround, you should be able to avoid it by setting OLLAMA_FLASH_ATTENTION=1 in addition to OLLAMA_NEW_ESTIMATES=1.

If you are curious, you can see the layouts used by the new estimates in the following log:
```
Aug 21 22:14:21 overmind ollama[203987]: time=2025-08-21T22:14:21.430+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:29696 KvCacheType: NumThreads:24 GPULayers:101[ID:GPU-ea5a8e95-7eb3-0b4b-e92c-688d62f8fe3f Layers:26(0..25) ID:GPU-87e6e1e0-407a-eb0f-ea9e-fd078f436174 Layers:27(26..52) ID:GPU-556dc4f5-3b5e-b765-dbed-6e3c2836ecc7 Layers:27(53..79) ID:GPU-5e21c94b-9945-57c0-78a9-c884f4d29e1a Layers:21(80..100)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
```

You can see the layer counts in addition to the GPU IDs. They may not be in the same order as you previously saw, but you can reorder them by matching the IDs to these log lines:

```
Aug 21 22:14:19 overmind ollama[203987]: ggml_cuda_init: found 4 CUDA devices:
Aug 21 22:14:19 overmind ollama[203987]:   Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes, ID: GPU-5e21c94b-9945-57c0-78a9-c884f4d29e1a
Aug 21 22:14:19 overmind ollama[203987]:   Device 1: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes, ID: GPU-87e6e1e0-407a-eb0f-ea9e-fd078f436174
Aug 21 22:14:19 overmind ollama[203987]:   Device 2: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes, ID: GPU-ea5a8e95-7eb3-0b4b-e92c-688d62f8fe3f
Aug 21 22:14:19 overmind ollama[203987]:   Device 3: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes, ID: GPU-556dc4f5-3b5e-b765-dbed-6e3c2836ecc7
```
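
If it helps, that reordering can be automated: parse the ggml_cuda_init lines for the device-index-to-UUID mapping, then rearrange the per-GPU layer counts from the load request accordingly. A rough sketch, with the regexes written against the exact log excerpts quoted above:

```python
# Sketch: recover a "traditional" per-device layer split from the new-style
# load request by matching GPU UUIDs against the ggml_cuda_init device list.
import re

device_log = """\
Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes, ID: GPU-5e21c94b-9945-57c0-78a9-c884f4d29e1a
Device 1: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes, ID: GPU-87e6e1e0-407a-eb0f-ea9e-fd078f436174
Device 2: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes, ID: GPU-ea5a8e95-7eb3-0b4b-e92c-688d62f8fe3f
Device 3: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes, ID: GPU-556dc4f5-3b5e-b765-dbed-6e3c2836ecc7
"""
load_request = ("ID:GPU-ea5a8e95-7eb3-0b4b-e92c-688d62f8fe3f Layers:26(0..25) "
                "ID:GPU-87e6e1e0-407a-eb0f-ea9e-fd078f436174 Layers:27(26..52) "
                "ID:GPU-556dc4f5-3b5e-b765-dbed-6e3c2836ecc7 Layers:27(53..79) "
                "ID:GPU-5e21c94b-9945-57c0-78a9-c884f4d29e1a Layers:21(80..100)")

# UUID -> CUDA device index, from the ggml_cuda_init lines
index_of = {uuid: int(idx) for idx, uuid in
            re.findall(r"Device (\d+):.*ID: (GPU-[0-9a-f-]+)", device_log)}
# (UUID, layer count) pairs from the load request
layers_of = re.findall(r"ID:(GPU-[0-9a-f-]+) Layers:(\d+)", load_request)

split = [0] * len(index_of)
for uuid, layers in layers_of:
    split[index_of[uuid]] = int(layers)
print("split in CUDA device order:", split)  # -> [21, 27, 26, 27]
```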

@gordan-bobic commented on GitHub (Aug 21, 2025):

> Note that in your custom version, the context length is 4096 whereas on the new estimates version it is 29696.

That was on the 2nd iteration; if you look at the first response in the log, it actually used the 29696 context length. The reason one part has a shorter, 4096-token context is down to how Open-WebUI works: it creates short labels for chats based on the first 4K tokens, but the real response uses the full context. I tested it multiple times, and with my custom split it definitely uses the full ~29K context size and it definitely doesn't OOM on ollama 0.11.4 where I can apply my override.

Thanks for explaining how the layer split is described in the new log, that is very helpful. I don't know, then, why my split doesn't OOM in 0.11.4. It could be that there is another change in 0.11.5 that is causing the OOM. I would expect the OOM with the OLLAMA_NEW_ESTIMATES split to occur on GPU 3 instead.

I can confirm that OLLAMA_FLASH_ATTENTION does seem to reduce GPU memory usage slightly, but this seems to be happening due to the context being reduced to a little over 27,000 tokens, no matter how long the chat history is and how high I set the --ctx-size.

@jessegross commented on GitHub (Aug 21, 2025):

I see the first iteration in your logs with the larger context now.

One more piece of the layer allocations that I didn't mention is that the new estimates also explicitly specify which layers to offload on which GPU so it is no longer a simple in-order mapping. As a result, even though there are fewer layers on GPU 0, some of the layers are larger. Adding up all of the allocations on your version, I see 18G, whereas with the new estimates it is 18.4G. Probably the old default ordering enables a slightly more even packing than the new one but that's mostly just luck as it is very close.

Flash attention should not reduce the maximum context length; where are you seeing that? Does it prevent the crashes when used with the new estimates? It should reduce memory usage by avoiding some of the allocations for intermediate states.
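
As a side note on why the per-GPU headroom matters so much here, the KV cache grows linearly with context length, so a few extra thousand tokens of num_ctx can be the difference between fitting and OOM on the tightest GPU. A back-of-the-envelope sketch; the hyperparameters below are illustrative placeholders, not the real llama3.2-vision:90b values:

```python
# Rough KV-cache size estimate: 2 tensors (K and V) per layer, each of shape
# [n_kv_heads * head_dim, num_ctx], in f16 (2 bytes per element). The numbers
# below are placeholders chosen only to show the linear scaling with num_ctx.
def kv_cache_bytes(num_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * num_ctx * bytes_per_elem

n_layers, n_kv_heads, head_dim = 100, 8, 128  # assumed, for illustration only

for ctx in (14094, 29696):
    gib = kv_cache_bytes(ctx, n_layers, n_kv_heads, head_dim) / 2**30
    print(f"num_ctx={ctx}: ~{gib:.1f} GiB of KV cache (f16)")
```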

@gordan-bobic commented on GitHub (Aug 21, 2025):

> the new estimates also explicitly specify which layers to offload on which GPU so it is no longer a simple in-order mapping.

The logs don't seem to show any out-of-order mapping, though:

```
Layers:26(0..25) ID:GPU-87e6e1e0-407a-eb0f-ea9e-fd078f436174 Layers:27(26..52) ID:GPU-556dc4f5-3b5e-b765-dbed-6e3c2836ecc7 Layers:27(53..79) ID:GPU-5e21c94b-9945-57c0-78a9-c884f4d29e1a Layers:21(80..100)]
```

Unless I am misunderstanding what you meant by that. I would also expect multiple passes back and forth between GPUs in a single cycle to be less efficient than sequential layer splitting.

> Flash attention should not reduce the maximum context length; where are you seeing that?

That's what I thought, but regardless of the amount of data in the context and the set ctx-size, the prompt_tokens/total_tokens as reported by Open-WebUI never exceeded about 27K, and I can see from the ollama runner parameters (in 0.11.4) that ctx-size was set to 37K at the time. If I disable flash attention, it seems to go all the way up to the calibrated maximum I can achieve without OOM on 0.11.4 of about 29696.

But debugging OLLAMA_FLASH_ATTENTION behaviour is probably worthy of a separate ticket.

@jessegross commented on GitHub (Aug 22, 2025):

> > the new estimates also explicitly specify which layers to offload on which GPU so it is no longer a simple in-order mapping.
>
> The logs don't seem to show any out-of-order mapping, though:
>
> Layers:26(0..25) ID:GPU-87e6e1e0-407a-eb0f-ea9e-fd078f436174 Layers:27(26..52) ID:GPU-556dc4f5-3b5e-b765-dbed-6e3c2836ecc7 Layers:27(53..79) ID:GPU-5e21c94b-9945-57c0-78a9-c884f4d29e1a Layers:21(80..100)]

Mapping the IDs to names, you will see this is CUDA2, CUDA1, CUDA3, CUDA0. Previously it was always CUDA0, CUDA1, CUDA2, CUDA3.

> Unless I am misunderstanding what you meant by that. I would also expect multiple passes back and forth between GPUs in a single cycle to be less efficient than sequential layer splitting.

It's still sequential, just a different ordering. We don't consider PCIe topology and maybe the original enumeration order will group things better, though it probably doesn't make a difference for most consumer PCs. The new ordering is mostly an artifact of the allocation system. Using the original order would help your case but that's likely just luck.

@gordan-bobic commented on GitHub (Aug 22, 2025):

Ah, I understand what you were referring to now.

The original point remains, though - there is definitely value in being able to override the --n-gpu-layers and --tensor-split parameters, because heuristics are never infallible.

@morgwai commented on GitHub (Sep 1, 2025):

I have an asymmetric GPU setup: an RTX 3090 (24GB) and a GTX 1080 Ti (11GB), so for models sized 24-30GB I want to put as many layers as possible on the 3090 and only the remaining ones on the 1080 Ti, for obvious performance reasons. By default ollama splits them to obtain roughly the same percentage of VRAM utilization on both cards (for example 18GB,8GB), which is waaay slower than if I specify --n-gpu-layers and --tensor-split manually as described by @gordan-bobic (thanks for your Altechnative article, man!).
Sometimes I need even more elaborate tuning, for example I need a specific amount of VRAM to be left free on a specific card to run some other stuff there.

Several ppl have asked for similar configuration options (for example [10172](https://github.com/ollama/ollama/issues/10172)) or described [how tuning these pre-0.11.5 params can improve performance](https://geekbacon.com/2025/05/03/understanding-vram-usage-in-ollama-with-large-models/) (and I could keep providing more and more links, of course...), but the ollama team keeps insisting that they know better how ppl should run their workloads... Not everyone is Steve Jobs to get away with such an attitude: ppl will just migrate to llama.cpp, vLLM or ExLlamaV2 or whichever engine gives the most flexibility (I'm just investigating it myself ATM).
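
The gap between the default behaviour and the desired one can be sketched numerically: a proportional split spreads layers by VRAM share, while what is wanted here is to fill the faster card first. All numbers below are assumptions for illustration (24 GiB + 11 GiB cards, ~26 GiB of layer weights across 81 layers, 1.5 GiB reserved per card):

```python
# Illustration: proportional vs. "fill the fast card first" layer splits for an
# asymmetric GPU pair. Every number here is an assumption for the example.
vram_gib = [24.0, 11.0]            # RTX 3090, GTX 1080 Ti (assumed free VRAM)
model_gib, n_layers = 26.0, 81     # assumed total layer weights and layer count
reserve_gib = 1.5                  # assumed per-GPU headroom for KV cache/graph
gib_per_layer = model_gib / n_layers

# Proportional split: layers roughly in proportion to each card's VRAM.
proportional = [round(n_layers * v / sum(vram_gib)) for v in vram_gib]

# Greedy split: as many layers as fit on the fast card, remainder on the other.
fit_fast = min(n_layers, int((vram_gib[0] - reserve_gib) / gib_per_layer))
greedy = [fit_fast, n_layers - fit_fast]

print("proportional split:", proportional)  # -> [56, 25]: more work on the slow card
print("greedy split:      ", greedy)        # -> [70, 11]: most layers on the 3090
```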

@gordan-bobic commented on GitHub (Sep 26, 2025):

The latest version (0.12.2) is spectacularly, hilariously bad at figuring out the split.

With a manual split on 0.11.4 I run hermes4:70b with full 128K context on 4x22GB GPUs with full 80/80 layer GPU offload.

With 0.12.2 it offloads only 18/80 layers to the GPU which makes the whole system completely unusable.

Removing options for manually overriding auto-detected settings is NEVER a good idea.

@rick-github commented on GitHub (Sep 26, 2025):

Server logs may aid in debugging.

@gordan-bobic commented on GitHub (Sep 26, 2025):

This ticket isn't about debugging layer splitting heuristics, it is about (re)adding a feature to facilitate manual override.

@jessegross commented on GitHub (Sep 26, 2025):

There was never functionality to allow manual control of the layer assignments, as the interface being manipulated by the scripts described here is internal and not publicly exposed.

Offloading will only get more complicated over time as we optimize memory usage, and we don't want an ever-expanding API, so we don't plan to add more controls than we currently have.

@gordan-bobic commented on GitHub (Sep 26, 2025):

As you have been optimizing memory usage, things seem to have been getting worse rather than better. I'll get a feature for this implemented and pull requested.

@chrisoutwright commented on GitHub (Sep 28, 2025):

I also get completely uneven splits with same-VRAM GPUs, running llama-3.3-nemotron-super-v1.5-q4km:49b (see https://github.com/ollama/ollama/issues/7047#issuecomment-3342197917):

```
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) - 23306 MiB free
...
load_tensors: offloading 80 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 81/81 layers to GPU
load_tensors:        CUDA0 model buffer size = 18993.09 MiB
load_tensors:        CUDA1 model buffer size =  9251.61 MiB
load_tensors:          CPU model buffer size =   563.62 MiB
```

This is using the following launch script (changing the NewEstimate or SchedSpred settings does not help):

```
$envNode2 = @{
    HostAddress     = "0.0.0.0:11434"
    CUDA            = "0,1"
    OllamaPath      = "D:\Ollama\models"
    MaxLoadedModels = "0"
    NumParallel     = "1"
    SchedSpred      = "1"
    FlashAttention  = "1"
    KeepAlive       = "20m"
    NewEstimate     = "0"
    KVCacheType     = "q8_0"
}

#$envNode6,$envNode5,
# Set environment variables for each node and start the ollama serve command
foreach ($env in @($envNode2)) {
    Set-ItemProperty -Path 'HKCU:\Environment' -Name 'OLLAMA_HOST' -Value $env.HostAddress
    Set-ItemProperty -Path 'HKCU:\Environment' -Name 'CUDA_VISIBLE_DEVICES' -Value $env.CUDA
    Set-ItemProperty -Path 'HKCU:\Environment' -Name 'OLLAMA_MODELS' -Value $env.OllamaPath
    Set-ItemProperty -Path 'HKCU:\Environment' -Name 'OLLAMA_MAX_LOADED_MODELS' -Value $env.MaxLoadedModels
    Set-ItemProperty -Path 'HKCU:\Environment' -Name 'OLLAMA_NUM_PARALLEL' -Value $env.NumParallel
    Set-ItemProperty -Path 'HKCU:\Environment' -Name 'OLLAMA_SCHED_SPREAD' -Value $env.SchedSpred
    Set-ItemProperty -Path 'HKCU:\Environment' -Name 'OLLAMA_FLASH_ATTENTION' -Value $env.FlashAttention
    Set-ItemProperty -Path 'HKCU:\Environment' -Name 'OLLAMA_KEEP_ALIVE' -Value $env.KeepAlive

    # Add KV cache quantization setting (8-bit q8_0)
    Set-ItemProperty -Path 'HKCU:\Environment' -Name 'OLLAMA_KV_CACHE_TYPE' -Value $env.KVCacheType
    Set-ItemProperty -Path 'HKCU:\Environment' -Name 'OLLAMA_NEW_ESTIMATES' -Value $env.NewEstimate

    Start-Process powershell -ArgumentList "-Command `"`$env:OLLAMA_HOST='$($env.HostAddress)'; echo `$env:OLLAMA_HOST; `$env:CUDA_VISIBLE_DEVICES='$($env.CUDA)'; echo `$env:CUDA_VISIBLE_DEVICES; `$env:OLLAMA_MODELS='$($env.OllamaPath)'; echo `$env:OLLAMA_MODELS; `$env:OLLAMA_SCHED_SPREAD='$($env.SchedSpred)'; echo `$env:OLLAMA_SCHED_SPREAD; `$env:OLLAMA_FLASH_ATTENTION='$($env.FlashAttention)'; echo `$env:OLLAMA_FLASH_ATTENTION; `$env:OLLAMA_KEEP_ALIVE='$($env.KeepAlive)'; echo `$env:OLLAMA_KEEP_ALIVE; `$env:OLLAMA_KV_CACHE_TYPE='$($env.KVCacheType)'; echo `$env:OLLAMA_KV_CACHE_TYPE; `$env:OLLAMA_NEW_ESTIMATES='$($env.NewEstimate)'; echo `$env:OLLAMA_NEW_ESTIMATES; ollama serve; Read-Host 'Press any key to close the instance.'`"" -WindowStyle Normal -Verb RunAs
}
```