Mirror of https://github.com/ollama/ollama.git (synced 2026-05-07 00:22:43 -05:00)
[GH-ISSUE #12010] Feature: (Re)introduce functionality for manually overriding layer splitting and GPU offload decisions #70034
Closed · opened 2026-05-04 20:06:34 -05:00 by GiteaMirror · 22 comments
Originally created by @gordan-bobic on GitHub (Aug 21, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12010
Removal of `--tensor-split` in 0.11.5 is a MASSIVE leap backward. The allocation split calibration heuristic is beyond terrible. With multiple GPUs and a large model, the heuristic split results in a full-GPU-offload context size as low as 50% of what can otherwise be achieved.
Example:
Model: llama3.2-vision:90b, 101 layers
VRAM: 4x 22GB
layer-split (with 0.11.4): 26,25,25,25
Maximum num_ctx length before 1 layer is moved to CPU: 14094
This is vastly sub-optimal. Overriding the layer split to `24,27,27,23`, we can achieve a fully populated context with num_ctx ~29700, without CPU offload and without OOM. This is not a small difference; it is more than 2x, and with the default split 15-16GB of VRAM remains unused.
Without the layer split being passed, this is difficult to override.
If you are going to insist on removing the `--tensor-split` parameter, then at the very least make it configurable in some way.
The difference between 14,000 and 29,000 tokens of usable context on the same hardware is not small; it is the difference between unusable for most tasks and comfortably usable for most tasks.
@rick-github commented on GitHub (Aug 21, 2025):
0.11.5 changes how a model is scheduled across multiple GPUs with a view to decreasing memory overhead and power consumption. Server logs may aid in debugging.
@gordan-bobic commented on GitHub (Aug 21, 2025):
The log shows it splitting exactly the same way as 0.11.4 for this model: 26,25,25,25. 0.11.5 behaves no differently in terms of split in this case, except it takes away my ability to override it using an injected redneck script that gets it to optimal settings.
Is there some other way/place to override it?
@rick-github commented on GitHub (Aug 21, 2025):
Unfortunately not currently.
The preferred solution would be to have ollama find the optimal layer allocation so that redneck scripts are not required. Server logs (preferably with `OLLAMA_DEBUG=2` to see layer assignments; note this would also include prompts that might need redacting) would help in debugging.
@gordan-bobic commented on GitHub (Aug 21, 2025):
I have seen years of pain arise from such assumptions, only for the problem eventually to be fixed by capitulating and providing a proper override method.
Here is what it looks like when left to its own devices. No prompt needed; just set num_ctx to more than about 14100.
But force it with an override to offload all 101 layers and set the split to 24,27,27,23, and it will consistently handle up to a little over num_ctx=29000, even if you actually fill it up with 29000 tokens of lorem ipsum (just ask the model to generate as much of it as possible repeatedly, if you don't want to paste in a few large chunks of it). Without any context payload it will allocate up to about 33000 num_ctx in VRAM without OOM, but as the context fills up it needs enough extra memory to make CUDA OOM. 29000 (actually up to about 29700) works even with the context full to the brim.
I'm probably going to look into getting a feature implemented at some point for a config file with overrides. If/when that happens, I'm happy to submit a PR if there is interest in incorporating it. Expecting the heuristic to always get it right, so that a user override mechanism is never needed, is IMHO a recipe for long-term disappointment.
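[Editor's note] For concreteness, a minimal sketch of the probe described above, assuming a local server on the default port. `/api/generate` and the `num_ctx` option are part of the public Ollama API; the model name and context value are the ones quoted in this thread:

```bash
#!/usr/bin/env bash
# Fill the context with filler text at a fixed num_ctx and watch the
# server log for CUDA OOM. Requires curl and jq.
FILLER=$(yes "lorem ipsum dolor sit amet" | head -n 4000 | tr '\n' ' ')
jq -n --arg p "$FILLER" \
  '{model: "llama3.2-vision:90b", prompt: $p, stream: false,
    options: {num_ctx: 29696}}' \
  | curl -s http://localhost:11434/api/generate -d @- \
  | jq '{done, prompt_eval_count, eval_count}'
```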
@jessegross commented on GitHub (Aug 21, 2025):
I would recommend setting OLLAMA_NEW_ESTIMATES=1. The changes that you are seeing are the result of an overhaul of the memory allocation code, which significantly improves layouts, especially on multi-GPU systems. If it doesn't work well, please post the logs here. It will become the default behavior in the near future.
The communication between the Ollama server and runner is an internal API. It's not something that we can guarantee compatibility on between versions. And to be completely honest and to save you some disappointment, the PR that you are describing with a config file is not likely to be accepted.
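[Editor's note] A quick way to try this, as a sketch (the systemd drop-in mechanism is standard; the variable names are the ones suggested in this thread):

```bash
# Foreground, for a quick test:
OLLAMA_NEW_ESTIMATES=1 OLLAMA_DEBUG=2 ollama serve

# Or, if ollama runs as a systemd service, add a drop-in:
sudo systemctl edit ollama
#   [Service]
#   Environment="OLLAMA_NEW_ESTIMATES=1"
#   Environment="OLLAMA_DEBUG=2"
sudo systemctl restart ollama
```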
@gordan-bobic commented on GitHub (Aug 21, 2025):
With `OLLAMA_NEW_ESTIMATES=1` it doesn't actually seem to report how it split the layers, but I can confirm that it offloaded all 101 layers to the GPU. Looking at the memory usage on the GPUs, it is not using the optimal split, because what works with my explicit `24,27,27,23` split without OOM-ing actually OOMs with `OLLAMA_NEW_ESTIMATES=1`.
That means that `OLLAMA_NEW_ESTIMATES=1` introduces instability due to OOM, in addition to choosing a sub-optimal layer split. If I had to guess from the memory usage, it decided on 25,27,27,22, which resulted in OOM on device 0 because device 0 in this case also runs the Xorg console, which seems to eat an extra GB of VRAM on that GPU, and ollama probably doesn't account for that.
@jessegross commented on GitHub (Aug 21, 2025):
Please post the logs, ideally with OLLAMA_DEBUG=1
@gordan-bobic commented on GitHub (Aug 21, 2025):
Here is the OLLAMA_DEBUG=2 log with OLLAMA_NEW_ESTIMATES=1 set and the resulting OOM. When it OOM-ed on GPU 0, there was memory to spare on the other GPUs.
Input is 166 paragraphs of lorem ipsum generated using this: https://www.lipsum.com/, which totals a little over 30,000 tokens with `llama3.2-vision:90b`.
OLLAMA_NEW_ESTIMATES.log.gz
@gordan-bobic commented on GitHub (Aug 21, 2025):
For reference, here is the log from my custom optimised layer split override that works fine without OOM-ing on 0.11.4. I use a wrapper script to override the `--n-gpu-layers` and `--tensor-split` parameters when `ollama runner` is invoked (and that no longer works with 0.11.5); a sketch of the approach follows below.
ollama-custom.log.gz
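[Editor's note] For readers unfamiliar with the trick being described: a hypothetical sketch of such a wrapper, assuming the real binary has been moved aside to `ollama.real` and this script installed in its place. The flag names and split values are the ones quoted in this thread; the runner CLI is internal, which is exactly why the approach broke in 0.11.5:

```bash
#!/usr/bin/env bash
# Stand-in for the ollama binary: rewrite the internal runner flags on
# the way through, then exec the real binary with everything else intact.
REAL=/usr/local/bin/ollama.real
args=()
while (($#)); do
  case "$1" in
    --tensor-split)  args+=(--tensor-split 24,27,27,23); shift 2 || break ;;
    --n-gpu-layers)  args+=(--n-gpu-layers 101);         shift 2 || break ;;
    *)               args+=("$1");                       shift ;;
  esac
done
exec "$REAL" "${args[@]}"
```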
@jessegross commented on GitHub (Aug 21, 2025):
Thank you for the logs.
OLLAMA_NEW_ESTIMATES logs the layer allocations in a different format but if we map them back to the traditional way, I see that it is calculating 21,27,26,27. As you point out, X is using extra memory but that is being taken into account. In fact, it is actually more conservative in this regard compared to your manual settings while still offloading all of the layers.
Note that in your custom version, the context length is 4096 whereas on the new estimates version it is 29696. My guess is that if you used the same context length on your version and also filled it with text, you would see the same OOM.
This OOM has become a more prominent issue as we have fixed many of the existing issues using the new memory estimates (see #11753). As a workaround, you should be able to avoid it by setting OLLAMA_FLASH_ATTENTION=1 in addition to OLLAMA_NEW_ESTIMATES=1.
If you are curious, you can see the layout used by the new estimates in the following log line:
Aug 21 22:14:21 overmind ollama[203987]: time=2025-08-21T22:14:21.430+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:29696 KvCacheType: NumThreads:24 GPULayers:101[ID:GPU-ea5a8e95-7eb3-0b4b-e92c-688d62f8fe3f Layers:26(0..25) ID:GPU-87e6e1e0-407a-eb0f-ea9e-fd078f436174 Layers:27(26..52) ID:GPU-556dc4f5-3b5e-b765-dbed-6e3c2836ecc7 Layers:27(53..79) ID:GPU-5e21c94b-9945-57c0-78a9-c884f4d29e1a Layers:21(80..100)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
You can see the layer counts in addition to the GPU IDs. They may not be in the same order as you previously saw, but you can reorder them by matching the IDs to these log lines:
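[Editor's note] To match the `GPU-…` UUIDs in that log line back to device indices, standard nvidia-smi query flags are enough (not ollama-specific):

```bash
# Print index, UUID and name for each GPU so the IDs in the runner log
# can be mapped to CUDA0..CUDA3.
nvidia-smi --query-gpu=index,uuid,name --format=csv
```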
@gordan-bobic commented on GitHub (Aug 21, 2025):
That was on the 2nd iteration; if you look at the first response in the log, it actually used the 29696 context length. The reason one part has a shorter, 4096-token context is to do with how Open-WebUI works: it creates short labels for chats based on the first 4K tokens, but the real response uses the full context. I tested it multiple times, and with my custom split it definitely uses the full ~29K context size, and it definitely doesn't OOM on ollama 0.11.4 where I can apply my override.
Thanks for explaining how the layer split is described in the new log, that is very helpful. I don't know, then, why my split doesn't OOM in 0.11.4. It could be that there is another change in 0.11.5 that is causing the OOM. I would expect the OOM with the OLLAMA_NEW_ESTIMATES split to occur on GPU 3 instead.
I can confirm that OLLAMA_FLASH_ATTENTION does seem to reduce GPU memory usage slightly, but this seems to be happening because the context is reduced to a little over 27,000 tokens, no matter how long the chat history is and how high I set `--ctx-size`.
@jessegross commented on GitHub (Aug 21, 2025):
I see the first iteration in your logs with the larger context now.
One more piece of the layer allocations that I didn't mention is that the new estimates also explicitly specify which layers to offload on which GPU so it is no longer a simple in-order mapping. As a result, even though there are fewer layers on GPU 0, some of the layers are larger. Adding up all of the allocations on your version, I see 18G, whereas with the new estimates it is 18.4G. Probably the old default ordering enables a slightly more even packing than the new one but that's mostly just luck as it is very close.
Flash attention should not reduce the maximum context length, where are you seeing that? Does it prevent the crashes when used with the new estimates? It should reduce memory usage by avoiding some of the allocations for intermediate states.
@gordan-bobic commented on GitHub (Aug 21, 2025):
The logs don't seem to show any out-of-order mapping, though.
Unless I am misunderstanding what you meant by that. I would also expect that multiple passes back and forth between GPUs in a single cycle would be less efficient than sequential layer splitting.
That's what I thought, but regardless of the amount of data in the context and the set ctx-size, the prompt_tokens/total_tokens reported by Open-WebUI never exceeded about 27K, and I can see from the ollama runner parameters (in 0.11.4) that ctx-size was set to 37K at the time. If I disable flash attention, it goes all the way up to the calibrated maximum I can achieve without OOM on 0.11.4, about 29696.
But debugging OLLAMA_FLASH_ATTENTION behaviour is probably worthy of a separate ticket.
@jessegross commented on GitHub (Aug 22, 2025):
Mapping the IDs to names you will see this is CUDA2, CUDA1, CUDA3, CUDA0. Previously it was always CUDA0, CUDA1, CUDA2, CUDA3.
It's still sequential, just a different ordering. We don't consider PCIe topology and maybe the original enumeration order will group things better, though it probably doesn't make a difference for most consumer PCs. The new ordering is mostly an artifact of the allocation system. Using the original order would help your case but that's likely just luck.
@gordan-bobic commented on GitHub (Aug 22, 2025):
Ah, I understand what you were referring to now.
The original point remains, though: there is definitely value in being able to override the `--n-gpu-layers` and `--tensor-split` parameters, because heuristics are never infallible.
@morgwai commented on GitHub (Sep 1, 2025):
I have an asymmetric GPU setup: an RTX 3090 24GB and a GTX 1080 Ti 11GB, so for models sized 24-30GB I want to put as many layers as possible on the 3090 and only the remaining ones on the 1080 Ti, for obvious performance reasons. By default ollama splits them to obtain roughly the same percentage of VRAM utilization on both cards (for example 18GB/8GB), which is waaay slower than if I specify `--n-gpu-layers` and `--tensor-split` manually as described by @gordan-bobic (thanks for your Altechnative article, man!).
Sometimes I need even more elaborate tuning; for example, I need a specific amount of VRAM left free on a specific card to run some other stuff there.
Several people have asked for similar configuration options (for example #10172) or described how tuning these pre-0.11.5 params can improve performance (and I could keep providing more and more links, of course...), but the ollama team keeps insisting that they know better how people should run their workloads... Not everyone is Steve Jobs to get away with such an attitude: people will just migrate to llama.cpp, vLLM, ExLlamaV2, or whichever engine gives the most flexibility (I'm just investigating this myself ATM).
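[Editor's note] For comparison, llama.cpp's server does expose these knobs directly; a hedged example with a hypothetical model path and illustrative values (the flags are current `llama-server` options, and `--tensor-split` takes per-device proportions, here weighted toward the larger card):

```bash
# Offload up to 99 layers with an explicit per-device split and context size.
llama-server -m ./model.gguf \
  --n-gpu-layers 99 \
  --tensor-split 24,11 \
  --ctx-size 29696
```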
@gordan-bobic commented on GitHub (Sep 26, 2025):
The latest version (0.12.2) is spectacularly, hilariously bad at figuring out the split.
With a manual split on 0.11.4 I run hermes4:70b with full 128K context on 4x22GB GPUs with full 80/80 layer GPU offload.
With 0.12.2 it offloads only 18/80 layers to the GPU which makes the whole system completely unusable.
Removing options for manually overriding auto-detected settings is NEVER a good idea.
@rick-github commented on GitHub (Sep 26, 2025):
Server logs may aid in debugging.
@gordan-bobic commented on GitHub (Sep 26, 2025):
This ticket isn't about debugging layer-splitting heuristics; it is about (re)adding a feature to facilitate manual override.
@jessegross commented on GitHub (Sep 26, 2025):
There was never functionality to allow manual control of the layer assignments; the interface being manipulated by the scripts described here is internal and not publicly exposed.
Offloading will only get more complicated over time as we optimize memory usage, and we don't want an ever-expanding API, so we don't plan to add more controls than we currently have.
@gordan-bobic commented on GitHub (Sep 26, 2025):
As you have been optimizing memory usage, things seem to have been getting worse rather than better. I'll get a feature for this implemented and pull requested.
@chrisoutwright commented on GitHub (Sep 28, 2025):
I also get completely uneven splits with same-VRAM GPUs: llama-3.3-nemotron-super-v1.5-q4km:49b
https://github.com/ollama/ollama/issues/7047#issuecomment-3342197917
using: (new estimates or SchedSpread changes do not help)