[GH-ISSUE #2191] If you have multiple GPUs, the new default split_mode = "layer" option in the wrapped llama.cpp server may affect you a lot! #63290

Closed
opened 2026-05-03 12:52:48 -05:00 by GiteaMirror · 0 comments
Owner

Originally created by @jukofyork on GitHub (Jan 25, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2191

I'm not sure why the `llama.cpp` devs have made the new default `split_mode = "layer"`, but it runs ***MUCH*** worse for me: I only get around 60% of the tokens/s that I get with `split_mode = "row"` (using 2x RTX A6000 and an NVLink bridge).

The only difference I can see is that `split_mode = "layer"` allocates the VRAM much more evenly across the two cards.

I'm also seeing no traffic over the NVLink bridge now unless I set `main_gpu = 0`, but that has no real effect on the poor performance of `split_mode = "layer"`.
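For reference, the two configurations can be compared against the upstream `llama.cpp` server directly. This is a sketch: the flag names below match llama.cpp's `--split-mode` / `--main-gpu` options as of early 2024, and the model path and layer count are placeholders.

```shell
# Compare the two multi-GPU split modes with the upstream llama.cpp server.
# --split-mode (-sm): layer = distribute whole layers across GPUs (new default)
#                     row   = split tensors row-wise across GPUs (old behaviour)
# --main-gpu (-mg):   GPU that holds small tensors and intermediate results

# New default (slower on this 2x RTX A6000 + NVLink setup):
./server -m /path/to/model.gguf --n-gpu-layers 99 --split-mode layer

# Old behaviour (roughly 1.6x the tokens/s here):
./server -m /path/to/model.gguf --n-gpu-layers 99 --split-mode row --main-gpu 0
```

Since Ollama wraps the llama.cpp server, the fix referenced below restores the equivalent of the `row` configuration there.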

See https://github.com/ollama/ollama/pull/2190 for more details and the fix.

GiteaMirror added the performance and nvidia labels 2026-05-03 12:52:50 -05:00