[PR #2190] [CLOSED] Implement split_mode and tensor_split support in modelfiles #10815

Closed
opened 2026-04-12 23:11:41 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/2190
Author: @jukofyork
Created: 1/25/2024
Status: Closed

Base: main ← Head: main


📝 Commits (10+)

  • 1e14d5c Update types.go
  • 46db6c4 Update api.md
  • ee644b2 Update modelfile.md
  • c4d8f7a Update dyn_ext_server.go
  • 311ffda Update types.go
  • 5042b82 Update ext_server.h
  • b9cfce9 Update ext_server.cpp
  • ab55ae1 Update dyn_ext_server.go
  • a03342f Update api.md
  • f31b657 Update types.go

📊 Changes

6 files changed (+74 additions, -2 deletions)

View changed files

📝 api/types.go (+5 -0)
📝 docs/api.md (+2 -0)
📝 docs/modelfile.md (+3 -0)
📝 llm/dyn_ext_server.go (+6 -0)
📝 llm/ext_server/ext_server.cpp (+53 -1)
📝 llm/ext_server/ext_server.h (+5 -1)

📄 Description

This adds support for the new `split_mode` option in `llama.cpp::server`.

It has three possible values, described in `llama.cpp::server --help` as:

How to split the model across multiple GPUs, one of:

  • "layer": split layers and KV across GPUs (default).
  • "row": split rows across GPUs.
  • "none": use one GPU only.

It also changes the meaning of the `main_gpu` parameter:

The GPU to use for the model (with `split_mode = "none"`) or for intermediate results and KV (with `split_mode = "row"`).
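
For illustration, these options might be set in a Modelfile roughly like this (a sketch only: the parameter value formats, in particular the comma-separated `tensor_split` weights, are assumptions based on llama.cpp's conventions rather than taken verbatim from this PR's `docs/modelfile.md` changes):

```
FROM llama2:70b

# Assumed value: split the model by rows across the available GPUs.
PARAMETER split_mode row

# GPU used for intermediate results and KV when split_mode is "row".
PARAMETER main_gpu 0

# Assumed format: proportional VRAM split per GPU, mirroring llama.cpp's
# --tensor-split convention (here an even split across two GPUs).
PARAMETER tensor_split 1,1
```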

I've found experimentally (using `nvidia-smi` to watch the NVLink bus) that setting `main_gpu = 0` (rather than leaving it at the default) also seems to affect the "layer" option, even though the `--help` output doesn't say so.

The new default of `split_mode = "layer"` runs MUCH worse for me: I only get around 60% of the tokens/s that I get with `split_mode = "row"` (using 2x RTX A6000 and an NVLink bridge).

The only difference I can see is that `split_mode = "layer"` seems to allocate the VRAM much more evenly (NOTE: this may also affect the new code somebody is writing in `llm.go` for the `num_gpu = -1` calculation!).

I've also got `tensor_split` working again (the https://github.com/ollama/ollama/pull/1256 pull request no longer works because parameters are now passed directly to the wrapped server rather than as `--` command-line options).

I've left the `split_mode` and `tensor_split` parameters to be read as strings and passed through the code-base without any error checking, which is in line with users being allowed to set bad/invalid `num_gpu` options, etc.
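
For reference, the change implied in `llm/ext_server/ext_server.h` might look roughly like the sketch below (hypothetical struct and field names; the actual declarations in the PR may differ):

```cpp
// Hypothetical sketch: two extra string fields carried through unchecked to
// the wrapped server, alongside the existing parameters.
typedef struct ext_server_params {
  // ... existing fields such as n_ctx, n_batch, n_gpu_layers, main_gpu ...
  char *split_mode;   // raw string: "layer", "row" or "none"
  char *tensor_split; // raw string, e.g. "1,1"
} ext_server_params_t;
```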

I've tested the code as best I can on 2x RTX A6000 with an NVLink bridge; all three `split_mode` options appear to work as intended, as does `tensor_split`, but I can't guarantee these changes will work for others with different numbers of GPUs, etc.


I have lifted the parsing code from `llama.cpp::server::server_params_parse()` with these two additions:

  • Silently treat invalid values of `split_mode` as the default of `split_mode = "layer"`.
  • Silently catch any exceptions thrown by `std::stof` (i.e. when parsing invalid `tensor_split` values) and substitute `0.0f`.

This seemed the most sensible option to me, as we get no feedback from the Ollama server the way we do from `llama.cpp::server` when passing invalid command-line options, but feel free to add error checking earlier in the chain if needed.
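
A minimal sketch of the fallback behaviour described above (illustrative only, not the exact code lifted into the PR):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Unrecognised split_mode values silently fall back to the "layer" default.
static std::string normalize_split_mode(const std::string &value) {
  if (value == "none" || value == "row" || value == "layer") {
    return value;
  }
  return "layer";
}

// Parse a comma-separated tensor_split string; entries that std::stof cannot
// parse are silently replaced with 0.0f instead of producing an error.
static std::vector<float> parse_tensor_split(const std::string &value) {
  std::vector<float> split;
  std::stringstream ss(value);
  std::string item;
  while (std::getline(ss, item, ',')) {
    try {
      split.push_back(std::stof(item));
    } catch (const std::exception &) {
      split.push_back(0.0f);
    }
  }
  return split;
}
```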


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.


Reference: github-starred/ollama#10815