[PR #7274] [CLOSED] Add Environment Variable For Row Split and No KV Offload #12371

Closed
opened 2026-04-12 23:57:18 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/7274
Author: @heislera763
Created: 10/20/2024
Status: Closed

Base: main ← Head: main


📝 Commits (2)

  • f0a97fc Implemented split-mode row and nkvo
  • 5013279 Merge branch 'ollama:main' into main

📊 Changes

4 files changed (+21 additions, -0 deletions)

View changed files

📝 cmd/cmd.go (+2 -0)
📝 envconfig/config.go (+6 -0)
📝 llm/ext_server/server.cpp (+5 -0)
📝 llm/server.go (+8 -0)

📄 Description

This is https://github.com/ollama/ollama/pull/5527 (add "--split-mode row" parameter) but rebased and cleaned up. I've also added the "--no-kv-offload" parameter, which was discussed as a workaround to all KV cache being placed on the first GPU when using split rows. These parameters are activated with the new environment variables OLLAMA_SPLIT_MODE_ROW and OLLAMA_NO_KV_OFFLOAD.
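Assuming this PR is applied, enabling the two behaviors would look something like the following before starting the server (variable names are taken from the description above; the `1` value is an assumption about how the boolean is parsed):

```shell
# Enable row split-mode across GPUs and keep the KV cache in system RAM.
# These variables only take effect with this PR applied.
export OLLAMA_SPLIT_MODE_ROW=1
export OLLAMA_NO_KV_OFFLOAD=1

# Then start the server in this environment, e.g.:
#   ollama serve
```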

One of the known caveats of --no-kv-offload is that it will reduce performance (seemingly in proportion to prompt length). However, something odd I noticed in my testing was that performance scales inversely with the "num_thread" parameter, so performance was best with just 1 thread. It's not clear why, especially since a local llama.cpp instance I tested against didn't show this behavior. Another issue is that GPU allocation appears to be incorrect at high context lengths. I think what's happening is that memory.go tries to determine the optimal GPU layer count, but its formula doesn't account for the fact that the KV cache won't use any GPU memory. I'm able to work around this by manually setting "num_gpu" to a higher value.

I'm not sure how to go about addressing either of these issues, but I think it's worth adding these environment variables in their current state, since they can be very beneficial to multi-GPU users, especially row split. If the rough edges introduced by no-kv-offload are too much, I could cut it from this PR and instead pursue the idea in https://github.com/ollama/ollama/pull/5527#issuecomment-2330289372, which seems like it could work, although I can't say I'm familiar with how simple or complex memory.go is.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-12 23:57:18 -05:00
Reference: github-starred/ollama#12371