[PR #14804] llama: bypass CPU host fallback for tied embeddings on unified memory (Gemma3) #20117

Open
opened 2026-04-16 07:27:08 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/14804
Author: @caffeinatedbits
Created: 3/12/2026
Status: 🔄 Open

Base: main ← Head: pr/fix-gemma3-tied-embeddings


📝 Commits (1)

  • 80f012a ggml: force tied embeddings to VRAM for unified memory architectures

📊 Changes

3 files changed (+38 additions, -4 deletions)


📝 llama/llama.cpp/src/llama-model.cpp (+4 -0)
➕ llama/patches/0036-llama-bypass-cpu-host-fallback-for-tied-embeddings-o.patch (+25 -0)
📝 ml/backend/ggml/ggml.go (+9 -4)

📄 Description

Models with tied embeddings (e.g., Gemma-3) reuse the token_embd.weight tensor as the weight matrix of the final output (classification) layer.
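
To make the sharing concrete, here is a minimal sketch of tied embeddings in plain Go. The `tiedModel` type and its methods are hypothetical illustrations, not Ollama or ggml code; the point is only that a single matrix serves both the input lookup and the output projection, so wherever that tensor is placed is also where the logit computation has to read it from.

```go
package main

import "fmt"

// tiedModel is a hypothetical toy model: one matrix stands in for
// token_embd.weight and is used on both the input and the output side.
type tiedModel struct {
	embd [][]float32 // shape: vocab x hidden
}

// embed returns the hidden vector for a token ID (input side).
func (m *tiedModel) embed(token int) []float32 {
	return m.embd[token]
}

// logits projects a hidden state onto the vocabulary by taking the dot
// product with every row of the same matrix (output side).
func (m *tiedModel) logits(hidden []float32) []float32 {
	out := make([]float32, len(m.embd))
	for v, row := range m.embd {
		var sum float32
		for i, w := range row {
			sum += w * hidden[i]
		}
		out[v] = sum
	}
	return out
}

func main() {
	m := &tiedModel{embd: [][]float32{{1, 0}, {0, 1}, {1, 1}}}
	h := m.embed(2)          // hidden vector for token 2
	fmt.Println(m.logits(h)) // prints [1 1 2]
}
```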

Currently, both the Go frontend scheduler and the vendored ggml backend contain legacy routing logic designed to save VRAM on discrete GPUs. This logic forcefully intercepts these shared output tensors and isolates them in CPU memory (CUDA_Host or unpinned system RAM).

On Unified Memory APUs, this hardware-agnostic fallback creates a catastrophic bottleneck. It forces the final, massive logit multiplication (e.g., 5376 x 262208) onto a single CPU thread, crippling inference speeds and causing the GPU ALUs to idle in a spinlock.

Changes:

  • ml/backend/ggml/ggml.go: Modified the layer routing switch to intercept token_embd.weight and explicitly map it to the GPU allocation bucket (output.bts) rather than the CPU fallback bucket (input.bts, -1).
  • llama/llama.cpp/src/llama-model.cpp: Injected an absolute override during buffer assignment to neutralize the Go frontend's regex memory estimator, forcefully stapling LLM_TENSOR_TOKEN_EMBD and LLM_TENSOR_OUTPUT to the primary GPU compute layer.
  • Generated llama/patches/0036-... using make -f Makefile.sync.
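
The routing change can be sketched roughly as follows. This is not the code from ml/backend/ggml/ggml.go: the `bucket` type, the `route` function, the `tiedOnGPU` flag, and the tensor-name prefixes are all assumptions made for illustration. It only shows the shape of the decision described above, where token_embd.weight moves from the host-side input bucket to the GPU-side output bucket.

```go
package main

import (
	"fmt"
	"strings"
)

// bucket is a hypothetical stand-in for the buffer-type lists that the
// description refers to as input.bts (host side) and output.bts (GPU side).
type bucket string

const (
	cpuInput  bucket = "input (CPU/host)"
	gpuOutput bucket = "output (GPU)"
	gpuLayer  bucket = "layer (GPU)"
)

// route picks a bucket for a tensor by name. With tiedOnGPU false it models
// the previous behaviour (token_embd.weight stays on the host); with
// tiedOnGPU true it models the behaviour this PR describes.
func route(name string, tiedOnGPU bool) bucket {
	switch {
	case name == "token_embd.weight":
		if tiedOnGPU {
			return gpuOutput
		}
		return cpuInput
	case strings.HasPrefix(name, "output"):
		return gpuOutput
	case strings.HasPrefix(name, "blk."):
		return gpuLayer
	default:
		return cpuInput
	}
}

func main() {
	for _, tied := range []bool{false, true} {
		fmt.Printf("tiedOnGPU=%v: token_embd.weight -> %s\n",
			tied, route("token_embd.weight", tied))
	}
}
```

The flag is there only so that both behaviours can be compared side by side; the description does not say whether the actual change is conditional.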

Results:

Tested on an NVIDIA DGX Spark (Grace-Blackwell GB10 APU, Compute 12.1, 128 GB Unified Memory) running both quantized (q8_0) and unquantized (bf16) Gemma-3 27B models.

  • 100% of the CPU fallback routing is eliminated.
  • The 2.6GB tied embedding tensor is now allocated seamlessly alongside the transformer blocks directly in CUDA0 VRAM.
  • Single-thread CPU pinning during generation is entirely resolved, allowing the GPU to fully execute the final classification layer natively on the silicon.
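
As a rough sanity check on the tensor size, the arithmetic can be worked through from the 5376 x 262208 shape quoted in the description. The sketch below assumes 2 bytes per element for bf16 and ggml's q8_0 block layout of 34 bytes per 32 values (32 int8 values plus one fp16 scale); the helper names are hypothetical. Under those assumptions the bf16 tensor comes to about 2.6 GiB, which appears to match the figure quoted above, while the q8_0 tensor would be smaller.

```go
package main

import "fmt"

const (
	hidden = 5376   // hidden size, taken from the description
	vocab  = 262208 // vocabulary size, taken from the description
)

// elements returns the number of weights in the tied embedding matrix.
func elements() int64 { return int64(hidden) * int64(vocab) }

// bf16Bytes assumes 2 bytes per element.
func bf16Bytes() int64 { return elements() * 2 }

// q8Bytes assumes ggml's q8_0 layout: blocks of 32 values stored as
// 32 int8 values plus one fp16 scale, i.e. 34 bytes per block.
func q8Bytes() int64 { return elements() / 32 * 34 }

// gib converts a byte count to GiB.
func gib(b int64) float64 { return float64(b) / (1 << 30) }

func main() {
	fmt.Printf("elements: %d\n", elements())
	fmt.Printf("bf16:     %.2f GiB\n", gib(bf16Bytes()))
	fmt.Printf("q8_0:     %.2f GiB\n", gib(q8Bytes()))
}
```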

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-16 07:27:08 -05:00
Reference: github-starred/ollama#20117