[PR #11525] [MERGED] Increase performance for Gemma3n models on NVGPUs by enabling CUDA Graph execution #12304

Closed
opened 2025-11-12 16:33:02 -06:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/11525
Author: @ORippler
Created: 7/25/2025
Status: Merged
Merged: 7/29/2025
Merged by: @mxyng

Base: main ← Head: osimons/port-gemma3n-cg-to-ollama


📝 Commits (2)

  • a86286d Enable CUDA Graphs for gemma3n.
  • c4de3ea Remove residual check by reshaping differently in gemma3n model

📊 Changes

5 files changed (+67 additions, -10 deletions)

View changed files

📝 llama/patches/0019-metal-add-mean-kernel-14267.patch (+1 -1)
📝 llama/patches/0020-CUDA-add-mean-operation-14313.patch (+1 -1)
➕ llama/patches/0021-Enable-CUDA-Graphs-for-gemma3n.patch (+50 -0)
📝 ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu (+12 -4)
📝 model/models/gemma3n/model_text.go (+3 -4)

📄 Description

This PR enables the execution of Gemma3n as CUDA Graphs on NVGPUs by porting https://github.com/ggml-org/llama.cpp/pull/14741 to ollama. Since the model graph is defined differently in ollama than in llama.cpp, the heuristic used to identify the `per_layer_projection` tensor and exclude it from batch-size determination had to be adapted. As a consequence, the patch will need to be maintained even after llama.cpp is updated to a commit that contains https://github.com/ggml-org/llama.cpp/pull/14741.
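To illustrate the idea (this is a hedged sketch, not the actual patch code — the node names, shapes, and helper are hypothetical), such a heuristic walks the compute graph, infers the batch size from matmul output shapes, and skips tensors like the per-layer projection whose second dimension does not track the token count:

```python
# Hypothetical sketch of a CUDA Graph batch-size heuristic.
# Node names, ops, and shapes are illustrative, not the real ggml code.

def infer_batch_size(nodes, excluded_substrings=("per_layer_proj",)):
    """Infer batch size from MUL_MAT output shapes, skipping excluded tensors.

    Each node is a dict with 'name', 'op', and an output 'shape' given as
    (ne0, ne1), loosely following ggml's dimension ordering.
    """
    batch = 1
    for node in nodes:
        if node["op"] != "MUL_MAT":
            continue
        # The per-layer projection's second dimension is a model constant,
        # not the token count, so counting it would inflate the batch size
        # and defeat graph reuse across decode steps.
        if any(s in node["name"] for s in excluded_substrings):
            continue
        batch = max(batch, node["shape"][1])
    return batch


graph = [
    {"name": "attn_out", "op": "MUL_MAT", "shape": (2048, 7)},
    {"name": "per_layer_proj", "op": "MUL_MAT", "shape": (256, 2048)},
]
print(infer_batch_size(graph))  # 7 (2048 without the exclusion)
```

Without the exclusion, the projection's constant dimension would be mistaken for the batch size, which is the mis-detection the ported heuristic guards against.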

On an RTX PRO 6000 Max-Q under Windows, this PR improves performance by roughly 2.5x:

| Model | Configuration | Tokens/sec |
| -- | -- | -- |
| gemma3n:e2b | CG ON | 103 |
| gemma3n:e2b | CG OFF | 43 |
| gemma3n:e4b | CG ON | 79 |
| gemma3n:e4b | CG OFF | 35 |
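As a quick check of the claimed speedup (my arithmetic on the table above, not a figure from the PR), the per-model ratios work out to about 2.3x to 2.4x:

```python
# Speedup ratios derived from the table above (tokens/sec, CG ON vs CG OFF).
results = {"gemma3n:e2b": (103, 43), "gemma3n:e4b": (79, 35)}
for model, (on, off) in results.items():
    print(f"{model}: {on / off:.2f}x")
# gemma3n:e2b: 2.40x
# gemma3n:e4b: 2.26x
```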

Thanks to @mxyng for the changes to the gemma3n model graph definition in c4de3eaa3e, which make the check more robust.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2025-11-12 16:33:02 -06:00

Reference: github-starred/ollama-ollama#12304