[PR #15978] [CLOSED] fix(ggml-cuda): gate INT_MAX asserts for contiguous tensors in ggml_cuda_cpy #77676

Closed
opened 2026-05-05 10:21:06 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/15978
Author: @glennneuber
Created: 5/5/2026
Status: Closed

Base: main ← Head: fix/ggml-cuda-cpy-int-max-contiguous


📝 Commits (10+)

  • 2d48cd8 feat(gemma4vision): add Gemma 4 visual token ladder helpers
  • 6c5e815 feat(api): add image_min_tokens and image_max_tokens to Options
  • d6b1fb5 feat(model): add MultimodalBudgetEncoder interface
  • 63f0d7a feat(gemma4): apply per-request vision token budgets
  • 82d5d8f feat(ollamarunner): plumb vision token budgets through completion
  • 41e382d fix(scheduler): reload GGUF runner when image token options change
  • 0c4a611 chore(mlxrunner): debug log when image token budgets are set
  • e563a3b docs(design): add Gemma 4 vision token budget design note
  • e90d7d9 docs(design): add PR body template for GitHub
  • 1108f16 fix(ggml-cuda): gate INT_MAX asserts on non-contiguous tensors in ggml_cuda_cpy

📊 Changes

16 files changed (+740 additions, -37 deletions)

View changed files

📝 CMakeLists.txt (+18 -0)
📝 api/types.go (+4 -0)
➕ docs/design/PR_BODY.md (+17 -0)
➕ docs/design/README.md (+3 -0)
➕ docs/design/gemma4-vision-token-budgets.md (+422 -0)
➕ internal/gemma4vision/budget.go (+67 -0)
➕ internal/gemma4vision/budget_test.go (+41 -0)
📝 ml/backend/ggml/ggml/src/ggml-cuda/cpy.cu (+9 -3)
📝 model/model.go (+6 -0)
📝 model/models/gemma4/model.go (+28 -6)
📝 model/models/gemma4/process_image.go (+35 -22)
➕ model/models/gemma4/process_image_test.go (+45 -0)
📝 runner/ollamarunner/runner.go (+27 -5)
📝 server/sched.go (+5 -1)
📝 server/sched_test.go (+7 -0)
📝 x/mlxrunner/server.go (+6 -0)

📄 Description

Summary

GGML's ggml_cuda_cpy (shared by the CUDA and HIP builds of the same sources) applied GGML_ASSERT(ggml_nbytes(src) <= INT_MAX) to every copy before dispatch. Contiguous same-type copies use cudaMemcpyAsync, whose length parameter is a size_t; contiguous type-conversion copies use ggml_cpy_scalar_contiguous_cuda, which works with int64_t counts. The blanket assert therefore aborted valid large contiguous copies that both paths could handle.

Symptom

Gemma 4 worst-case vision graph reservation could exceed 2 GiB − 1 bytes on a single tensor and crash during ggml_backend_sched_reserve with GGML_ASSERT(ggml_nbytes(src0) <= INT_MAX) failed.

Changes

  • cpy.cu: Remove top-level nbytes <= INT_MAX asserts; apply ne / nbytes limits only when !contiguous_srcs.
  • CMakeLists.txt: Optional — prepend ROCM_PATH / /opt/rocm to CMAKE_PREFIX_PATH for HIP discovery in dev shells (happy to drop from upstream PR if undesired).
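A minimal C++ sketch of the gated dispatch described in the cpy.cu bullet (enum and function names are hypothetical stand-ins, not the actual ggml_cuda_cpy symbols):

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// Sketch only: which copy path handles a request, and where the
// INT_MAX assert now applies.
enum class CopyPath { MemcpyAsync, ScalarContiguous, NonContiguous };

CopyPath dispatch_copy(bool contiguous_srcs, bool same_type, size_t nbytes) {
    if (contiguous_srcs) {
        // size_t / int64_t paths: no INT_MAX restriction required.
        return same_type ? CopyPath::MemcpyAsync : CopyPath::ScalarContiguous;
    }
    // The limit applies only to the non-contiguous kernels, which index
    // with 32-bit arithmetic (stand-in for GGML_ASSERT).
    assert(nbytes <= (size_t) INT_MAX);
    return CopyPath::NonContiguous;
}
```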

Rebuild note

After editing .cu sources, rebuild the backend shared libraries (cmake --build build); go build alone does not recompile them.

Testing

Gemma 4 + ROCm: model load, worst-case reserve, and vision with high image token budgets.

Made with Cursor


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-05-05 10:21:06 -05:00

Reference: github-starred/ollama#77676