[PR #14150] fix(cuda): prevent integer overflow in RoPE kernels by using int64_t #45788

Open
opened 2026-04-25 01:25:42 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/14150
Author: @BurakBebek1
Created: 2/8/2026
Status: 🔄 Open

Base: main ← Head: fix/rope-integer-overflow


📝 Commits (1)

  • 84c9433 fix(cuda): prevent integer overflow in RoPE kernels by using int64_t

📊 Changes

1 file changed (+7 additions, -4 deletions)

View changed files

📝 ml/backend/ggml/ggml/src/ggml-cuda/cpy.cu (+7 -4)

📄 Description

This fixes the GGML_ASSERT(a->ne[2] * 4 == b->ne[0]) failure when processing large tensors in vision models like glm-ocr.

This PR addresses a critical GGML_ASSERT failure occurring during the processing of large vision-based models (like glm-ocr) on systems with high VRAM availability. The issue stems from a 32-bit integer overflow when calculating tensor offsets in the RoPE (Rotary Positional Embedding) kernels.

The Bug

When processing high-resolution images or large batches, the tensor size/offset calculation exceeds the maximum value of a signed 32-bit integer (2^31 − 1 = 2,147,483,647). This leads to incorrect memory addressing, causing the assertion a->ne[2] * 4 == b->ne[0] to fail, followed by a SIGABRT.
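As a standalone illustration (hypothetical dimensions, not the actual ggml code), the following shows how the same offset arithmetic wraps in 32 bits but stays exact once the first operand is widened to 64 bits:

```cpp
// overflow_demo.cpp -- minimal sketch of the failure mode.
// The dimensions are hypothetical, chosen only so that the byte
// count crosses the 2^31 - 1 boundary.
#include <cstdint>
#include <cstdio>

int main() {
    int ne0 = 6144, ne1 = 4096, ne2 = 24; // hypothetical tensor dims

    // All operands are int, so the products are evaluated in 32 bits.
    // 6144 * 4096 * 24 * 4 = 2,415,919,104 > INT32_MAX, so this
    // wraps (signed overflow is undefined behavior in C++).
    int bad_offset = ne0 * ne1 * ne2 * 4;

    // Widening the first operand forces 64-bit arithmetic throughout.
    int64_t good_offset = (int64_t) ne0 * ne1 * ne2 * 4;

    printf("32-bit offset: %d\n", bad_offset);                // garbage/negative
    printf("64-bit offset: %lld\n", (long long) good_offset); // 2415919104
    return 0;
}
```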

The Fix

Changed the offset and dimension variables in the CUDA kernel dispatch logic (specifically in cpy.cu / ggml-cuda.cu) from int to int64_t. This ensures that memory offsets are calculated correctly even when a single tensor operation spans more than 2 GB of memory.
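A simplified sketch of the kind of change involved (illustrative only; the real kernel signatures and index math in cpy.cu differ):

```cuda
// Illustrative only -- not the literal diff. A strided copy kernel
// whose size/stride parameters are carried as int64_t so the
// byte-offset computation can no longer wrap past 2^31 - 1.
static __global__ void cpy_f32_sketch(const char * src, char * dst,
                                      int64_t ne,     // total elements
                                      int64_t ne00,   // row length
                                      int64_t nb00,   // element stride (bytes)
                                      int64_t nb01) { // row stride (bytes)
    const int64_t i = (int64_t) blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= ne) {
        return;
    }

    // Every term is int64_t, so the offset stays exact even for
    // tensors larger than 2 GB.
    const int64_t i00    = i % ne00;
    const int64_t i01    = i / ne00;
    const int64_t offset = i00 * nb00 + i01 * nb01;

    *(float *) (dst + offset) = *(const float *) (src + offset);
}
```

The host side has to match: the grid-size calculation and any offsets computed before the launch need the same 64-bit types, otherwise the overflow just moves into the dispatch code.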

Steps to Reproduce (Verification)

  1. Environment: WSL2 / Ubuntu with NVIDIA RTX 4060 (8GB VRAM).
  2. Run the glm-ocr model with a high-resolution image.
  3. Observe VRAM usage climbing to ~4.3 GB in nvidia-smi.
  4. Without this patch: System crashes with GGML_ASSERT.
  5. With this patch: Model successfully generates output.

Before the Fix

Error: an error was encountered while running the model: GGML_ASSERT(a->ne[2] * 4 == b->ne[0]) failed

After the Fix

./ollama run glm-ocr

>>> /mnt/c/Users/***/Downloads/seo-analytics-website-performance.png Read the text in the picture
Added image '/mnt/c/Users/***/Downloads/seo-analytics-website-performance.png'
SEO AND ANALYTICS: MAXIMIZING
YOUR WEBSITE’S PERFORMANCE


This fix is especially important for Vision-Language Models (VLMs), where image embeddings can generate large temporary tensors that easily cross the 2 GB threshold.
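For a rough sense of scale (hypothetical shape, not taken from glm-ocr): an FP32 activation with 600 million elements occupies 600,000,000 × 4 = 2,400,000,000 bytes, already past the 2,147,483,647-byte (2^31 − 1) range of a signed 32-bit offset, so a single intermediate tensor from one large image is enough to trigger the wraparound.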

Fixes: #14124


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-25 01:25:42 -05:00

Reference: github-starred/ollama#45788