[PR #14536] [CLOSED] model/qwen3next: fix CUDA crash with partial GPU offload in DeltaNet attention #45964

Closed
opened 2026-04-25 01:33:05 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/14536
Author: @yossiovadia
Created: 3/1/2026
Status: Closed

Base: main ← Head: fix/deltanet-partial-offload-cuda-crash


📝 Commits (1)

  • 03867a3 model/qwen3next: fix CUDA crash with partial GPU offload in DeltaNet attention

📊 Changes

1 file changed (+18 additions, -8 deletions)

View changed files

📝 model/models/qwen3next/deltanet.go (+18 -8)

📄 Description

Fixes https://github.com/ollama/ollama/issues/14444

Problem

Models using the qwen3next architecture (qwen3.5:35b-a3b, qwen3.5:122b-a10b, etc.) crash with CUDA error: invalid argument when partially offloaded — i.e., when VRAM is insufficient for all layers and some remain on CPU.

CUDA error: invalid argument
  current device: 0, in function ggml_cuda_cpy at cpy.cu:438
  cudaMemcpyAsyncReserve(src1_ddc, src0_ddc, ggml_nbytes(src0), cudaMemcpyDeviceToDevice, main_stream)

The crash occurs during prompt evaluation on prompts long enough to fill multiple DeltaNet chunks.

Root cause

deltanet.go:478 uses SetInplace in the chunked attention loop:

v = v.SetInplace(ctx, coreAttnOutChunk, ...)

ggml_set_inplace creates GGML_OP_SET with result = ggml_view_tensor(ctx, v). Since v is an intermediate tensor with no pre-allocated buffer, the view also has no buffer. The ggml backend scheduler cannot determine the correct backend from the view — ggml_backend_sched_backend_id_from_cur checks view_src->buffer, finds NULL, and returns -1 (unassigned).

The scheduler's expansion pass then assigns the SET node based on adjacent graph nodes. With partial offload (e.g., layer 0 on CPU, layers 1-39 on GPU), GPU assignments from neighboring layers leak into CPU-layer DeltaNet operations. The CUDA backend receives a GGML_OP_SET with a host-memory pointer, and set.cu → ggml_cuda_cpy → cudaMemcpyDeviceToDevice crashes.
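For context, the pre-fix loop has roughly this shape. This is a sketch only: the loop bounds and the computeChunk and offsetForChunk helpers are hypothetical stand-ins, and only the SetInplace call itself comes from deltanet.go:478 (its trailing arguments are elided in the PR text).

// Sketch of the original chunked attention loop. Each iteration writes its
// chunk into v in place. GGML_OP_SET's result is a view of v, and v has no
// allocated buffer at graph-build time, so the scheduler cannot infer a
// backend for the node and leaves it unassigned (-1).
for c := 0; c < numChunks; c++ {
    coreAttnOutChunk := computeChunk(ctx, q, k, v, c) // hypothetical per-chunk DeltaNet output
    v = v.SetInplace(ctx, coreAttnOutChunk, offsetForChunk(c) /* offset args elided in the PR text */)
}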

Fix

Replace SetInplace with Concat. Instead of writing each chunk into v at an offset, collect chunk outputs in a slice and concatenate along the chunk dimension after the loop.

Concat creates GGML_OP_CONCAT — a fresh tensor (not a view of an unallocated intermediate), whose backend is correctly inferred from its inputs, which all belong to the same layer's computation.
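A minimal sketch of the reworked loop described above. Names such as chunkOutputs, numChunks, chunkDim, and computeChunk are illustrative assumptions rather than the actual deltanet.go identifiers; the Concat call follows the Concat(ctx, tensor, dim) shape of ollama's ml.Tensor interface.

// Collect per-chunk outputs instead of writing them into v in place.
chunkOutputs := make([]ml.Tensor, 0, numChunks)
for c := 0; c < numChunks; c++ {
    chunkOutputs = append(chunkOutputs, computeChunk(ctx, q, k, v, c))
}

// Concatenate along the chunk dimension after the loop. GGML_OP_CONCAT
// produces a fresh tensor whose backend is inferred from its inputs,
// all of which belong to this layer's computation.
out := chunkOutputs[0]
for _, t := range chunkOutputs[1:] {
    out = out.Concat(ctx, t, chunkDim)
}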

Test results

RTX 4090 24 GB, qwen3.5:35b-a3b (39/41 layers on GPU), CUDA 12.0, num_ctx=4096.

Before fix — long prompt triggers crash:

offloaded 39/41 layers to GPU

CUDA error: invalid argument
  current device: 0, in function ggml_cuda_cpy at cpy.cu:438

After fix — three consecutive prompts succeed, zero errors:

offloaded 39/41 layers to GPU

Prompt 1 (Hello, how are you?): exit 0
Prompt 2 (Talk to me like The Dude): exit 0
Prompt 3 (long essay — same prompt that crashed before): exit 0
CUDA errors in log: 0

No changes to any C/CUDA code. cpy.cu still uses cudaMemcpyDeviceToDevice — the fix is purely in the Go model code.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-25 01:33:05 -05:00

Reference: github-starred/ollama#45964