[GH-ISSUE #14843] x/flux2-klein: panic in applyRoPEQwen3 — x.Shape() returns empty slice on Windows/CUDA (MLX runner) #56088

Open
opened 2026-04-29 10:15:03 -05:00 by GiteaMirror · 1 comment

Originally created by @JaromirKuba on GitHub (Mar 14, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14843


What happened

When running x/flux2-klein:latest on Windows with CUDA backend, the MLX runner crashes with a Go panic during inference. The model loads successfully (5.3 GB VRAM), but generation immediately fails regardless of prompt length or engine settings.

Expected behavior

Image generation completes successfully, as it presumably does on macOS/Metal.

Actual behavior

panic: runtime error: index out of range [0] with length 0
goroutine 9 [running]:
github.com/ollama/ollama/x/imagegen/models/qwen3.applyRoPEQwen3(...)
    x/imagegen/models/qwen3/text_encoder.go:47

Line 47 is B := shape[0] where shape := x.Shape() — meaning x.Shape() returned an empty slice []int32{} for the input tensor.

Call chain

flux2.Model.generate
  → qwen3.TextEncoder.EncodePromptWithLayers
    → qwen3.TextEncoder.ForwardWithLayerOutputs
      → qwen3.Block.Forward
        → qwen3.Attention.Forward
          → applyRoPEQwen3   ← panic here

Root cause hypothesis

The MLX CUDA binding for .Shape() may not correctly return the tensor shape for arrays created or passed on the CUDA side, whereas the Metal backend handles this correctly. The function applyRoPEQwen3 itself appears correct; the issue is that the tensor x arriving from (*Attention).Forward is uninitialized or has no shape populated.

Relevant function (text_encoder.go)

func applyRoPEQwen3(x *mlx.Array, seqLen int32, theta float32) *mlx.Array {
    shape := x.Shape()
    B := shape[0]  // ← panic: index out of range, shape is []int32{}
    L := shape[1]
    H := shape[2]
    D := shape[3]
    ...
}
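One way to make this failure mode less opaque would be a rank check before indexing, so the runner reports a descriptive error instead of panicking on an empty shape slice. A hedged sketch as a self-contained program (dims4 and its error message are illustrative, not part of the actual codebase):

```go
package main

import "fmt"

// dims4 safely unpacks a rank-4 tensor shape (hypothetical helper).
// It returns an error instead of panicking when a binding hands back
// an empty or wrong-rank shape slice, as in this issue.
func dims4(shape []int32) (B, L, H, D int32, err error) {
	if len(shape) != 4 {
		return 0, 0, 0, 0, fmt.Errorf("expected rank-4 shape, got %v (rank %d)", shape, len(shape))
	}
	return shape[0], shape[1], shape[2], shape[3], nil
}

func main() {
	// The failing case from this issue: Shape() returned []int32{}.
	if _, _, _, _, err := dims4([]int32{}); err != nil {
		fmt.Println("caught:", err)
	}
	// The expected case: [batch, seqLen, numHeads, headDim].
	B, L, H, D, _ := dims4([]int32{1, 77, 16, 128})
	fmt.Println(B, L, H, D) // 1 77 16 128
}
```

A check like this would not fix the underlying binding bug, but it would have turned the panic into an actionable error naming the bad shape.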

Environment

Ollama version: 0.18.0
OS: Windows (amd64)
GPU: NVIDIA GeForce RTX 3050 6GB Laptop GPU
Compute capability: 8.6
CUDA driver: 13.1
VRAM total / available: 6.0 GiB / 5.7 GiB
Model: x/flux2-klein:latest
Model loaded: Yes (2.65–2.80s, 5.3 GB VRAM)
Crash point: First inference call

Startup log (relevant excerpt)

starting mlx runner subprocess  model=x/flux2-klein:latest  mode=imagegen
MLX library initialized
detected image model type  type=Flux2KleinPipeline
Loading FLUX.2 Klein model...
  Loading tokenizer... ✓
  Loading text encoder... ✓
  Loading transformer... ✓
  Loading VAE... ✓
  Evaluating weights... ✓
  Loaded in 2.65s (5.3 GB VRAM)
mlx runner listening  addr=127.0.0.1:xxxxx
[panic as above]

Also tested

  • OLLAMA_NEW_ENGINE=true → identical panic, identical stack trace. Confirms MLX runner is unaffected by this flag.
  • Short prompt "cat" (3 chars) → identical panic. Confirms crash is independent of prompt length or tokenization.

In both tests, x.Shape() returns []int32{} unconditionally, which suggests the issue lies not in prompt-processing logic but in the MLX CUDA binding for Shape() itself.

Additional notes

  • Likely not reproducible on Apple Silicon / Metal due to different MLX backend implementation
  • The model loads and all components initialize correctly — crash only occurs at inference time when tensor shape is first accessed
GiteaMirror added the bug label 2026-04-29 10:15:03 -05:00

@YonTracks commented on GitHub (Mar 17, 2026):

Howdy, I managed to get this working on Windows (RTX 3060 12GB, 32GB RAM); it seems to be the same issue with x/flux2-klein. For me the fix was in x\imagegen\safetensors\loader.go, in the LoadLinearLayer() function: after dequantized := mlx.Dequantize(weight, scales, qbiases, groupSize, bits, mode) I forced materialization, with success (at 512x512 it does not fail). Good luck!

There are still edge-case memory issues at the default size (1024x1024, or anything above 512) with x/flux2-klein, though. It only works sometimes and mostly fails in the Data() function in mlx.go, at ptr := C.mlx_array_data_float32(arr.c), with this error: time=2026-03-17T11:51:52.119+10:00 level=WARN source=server.go:164 msg=mlx-runner msg="Exception 0xc0000005 0x0 0x0 0x7ffcf733c056"

I'm sure the Ollama team is on to it and I'm most likely way off the real cause, but it's a start, and it confirms Windows will work. Have fun, good luck!

Here's the temporary fix, in x\imagegen\safetensors\loader.go, in LoadLinearLayer():
...

		dequantized := mlx.Dequantize(weight, scales, qbiases, groupSize, bits, mode)
		// Force materialization before the backing quantized tensors are released.
		if bias != nil {
			mlx.Eval(dequantized, bias)
		} else {
			mlx.Eval(dequantized)
		}

		return nn.NewLinear(dequantized, bias), nil
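The hazard this works around is a general lazy-evaluation pattern: a deferred computation keeps referencing buffers that may be released before the graph is ever evaluated. A minimal pure-Go analogy of the concept (no MLX here; thunk, dequantize, and the release step are all illustrative):

```go
package main

import "fmt"

// thunk is a deferred computation, like an unevaluated lazy graph node.
type thunk func() []float32

// dequantize returns a lazy result that still references src
// until it is forced.
func dequantize(src *[]float32) thunk {
	return func() []float32 {
		out := make([]float32, len(*src))
		copy(out, *src)
		return out
	}
}

func main() {
	data := []float32{1, 2, 3}
	lazy := dequantize(&data)

	// Forcing evaluation here (like the mlx.Eval call above) snapshots
	// the result while the backing buffer is still valid.
	materialized := lazy()

	// Simulate the loader releasing the quantized tensor afterwards.
	data = nil

	fmt.Println(materialized) // [1 2 3]
}
```

Eagerly materializing right after Dequantize trades some loading speed for not depending on the quantized source tensors staying alive until first inference.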

Here's the working log:

time=2026-03-17T11:59:02.103+10:00 level=INFO source=server.go:157 msg=mlx-runner msg="Loading FLUX.2 Klein model from manifest: x/flux2-klein:latest..."
time=2026-03-17T11:59:02.409+10:00 level=INFO source=server.go:157 msg=mlx-runner msg="  Loading tokenizer... ✓"
time=2026-03-17T11:59:10.392+10:00 level=INFO source=server.go:157 msg=mlx-runner msg="  Loading text encoder... ✓"
time=2026-03-17T11:59:33.971+10:00 level=INFO source=server.go:157 msg=mlx-runner msg="  Loading transformer... ✓"
time=2026-03-17T11:59:35.758+10:00 level=INFO source=server.go:157 msg=mlx-runner msg="  Loading VAE... ✓"
time=2026-03-17T11:59:35.759+10:00 level=INFO source=server.go:157 msg=mlx-runner msg="  Evaluating weights... ✓"
time=2026-03-17T11:59:35.766+10:00 level=INFO source=server.go:157 msg=mlx-runner msg="  Loaded in 33.66s (19.2 GB VRAM)"
time=2026-03-17T11:59:35.766+10:00 level=WARN source=server.go:164 msg=mlx-runner msg="time=2026-03-17T11:59:35.766+10:00 level=INFO msg=\"mlx runner listening\" addr=127.0.0.1:55116"
time=2026-03-17T11:59:35.788+10:00 level=INFO source=server.go:232 msg="mlx runner is ready" port=55116
time=2026-03-17T11:59:35.788+10:00 level=DEBUG source=sched.go:573 msg="finished setting up" runner.name=registry.ollama.ai/x/flux2-klein:latest runner.size="5.3 GiB" runner.vram="5.3 GiB" runner.parallel=1 runner.pid=10492 runner.model=digest:8c7f37810489d9d7923ff911020de0d4cb527455af1bff7952ae6bd2f4b1cc5c runner.num_ctx=16384
time=2026-03-17T11:59:35.789+10:00 level=INFO source=server.go:157 msg=mlx-runner msg="  Output: 512x512"
time=2026-03-17T11:59:35.795+10:00 level=INFO source=server.go:157 msg=mlx-runner msg="  Encoding prompt... ✓"
time=2026-03-17T11:59:50.143+10:00 level=INFO source=server.go:157 msg=mlx-runner msg="  Evaluating setup... ✓ (14.35s, 19.2 GB)"
time=2026-03-17T11:59:50.144+10:00 level=DEBUG source=server.go:313 msg="mlx response parse error" error="unexpected end of JSON input" line=""
time=2026-03-17T12:00:23.724+10:00 level=INFO source=server.go:157 msg=mlx-runner msg="    step 1: 33.57s (JIT warmup), peak 19.8 GB"
time=2026-03-17T12:00:23.724+10:00 level=DEBUG source=server.go:313 msg="mlx response parse error" error="unexpected end of JSON input" line=""
time=2026-03-17T12:00:57.506+10:00 level=INFO source=server.go:157 msg=mlx-runner msg="    step 2: 33.78s, peak 19.8 GB"
time=2026-03-17T12:00:57.506+10:00 level=DEBUG source=server.go:313 msg="mlx response parse error" error="unexpected end of JSON input" line=""
time=2026-03-17T12:01:31.318+10:00 level=INFO source=server.go:157 msg=mlx-runner msg="    step 3: 33.81s, peak 19.8 GB"
time=2026-03-17T12:01:31.318+10:00 level=DEBUG source=server.go:313 msg="mlx response parse error" error="unexpected end of JSON input" line=""
time=2026-03-17T12:02:05.073+10:00 level=INFO source=server.go:157 msg=mlx-runner msg="    step 4: 33.75s, peak 19.8 GB"
time=2026-03-17T12:02:05.073+10:00 level=INFO source=server.go:157 msg=mlx-runner msg="  Denoised 4 steps in 134.93s (33.73s/step), peak 19.8 GB"
time=2026-03-17T12:02:05.073+10:00 level=DEBUG source=server.go:313 msg="mlx response parse error" error="unexpected end of JSON input" line=""
time=2026-03-17T12:02:08.560+10:00 level=INFO source=server.go:157 msg=mlx-runner msg="  Decoding VAE...  "
time=2026-03-17T12:02:08.560+10:00 level=INFO source=server.go:157 msg=mlx-runner msg="✓ (3.49s, peak 19.8 GB)"
time=2026-03-17T12:02:08.560+10:00 level=INFO source=server.go:157 msg=mlx-runner msg="Generated in 152.77s (4 steps)"
[GIN] 2026/03/17 - 12:02:08 | 200 |          3m7s |       127.0.0.1 | POST     "/api/generate"
time=2026-03-17T12:02:08.616+10:00 level=DEBUG source=sched.go:581 msg="context for request finished"
Reference: github-starred/ollama#56088