[GH-ISSUE #15531] Image generation crashes on NVIDIA Blackwell GPUs (RTX 5070) — MLX-C rms_norm returns 0-dim array #56437

Open
opened 2026-04-29 10:49:28 -05:00 by GiteaMirror · 1 comment

Originally created by @johnohhh1 on GitHub (Apr 13, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15531

Environment

  • Ollama version: v0.20.6
  • OS: Ubuntu 26.04 (kernel 7.0.0-13-generic)
  • GPU: NVIDIA GeForce RTX 5070 (Blackwell, sm_120, compute 12.0)
  • Driver: 580.142, CUDA 13.0
  • MLX lib: mlx_cuda_v13/libmlx.so — version 0.31.1-23-g38ad257, has native sm_120 code

Steps to Reproduce

ollama pull x/flux2-klein
ollama run x/flux2-klein "a red apple on a table"

Also reproducible with x/z-image-turbo.

Expected Behavior

Image generated and saved to current directory.

Actual Behavior

Error: 500 Internal Server Error: Post "http://127.0.0.1:<port>/completion": EOF

Server log shows the MLX runner panics:

runtime error: index out of range [0] with length 0
goroutine 66 [running]:
github.com/ollama/ollama/x/imagegen/models/qwen3.applyRoPEQwen3(...)
    x/imagegen/models/qwen3/text_encoder.go:47

The crash occurs at text_encoder.go:47 where x.Shape() returns an empty slice after mlx_fast_rms_norm produces a 0-dimensional array.

Root Cause Analysis

The MLX runner loads the model successfully (tokenizer ✓, text encoder ✓, transformer ✓, VAE ✓, 5.3 GB VRAM), starts listening, then panics on the first /completion request.

The call chain is:

  1. Attention.Forward() calls QNorm.Forward(q, 1e-6)
  2. Which calls mlx.RMSNorm(x, weight, eps) → C.mlx_fast_rms_norm(&res, x.c, weight.c, eps, stream)
  3. The returned array has ndim=0 (empty shape)
  4. applyRoPEQwen3 then panics accessing shape[0] on the empty slice
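Step 4 can be demonstrated in isolation: indexing into the empty slice that x.Shape() returns for a 0-dim array produces exactly the panic message in the log. A minimal sketch (hypothetical code, not the actual qwen3 implementation):

```go
package main

import "fmt"

// dim0 mimics the unguarded indexing at text_encoder.go:47: it trusts that
// shape is non-empty, so an empty shape slice (0-dim array) panics.
func dim0(shape []int32) int32 {
	return shape[0]
}

func main() {
	fmt.Println(dim0([]int32{1, 512, 32, 128})) // prints 1

	// Recover only to show the message; the real runner does not recover,
	// so the goroutine dies and the client sees EOF.
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered:", r) // index out of range [0] with length 0
		}
	}()
	_ = dim0([]int32{})
}
```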

Key finding: The Python MLX package at the same version (0.31.1) works correctly on this GPU:

$ pip install "mlx[cuda13]"

import mlx.core as mx
x = mx.random.normal((1, 512, 32, 128))
weight = mx.ones((128,))
result = mx.fast.rms_norm(x, weight, eps=1e-6)
mx.eval(result)
print(result.shape)  # (1, 512, 32, 128) — correct!

The shipped libmlx.so (0.31.1-23-g38ad257, 23 commits ahead of release) has the bug. The pip release libmlx.so (0.31.1 clean) does not. The issue appears to be in the 23 extra commits in the Ollama fork.

Additionally, these same image models work perfectly on the same GPU via ComfyUI (PyTorch CUDA), confirming the hardware and CUDA drivers are fine.

Workaround

We built the Linux port of the Ollama desktop app (https://github.com/johnohhh1/ollama-webchat-ubuntu) and worked around this by routing image generation through a local PyTorch/diffusers server instead of the broken MLX runner. The desktop app detects CapabilityImage models and calls a local server using diffusers.AutoPipelineForText2Image with enable_model_cpu_offload(). Same models, same GPU, works perfectly.
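The routing decision can be sketched roughly as follows (illustrative names only, not the actual desktop-app code; the real app proxies the request to a local PyTorch/diffusers server):

```go
package main

import "fmt"

// selectRunner is a hypothetical version of the fallback logic: image-capable
// models are sent to the diffusers server whenever the MLX runner is known
// to be broken on the current GPU; everything else stays on MLX.
func selectRunner(capabilities []string, mlxHealthy bool) string {
	for _, c := range capabilities {
		if c == "CapabilityImage" && !mlxHealthy {
			return "diffusers" // local PyTorch server (AutoPipelineForText2Image)
		}
	}
	return "mlx"
}

func main() {
	fmt.Println(selectRunner([]string{"CapabilityImage"}, false)) // diffusers
	fmt.Println(selectRunner([]string{"CapabilityImage"}, true))  // mlx
}
```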

Suggested Fix

Any of the following:

  1. Rebuild the shipped libmlx.so from the clean 0.31.1 release tag (the pip wheels work)
  2. Investigate what the 23 extra commits (38ad257) broke in the CUDA rms_norm kernel path
  3. Add a bounds check in applyRoPEQwen3 so it returns a meaningful error instead of panicking on 0-dim arrays
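Option 3 could look roughly like this (a sketch only; it assumes the call site can propagate an error rather than index the shape directly, and the function name is hypothetical):

```go
package main

import "fmt"

// validateShape is a hypothetical guard: reject anything that is not the
// expected 4-D (batch, seq, heads, headDim) tensor before any shape[i]
// indexing, so a broken libmlx.so produces a descriptive error, not a panic.
func validateShape(shape []int32) error {
	if len(shape) != 4 {
		return fmt.Errorf("rms_norm output has %d dims, want 4 (shape=%v); possible broken libmlx.so build",
			len(shape), shape)
	}
	return nil
}

func main() {
	fmt.Println(validateShape([]int32{1, 512, 32, 128})) // <nil>
	fmt.Println(validateShape([]int32{}))                // error instead of a panic
}
```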

@PureBlissAK commented on GitHub (Apr 18, 2026):

🤖 Automated Triage & Analysis Report

Issue: #15531
Analyzed: 2026-04-18T18:21:21.389066

Analysis

  • Type: unknown
  • Severity: medium
  • Components: unknown

Implementation Plan

  • Effort: medium
  • Steps:

This issue has been triaged and marked for implementation.


Reference: github-starred/ollama#56437