[GH-ISSUE #15318] Gemma 4 (26b, 31b) crashes with segfault on DGX Spark (ARM64 + Blackwell GB10) #71857

Open
opened 2026-05-05 02:45:16 -05:00 by GiteaMirror · 6 comments

Originally created by @fgomsan on GitHub (Apr 4, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15318

Ollama version: 0.20.0
OS: Ubuntu ARM64 (DGX Spark)
GPU: NVIDIA Blackwell GB10 (128 GB unified memory)

What happened:
Both gemma4:26b and gemma4:31b crash with a segfault when loading. No other models are loaded, and 118 GB of RAM is available.

Error from logs:
llama runner terminated: exit status 2
model failed to load, this may be due to resource limitations or an internal error
A full register dump is visible in journalctl -u ollama.

Other models work fine on the same hardware: Nemotron 3 Super 120B, Qwen 3.5 35B, Qwen 2.5 72B — all load and run without issues.

Gemma 4 26b works on a MacBook Pro M3 Max (128 GB) with LM Studio / MLX.


@rick-github commented on GitHub (Apr 4, 2026):

[Server logs](https://docs.ollama.com/troubleshooting) will aid in debugging.
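
On a systemd-based Linux install, something like the following should capture the relevant log for attaching here (a sketch, assuming the default `ollama` unit name):

```
# Dump the full Ollama server log for the current boot to a file.
# Assumes the default systemd unit name "ollama"; adjust if installed differently.
journalctl -u ollama --no-pager -b > ollama-server.log
```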


@avirtuos commented on GitHub (Apr 5, 2026):

Not the OP, but I had the same issue; here are my logs.

[Explore-logs-2026-04-05 00_39_30.txt](https://github.com/user-attachments/files/26486088/Explore-logs-2026-04-05.00_39_30.txt)


@fgomsan commented on GitHub (Apr 5, 2026):

Server logs from the crash (DGX Spark, ARM64 + GB10 Blackwell, Ollama 0.20)

Reproduced with both gemma4:26b and gemma4:31b. The runner crashes with exit status 2 during model load.

Key crash indicators:
fault 0x0
pc 0xef2b8b9a7608
lr 0xef2b8b9a75f4
Null pointer dereference during layer offloading. The runner attempts to split layers between GPU and CPU (55/89 to GPU), then crashes:
time=2026-04-05T08:47:53.713 level=ERROR msg="do load request" error="Post \"http://127.0.0.1:35421/load\": EOF"
time=2026-04-05T08:47:53.723 level=ERROR msg="do load request" error="dial tcp 127.0.0.1:35421: connect: connection refused"
time=2026-04-05T08:47:53.799 level=ERROR msg="llama runner terminated" error="exit status 2"
System info:

• Device: NVIDIA GB10, compute capability 12.1, CUDA backend /usr/local/lib/ollama/cuda_v13/libggml-cuda.so
• GPU available: ~54 GiB free
• System RAM: 121.7 GiB total
• OS: Ubuntu ARM64
• CUDA ARCHS: 750,800,860,870,890,900,1000,1030,1100,1200,1210

Attempted layer split:
model weights CUDA0: 47.9 GiB
model weights CPU: 32.9 GiB
kv cache CUDA0: 3.2 GiB
kv cache CPU: 1.9 GiB
compute graph CUDA0: 2.6 GiB
compute graph CPU: 40.2 MiB
total: 88.6 GiB
Note: Same GGUF runs fine via llama.cpp directly (port 8085, 54 tok/s). The issue appears specific to Ollama's runner on ARM64 + GB10.
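
A llama.cpp comparison run of this sort would look roughly like the following; the model path and GPU-layer count here are illustrative, not taken from the report:

```
# Illustrative llama.cpp server invocation on the same GGUF (hypothetical model path).
./llama-server -m ./gemma4-26b-q4_K_M.gguf --port 8085 -ngl 99
```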


@rick-github commented on GitHub (Apr 5, 2026):

Unable to replicate. Please set `OLLAMA_DEBUG=2` in the server environment and provide the full log starting at the line that says `server config`.
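
On a systemd install, one way to set this is via a drop-in override (a sketch, assuming the default `ollama` unit):

```
# Open a drop-in override for the service and add the debug setting under [Service]:
sudo systemctl edit ollama
#   [Service]
#   Environment="OLLAMA_DEBUG=2"
# Restart and follow the log from the "server config" line onward.
sudo systemctl restart ollama
journalctl -u ollama -f
```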


@johnlockejrr commented on GitHub (Apr 11, 2026):

I got the same problem. In my case the issue was that I had set `Environment="OLLAMA_NUM_PARALLEL=4"` in the server config. Removing that, and boom, it works!
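
A quick sketch of how to confirm and clear the setting on a systemd install (assuming the unit is named `ollama` and the variable lives in a drop-in override):

```
# Show the unit file plus any drop-ins to find where OLLAMA_NUM_PARALLEL is set.
systemctl cat ollama
# Remove or comment out the Environment="OLLAMA_NUM_PARALLEL=4" line in that override, then:
sudo systemctl daemon-reload
sudo systemctl restart ollama
```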

```
incognito@gx10-6100:~/dev/ollama$ ollama run gemma4:26b-a4b-it-q4_K_M
>>> Hi!
Thinking...
The user said "Hi!".
This is a standard greeting.
The goal is to respond politely and offer assistance.

    *   "Hello! How can I help you today?"
    *   "Hi there! What's on your mind?"
    *   "Greetings! Is there anything I can assist you with?"

A simple, friendly, and helpful response is best.
...done thinking.

Hello! How can I help you today?

>>> Send a message (/? for help)
```

@caquino commented on GitHub (Apr 14, 2026):

Confirming this on a second DGX Spark (Ollama 0.20.6, GB10/Blackwell, Ubuntu ARM64).

Same gemma4:31b segfault; removing OLLAMA_NUM_PARALLEL=4 makes it work. Every other model on this hardware (qwen3-coder-next:80b, deepseek-r1:70b, glm-4.7-flash:30b, llama3.3:70b, mistral-small:24b, qwen3.5:35b) runs fine with NUM_PARALLEL=4.

Full OLLAMA_DEBUG=2 log captured per @rick-github's earlier request: https://gist.github.com/caquino/36c5e6ee7c4d15f575c1304d5380f110

While capturing it, I noticed the log actually contains two distinct crashes, and I think only one of them is this issue. The first may well be expected behavior that I simply never noticed in the logs before.

Wanted to flag both so you can tell me if I'm reading them right:

Crash A: startup, line ~150, looks unrelated to this issue. The cuda_v12 backend aborts during init, then cuda_v13 loads successfully right after (lines ~178+), and the daemon recovers and keeps running. So it's non-fatal because the v13 fallback works, but it does mean Ollama prints a panic + goroutine dump on every startup on this hardware. My read is the cuda_v12 build wasn't compiled with arch 121 (Blackwell), and the assertion fires whenever it sees a GB10. Not sure if this is expected.
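
For what it's worth, the compute capability the driver reports can be checked with something like this (assuming a driver new enough to support the compute_cap query field):

```
# Report the GPU name and compute capability; the GB10 should show 12.1 here.
nvidia-smi --query-gpu=name,compute_cap --format=csv
```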

Crash B: model load, line ~3377, looks like the actual issue. After enabling flash attention for gemma4, the runner subprocess SIGSEGVs with fault 0x0. One note: since the segfault happens right after the "enabling flash attention" log line, I also tried disabling flash attention to see if it was the cause, but the crash is identical with flash attention disabled.
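
A sketch of how such a test can be run, assuming the standard OLLAMA_FLASH_ATTENTION toggle is what controls it (illustrative commands, not the exact ones from this report):

```
# Run the server in the foreground with flash attention explicitly disabled, then retry the model.
OLLAMA_FLASH_ATTENTION=0 ollama serve
# In another shell:
ollama run gemma4:31b
```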

If there's something specific you need, I'm happy to keep iterating since I have the reproducer hot.
