[GH-ISSUE #15318] Gemma 4 (26b, 31b) crashes with segfault on DGX Spark (ARM64 + Blackwell GB10) #9797

Open
opened 2026-04-12 22:40:34 -05:00 by GiteaMirror · 5 comments

Originally created by @fgomsan on GitHub (Apr 4, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15318

Ollama version: 0.20.0
OS: Ubuntu ARM64 (DGX Spark)
GPU: NVIDIA Blackwell GB10 (128 GB unified memory)

What happened:
Both gemma4:26b and gemma4:31b crash with a segfault while loading. No other models are loaded, and 118 GB of RAM is available.

Error from logs:
llama runner terminated: exit status 2
model failed to load, this may be due to resource limitations or an internal error
The full register dump is visible in `journalctl -u ollama`.

Other models work fine on the same hardware: Nemotron 3 Super 120B, Qwen 3.5 35B, and Qwen 2.5 72B all load and run without issues.

Gemma 4 26b works on a MacBook Pro M3 Max (128 GB) with LM Studio / MLX.
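If it helps others reproduce this, the crash output can be captured from the service journal with something like the following (a minimal sketch; it assumes a systemd-managed install with the default service name `ollama`):

```
# Trigger the crash; the load failure is expected, so don't abort on it.
ollama run gemma4:26b "hello" || true

# Collect the server log (including the register dump) from the last few minutes.
journalctl -u ollama --since "10 minutes ago" --no-pager > ollama-gemma4-crash.log
```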


@rick-github commented on GitHub (Apr 4, 2026):

[Server logs](https://docs.ollama.com/troubleshooting) will aid in debugging.


@avirtuos commented on GitHub (Apr 5, 2026):

Not OP, but I had the same issue; here are my logs.

[Explore-logs-2026-04-05 00_39_30.txt](https://github.com/user-attachments/files/26486088/Explore-logs-2026-04-05.00_39_30.txt)


@fgomsan commented on GitHub (Apr 5, 2026):

Server logs from the crash (DGX Spark, ARM64 + GB10 Blackwell, Ollama 0.20)

Reproduced with both gemma4:26b and gemma4:31b. The runner crashes with exit status 2 during model load.

Key crash indicators:
fault 0x0
pc 0xef2b8b9a7608
lr 0xef2b8b9a75f4
A null pointer dereference occurs during layer offloading. The runner attempts to split layers between GPU and CPU (55 of 89 layers to the GPU), then crashes:
time=2026-04-05T08:47:53.713 level=ERROR msg="do load request" error="Post \"http://127.0.0.1:35421/load\": EOF"
time=2026-04-05T08:47:53.723 level=ERROR msg="do load request" error="dial tcp 127.0.0.1:35421: connect: connection refused"
time=2026-04-05T08:47:53.799 level=ERROR msg="llama runner terminated" error="exit status 2"
System info:

• Device: NVIDIA GB10, compute capability 12.1, CUDA backend /usr/local/lib/ollama/cuda_v13/libggml-cuda.so
• GPU available: ~54 GiB free
• System RAM: 121.7 GiB total
• OS: Ubuntu ARM64
• CUDA ARCHS: 750,800,860,870,890,900,1000,1030,1100,1200,1210

Attempted layer split:

• model weights CUDA0: 47.9 GiB
• model weights CPU: 32.9 GiB
• kv cache CUDA0: 3.2 GiB
• kv cache CPU: 1.9 GiB
• compute graph CUDA0: 2.6 GiB
• compute graph CPU: 40.2 MiB
• total: 88.6 GiB
Note: the same GGUF runs fine via llama.cpp directly (port 8085, 54 tok/s). The issue appears specific to Ollama's runner on ARM64 + GB10.
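Since the crash happens while the runner is splitting layers between GPU and CPU, one way to narrow it down is to force an all-CPU or all-GPU load and see whether the partial offload is the trigger. Below is a minimal sketch using the documented `num_gpu` option through the REST API; the model tag and the 89-layer count are taken from this report, and whether either variant avoids the crash is an open question.

```
# Force a CPU-only load (0 layers offloaded); if this loads cleanly,
# the GPU/CPU split path is suspect.
curl http://127.0.0.1:11434/api/generate -d '{
  "model": "gemma4:26b",
  "prompt": "hello",
  "options": { "num_gpu": 0 }
}'

# For comparison, request every layer on the GPU (89 layers per the log above).
curl http://127.0.0.1:11434/api/generate -d '{
  "model": "gemma4:26b",
  "prompt": "hello",
  "options": { "num_gpu": 89 }
}'
```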


@rick-github commented on GitHub (Apr 5, 2026):

Unable to replicate. Please set `OLLAMA_DEBUG=2` in the server environment and provide the full log starting at the line that says `server config`.
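For a systemd-managed install, a minimal sketch of enabling that and collecting the log (service name assumed to be `ollama`):

```
# Add the debug flag via a systemd drop-in, then restart the service.
sudo systemctl edit ollama.service
#   [Service]
#   Environment="OLLAMA_DEBUG=2"
sudo systemctl restart ollama

# Reproduce the failure, then save the service journal;
# the debug output starts at the "server config" line.
ollama run gemma4:26b "hi" || true
journalctl -u ollama --no-pager > ollama-debug.log
```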


@johnlockejrr commented on GitHub (Apr 11, 2026):

I got the same problem. In my case, the issue was that I had set `Environment="OLLAMA_NUM_PARALLEL=4"` in the server config. Removing that, boom! It works:

incognito@gx10-6100:~/dev/ollama$ ollama run gemma4:26b-a4b-it-q4_K_M
>>> Hi!
Thinking...
The user said "Hi!".
This is a standard greeting.
The goal is to respond politely and offer assistance.

    *   "Hello! How can I help you today?"
    *   "Hi there! What's on your mind?"
    *   "Greetings! Is there anything I can assist you with?"

A simple, friendly, and helpful response is best.
...done thinking.

Hello! How can I help you today?

>>> Send a message (/? for help)
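
For anyone checking the same thing, a minimal sketch of finding and removing that override on a systemd install (it assumes the variable was added through a drop-in, as in the comment above):

```
# See whether OLLAMA_NUM_PARALLEL is set in the unit or any of its drop-ins.
systemctl cat ollama.service | grep -n OLLAMA_NUM_PARALLEL

# Open the drop-in, delete the Environment="OLLAMA_NUM_PARALLEL=4" line, then restart.
sudo systemctl edit ollama.service
sudo systemctl restart ollama
```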
Reference: github-starred/ollama#9797