[GH-ISSUE #15368] Gemma 4 on Apple Silicon M5 Max: FA hang, /v1 streaming reasoning field, MLX not supported — comprehensive findings #35593

Closed · opened 2026-04-22 20:12:32 -05:00 by GiteaMirror · 1 comment

Originally created by @ErcinDedeoglu on GitHub (Apr 6, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15368

Note: This issue was created by an AI agent (Claude) summarizing findings from an extensive debugging session with a human operator. All tests were performed manually and results verified.

Environment

  • Machine: MacBook Pro M5 Max, 128 GB unified memory (107.5 GiB usable)
  • OS: macOS Sequoia
  • Ollama: v0.20.2 (installed via curl -fsSL https://ollama.com/install.sh | sh)
  • Dock: CalDigit TS5 Plus Thunderbolt 5 (multiple external monitors + webcam)
  • Models tested: gemma4:31b (Dense, Q4_K_M, 19GB), gemma4:26b (MoE, Q4_K_M, 17GB), various abliterated variants

Summary

Three distinct bugs affect Gemma 4 usability on Apple Silicon. Together they make Gemma 4 effectively unusable for agentic/tool-calling workloads on Mac.


Bug 1: Flash Attention hangs on Gemma 4 31B Dense (>500 token prompts)

Symptom: gemma4:31b hangs indefinitely during prompt eval when OLLAMA_FLASH_ATTENTION=1 and prompt exceeds ~500 tokens. CPU/GPU utilization drops to 0%. No timeout, no error — complete stall.
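
A minimal stall-detection sketch (Node 18+, run as an ES module; the filler prompt and the 2-minute cutoff are arbitrary choices, while the model tag and default port are the ones used throughout this report):

```ts
// Sketch: send a prompt well past the ~500-token threshold and treat a long
// silence as the hang. Filler text and the 2-minute cutoff are arbitrary.
const longPrompt = "lorem ipsum dolor sit amet ".repeat(300);

try {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "gemma4:31b", prompt: longPrompt, stream: false }),
    signal: AbortSignal.timeout(120_000), // abort if nothing comes back in 2 minutes
  });
  const data = await res.json();
  console.log(`ok: ${data.eval_count} tokens generated`);
} catch (err) {
  console.error("no response within 2 minutes; matches the FA hang described above", err);
}
```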

Root cause: Gemma 4's hybrid attention architecture uses 50 sliding-window layers + 10 global attention layers with different head dimensions (256 vs 512). The FA implementation doesn't handle this dual-dimension layout correctly.

Affected:

| Model | Quant | FA=1 Result |
|-------|-------|-------------|
| gemma4:31b (Dense) | Q4_K_M | Hangs at >500 tokens |
| gemma4:26b (MoE) | Q8_0 | Hangs (2x data volume crosses threshold) |
| gemma4:26b heretic (mradermacher) | Q4_K_S | Intermittent: alternating success/fail |

Not affected:

| Model | Quant | FA=1 Result |
|-------|-------|-------------|
| gemma4:26b (MoE, official) | Q4_K_M | Works perfectly |
| gemma4:26b (APEX I-Mini) | Custom | Works perfectly |
| gemma4:26b (TrevorJS EGA) | Q4_K_M | Works perfectly |

Workaround: OLLAMA_FLASH_ATTENTION=0 — but this makes the 31B Dense model very slow (~15 tok/s generation, ~168 tok/s prompt eval on M5 Max 128GB). Without FA, processing a 31K token prompt takes ~190 seconds.
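
The same workaround expressed as a launch sketch (equivalent to exporting the variable before `ollama serve`; assumes `ollama` is on PATH):

```ts
import { spawn } from "node:child_process";

// Sketch: start the server with flash attention disabled (the Bug 1 workaround).
// Equivalent to `OLLAMA_FLASH_ATTENTION=0 ollama serve` in a shell.
spawn("ollama", ["serve"], {
  env: { ...process.env, OLLAMA_FLASH_ATTENTION: "0" },
  stdio: "inherit",
});
```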

Related: #15350, #15237


Bug 2: /v1/chat/completions streaming puts all output in reasoning field

Symptom: When streaming via the OpenAI-compatible /v1/chat/completions endpoint, ALL Gemma 4 models (both 26B MoE and 31B Dense) emit their output in the reasoning field and leave the content field empty:

{"choices":[{"delta":{"role":"assistant","content":"","reasoning":"The user said..."}}]}

This breaks any client using @ai-sdk/openai-compatible or similar OpenAI SDK wrappers, because they read content (which is always empty).

Key findings:

  • /v1/chat/completions with stream:false → content is populated, reasoning also present. Works.
  • /v1/chat/completions with stream:true → content is always "", everything goes to reasoning. Broken (see the sketch after this list).
  • /api/chat with think:false → content is populated, no reasoning. Works perfectly.
  • /api/chat with think:true (default) → thinking goes to the thinking field, the response to content. Works.
  • The /v1 endpoint does not support the think parameter; think:false is silently ignored.
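
A quick way to observe the broken streaming path (Node 18+ sketch; SSE line handling is naive and ignores chunk-boundary splits; model tag is the one from this report):

```ts
// Sketch: stream a trivial request through /v1/chat/completions and print each
// delta. On affected builds, delta.content stays "" and the text shows up in
// delta.reasoning instead.
const res = await fetch("http://localhost:11434/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "gemma4:26b",
    messages: [{ role: "user", content: "Say hello." }],
    stream: true,
  }),
});

for await (const chunk of res.body!.pipeThrough(new TextDecoderStream())) {
  for (const line of chunk.split("\n")) {
    if (!line.startsWith("data: ") || line.includes("[DONE]")) continue;
    const delta = JSON.parse(line.slice(6)).choices[0].delta;
    console.log({ content: delta.content, reasoning: delta.reasoning });
  }
}
```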

Workaround: Use the native /api/chat endpoint with think:false instead of /v1/chat/completions. We built a Node.js proxy (port 11435 → 11434) that translates /v1/chat/completions requests to /api/chat with think:false and converts the native response format back to OpenAI SSE format. This works but shouldn't be necessary.
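
For illustration, a stripped-down sketch of that proxy idea (not the exact proxy described above; only the streaming path is handled and the NDJSON line buffering is simplified):

```ts
import { createServer } from "node:http";

const OLLAMA = "http://localhost:11434";

// Sketch: accept OpenAI-style requests on 11435, forward them to the native
// /api/chat endpoint with think:false, and re-emit Ollama's NDJSON stream as
// OpenAI-style SSE chunks so clients can keep reading delta.content.
createServer(async (req, res) => {
  if (req.method !== "POST" || req.url !== "/v1/chat/completions") {
    res.writeHead(404).end();
    return;
  }

  let body = "";
  for await (const chunk of req) body += chunk;
  const { model, messages } = JSON.parse(body);

  const upstream = await fetch(`${OLLAMA}/api/chat`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, messages, think: false, stream: true }),
  });

  res.writeHead(200, { "Content-Type": "text/event-stream" });

  let buf = "";
  for await (const chunk of upstream.body!.pipeThrough(new TextDecoderStream())) {
    buf += chunk;
    const lines = buf.split("\n");
    buf = lines.pop() ?? "";
    for (const line of lines) {
      if (!line.trim()) continue;
      const msg = JSON.parse(line); // native chunk: { message: { content }, done }
      res.write(`data: ${JSON.stringify({
        object: "chat.completion.chunk",
        model,
        choices: [{
          index: 0,
          delta: { role: "assistant", content: msg.message?.content ?? "" },
          finish_reason: msg.done ? "stop" : null,
        }],
      })}\n\n`);
    }
  }
  res.end("data: [DONE]\n\n");
}).listen(11435);
```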

Suggested fix: one of the following:

  1. Support think parameter on the /v1 endpoint (map it from a header or request body field)
  2. Don't emit reasoning in streaming SSE when the model's thinking output should go to content
  3. At minimum, populate content alongside reasoning so clients can read either

Related: #20995 (Vercel AI SDK)


Bug 3: MLX runner does not support Gemma 4 architecture

Symptom: the MLX runner rejects the Gemma4ForConditionalGeneration architecture with an error, and Ollama falls back to the llama.cpp/GGUF runner.

Impact on M5 Max: Without MLX, the llama.cpp runner achieves ~15 tok/s on 31B Dense and ~75 tok/s on 26B MoE. MLX would likely provide significantly better performance due to Apple Silicon unified memory optimizations.

Status: PR #15244 by @dhiltgen is in progress. This is the most impactful fix for Apple Silicon users.


Additional finding: Thunderbolt dock monitor resets during inference

Not strictly an Ollama bug, but worth documenting for Apple Silicon users:

Large model inference saturates the unified memory bus, starving the Thunderbolt display pipeline. This causes monitors connected via Thunderbolt docks (CalDigit TS5 Plus in our case) to briefly reset/go black.

Model-specific behavior on M5 Max 128GB:

| Model | Behavior |
|-------|----------|
| 31B Dense (any prompt) | Monitor reset on prompts >1000 tokens. Reads 16.5 GiB of weights per step |
| 26B MoE, warm | Stable even at 7000+ tokens. Only reads ~2 GiB per step (3.8B active params) |
| 26B MoE, cold load | One-time reset during model load, then stable |

Mitigation:

```bash
OLLAMA_NUM_PARALLEL=1
OLLAMA_MAX_LOADED_MODELS=1
OLLAMA_FLASH_ATTENTION=1  # reduces peak bandwidth (but causes Bug 1 on 31B Dense)
```

Test matrix summary

All tests on M5 Max 128GB, Ollama v0.20.2:

| Model | FA | /v1 stream | /v1 non-stream | /api/chat think:false | Speed | Memory |
|-------|----|------------|----------------|-----------------------|-------|--------|
| gemma4:31b | ON | Hangs >500 tok | Hangs >500 tok | Hangs >500 tok | N/A | N/A |
| gemma4:31b | OFF | reasoning field only | Works | Works | 15 tok/s gen, 168 tok/s prompt | 32 GiB (64K ctx) |
| gemma4:26b | ON | reasoning field only | Works | Works | 75 tok/s | 22 GiB (256K ctx) |
| gemma4:26b | OFF | reasoning field only | Works | Works | 75 tok/s | 22 GiB |

Reference

Full configuration details, model benchmarks, and workarounds documented at: https://github.com/ErcinDedeoglu/ollama-apple-silicon-guide

@rick-github commented on GitHub (Apr 6, 2026):

  1. #15350
  2. #15288
  3. #15244
Reference: github-starred/ollama#35593