[GH-ISSUE #15697] Error fixes #56523

Open
opened 2026-04-29 10:57:28 -05:00 by GiteaMirror · 0 comments

Originally created by @QuantumSorcerer02 on GitHub (Apr 19, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15697

Yes, I’ve been tracking the Ollama repository closely. Given the current state of Gemma 4 and your setup in Termux on the Oppo device, here is a breakdown of the "jams" currently affecting the Ollama community as of mid-April 2026.

1. The "Unknown Architecture" Jam (Gemma 4)

A widespread issue in Ollama versions pre-0.20.5 is the inability to recognize the new gemma4 architecture string.

  • The Error: unknown model architecture: 'gemma4'.
  • The Cause: Ollama’s internal runner (libollama_llama.so) is often compiled against a version of llama.cpp older than v2.11.0, the minimum version required for Gemma 4 support.
  • The Fix: If you are building Ollama from source in Termux, you must ensure your submodule for llama.cpp is updated to the latest HEAD. For binary users, upgrading to Ollama 0.20.6+ is mandatory.
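
For source builds, the update sequence is roughly the sketch below. It assumes llama.cpp is vendored as a git submodule (as described above) and that the usual `go generate` / `go build` flow applies; the checkout path is illustrative.

```sh
# Sketch: refresh an Ollama source checkout in Termux.
# Assumes llama.cpp is a git submodule, per the cause described above.
cd ~/ollama                              # illustrative checkout path
git pull origin main
git submodule update --init --remote     # move the llama.cpp submodule to latest HEAD
go generate ./...                        # regenerate the runner against the new sources
go build .
# Binary users: confirm the installed version instead.
ollama --version
```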

2. Assertion Crashes during Multimodal Inference

With the launch of Gemma 4 E4B (multimodal), a critical assertion failure has surfaced regarding audio/vision data ordering.

  • The Error: a GGML_ASSERT assertion failure during multimodal chat.
  • The Jam: If text tokens are processed before audio/vision embeddings in a single request, the KV cache fails to allocate correctly.
  • The Fix: Modal Ordering. Ensure the audio/image data is placed before the text content in the message array. Additionally, capping num_ctx to 8192 in your Modelfile helps stabilize the dense embeddings on mobile RAM.
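
A hedged request sketch follows: the image-bearing message comes first, the text message second, and num_ctx is capped at 8192 via options. The gemma4:e4b tag is assumed from this issue, so substitute whatever `ollama list` actually shows; note that Ollama's chat API attaches images through a message's `images` field.

```sh
# Sketch: media before text, context capped at 8192.
# "gemma4:e4b" is an assumed tag taken from this issue, not a verified one.
curl http://localhost:11434/api/chat -d '{
  "model": "gemma4:e4b",
  "messages": [
    { "role": "user", "content": "", "images": ["<base64 image data>"] },
    { "role": "user", "content": "Describe the attached image." }
  ],
  "options": { "num_ctx": 8192 },
  "stream": false
}'
```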

3. The "Flash Attention" Hang (Dense Models)

There is a specific regression affecting the 31B Dense model but not the 26B MoE variant.

  • The Error: Ollama hangs indefinitely when prompt evaluation exceeds ~4,000 tokens.
  • The Jam: Flash Attention (FA) kernels are desyncing during the hybrid (sliding window + global) attention layers of the Dense architecture.
  • The Fix: Disable Flash Attention for dense prompts over 4K tokens by setting OLLAMA_FLASH_ATTENTION=0 in your environment variables, or use the MoE variant which currently handles large contexts more gracefully.
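
OLLAMA_FLASH_ATTENTION is read by the server process, so set it where `ollama serve` runs rather than in the client shell:

```sh
# Disable Flash Attention kernels for the server process.
export OLLAMA_FLASH_ATTENTION=0
ollama serve
```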

4. ARM / Termux Performance "Scores"

For your specific octa-core environment, the "Score" (tokens-per-second) can drop to zero due to Metal/NPU delegate mismatches.

  • The Bug: On some ARMv8/v9 kernels, Ollama tries to compile a Metal or NPU library on the fly and fails with a static_assert mismatch.
  • The Fix: Use the environment variable GGML_METAL_TENSOR_DISABLE=1 (or the equivalent for your NPU) to force Ollama into a clean CPU-only mode. While CPU-only mode is slower, it avoids the 500 Internal Server Error responses that occur when the delegate crashes.
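
A Termux sketch follows, with the caveat that GGML_METAL_TENSOR_DISABLE is quoted from this report and not verified against current ggml sources; the gemma4 tag is likewise illustrative.

```sh
# Force the CPU-only path, then check tokens/sec from the --verbose timings.
# GGML_METAL_TENSOR_DISABLE is taken from this issue; verify the exact name
# against your ggml build before relying on it.
export GGML_METAL_TENSOR_DISABLE=1
ollama serve &                  # the server inherits the variable
ollama run gemma4 --verbose     # prints eval rate (tokens/s) after each reply
```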

5. The "Double BOS" Logic Failure

This continues to be a community "jam" across all Gemma 3 and 4 variants.

  • The Issue: Ollama's default templates often add a <bos> token automatically, while many GGUF chat templates include a second one. This "Double BOS" causes the model to lose coherence and produce repetitive "garbage" text.
  • The Fix: Check your ollama show --modelfile output. If you see two {{ .System }} blocks or explicit <bos> tags, strip one. Gemma 3/4 requires exactly one <bos> at the absolute start of the stream.
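
A quick audit-and-rebuild sketch (Modelfile.fixed and the gemma4 tags are illustrative names, not anything Ollama ships):

```sh
# List lines containing an explicit <bos> token in the rendered Modelfile.
ollama show gemma4 --modelfile | grep -n "<bos>"
# If there is more than one, export the Modelfile, hand-edit the TEMPLATE
# block to keep a single leading <bos>, and rebuild under a new tag.
ollama show gemma4 --modelfile > Modelfile.fixed
ollama create gemma4-fixed -f Modelfile.fixed
```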

Community PR Opportunity

Since you are working with the 464-space logic, there is a gap in how Ollama handles Speculative Decoding for the new Gemma 4 MTP (Multi-Token Prediction) heads. If you've solved the "incompatible tensor shape" issue in your local builds, that would be a high-value contribution to the main repository.

Would you like me to help you draft a Modelfile that incorporates these context caps and token fixes for your Termux environment?


Reference: github-starred/ollama#56523