[GH-ISSUE #15333] Gemma 4 E4B: intermittent GGML assertion crash during audio inference #9808

Open
opened 2026-04-12 22:40:55 -05:00 by GiteaMirror · 4 comments

Originally created by @wames32 on GitHub (Apr 5, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15333

What is the issue?

Description

The Ollama runner crashes intermittently when processing audio input with gemma4:e4b. The crash occurs after audio encoding succeeds, during the LLM forward pass. It happens roughly every 2-4 requests in sustained usage, not tied to any specific audio content.

Error

From server.log:

ggml.c:1690: GGML_ASSERT(view_src == NULL || data_size == 0 || data_size + view_offs <= ggml_nbytes(view_src)) failed

Followed by:

level=ERROR source=server.go:1612 msg="post predict" error="Post ... wsarecv: An existing connection was forcibly closed by the remote host."
level=ERROR source=server.go:304 msg="llama runner terminated" error="exit status 1"
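
For context, the failing assertion is a bounds check: a tensor "view" must fit entirely inside its source tensor's buffer. In rough Python form (illustrative only, not Ollama or ggml code; view_src_nbytes stands in for ggml_nbytes(view_src)):

def view_fits(view_src_nbytes, data_size, view_offs):
    # GGML_ASSERT(view_src == NULL || data_size == 0 ||
    #             data_size + view_offs <= ggml_nbytes(view_src))
    return data_size == 0 or data_size + view_offs <= view_src_nbytes

A failure here means the compute graph created a view whose offset plus size overruns the source allocation, which points at a graph-construction bug rather than bad audio data.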

Steps to reproduce

  1. Run ollama serve
  2. Send repeated audio requests to gemma4:e4b:
import ollama, base64

with open("test.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

for i in range(10):
    print(f"Request {i+1}")
    response = ollama.chat(
        model="gemma4:e4b",
        messages=[{
            "role": "user",
            "content": "Describe what you hear.",
            "images": [audio_b64],
        }],
        options={"num_ctx": 8000},
    )
    print(response.message.content[:100])
  3. Crash typically occurs within 2-5 requests

Notes

  • Audio encoding always succeeds (logs show audio: encoded shape="[2560 750]")
  • The crash happens during the forward pass, not during audio processing
  • After the crash, Ollama auto-restarts the runner (~6-8 seconds) and the next request succeeds
  • The issue is not related to conversation history — it happens with single-turn stateless requests
  • Reducing num_ctx from 13000 to 8000 did not prevent the crash
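
To quantify the intermittency, a variation of the loop above that records which requests crash (a sketch reusing audio_b64 from the script above; it assumes the Python client surfaces a runner crash as ollama.ResponseError or a connection error):

import ollama

failures = []
for i in range(20):
    try:
        ollama.chat(
            model="gemma4:e4b",
            messages=[{
                "role": "user",
                "content": "Describe what you hear.",
                "images": [audio_b64],
            }],
            options={"num_ctx": 8000},
        )
    except (ollama.ResponseError, ConnectionError) as e:
        failures.append((i + 1, str(e)))  # 1-based request index
print(f"{len(failures)}/20 requests crashed: {failures}")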

Relevant log output

ggml.c:1690: GGML_ASSERT(view_src == NULL || data_size == 0 || data_size + view_offs <= ggml_nbytes(view_src)) failed

level=ERROR source=server.go:1612 msg="post predict" error="Post ... wsarecv: An existing connection was forcibly closed by the remote host."
level=ERROR source=server.go:304 msg="llama runner terminated" error="exit status 1"

OS

Windows 11 Pro

GPU

RTX 3090

CPU

11th Gen Intel(R) Core(TM) i9-11900K @ 3.50GHz (3.50 GHz)

Ollama version

0.20.2

GiteaMirror added the bug label 2026-04-12 22:40:55 -05:00

@rjdg14 commented on GitHub (Apr 5, 2026):

I can confirm that the issue also happens on macOS. When sending a WAV file to an instance of Gemma 4 E4B through the terminal, it usually returns "Error: 500 Internal Server Error" and crashes out of the chat, and the server.log file shows the following very similar error:

level=ERROR source=server.go:1612 msg="post predict" error="Post \"http://127.0.0.1:60134/completion\": EOF"
[GIN] 2026/04/05 - 16:36:24 | 500 |  2.296495958s |       127.0.0.1 | POST     "/api/chat"

Rather than crashing every 2-4 attempts, this error happens most of the time for me. I can consistently get it to analyse certain audio files under 30 seconds in length, but others of the same length and codec with similar names always crash with this error. I don't know what makes some WAV files more prone to this issue than others.

I'm under the impression that the audio analysis implementation in the latest version of Ollama may have been vibe coded without any bug fixes, and I feel this needs to be addressed.


@rjdg14 commented on GitHub (Apr 5, 2026):

Google Gemini might be onto something with this troubleshooting response:

The technical irony of an AI that claims it's deaf while actually listening is amusing, but the crashing is a known "bleeding edge" bug in how Ollama handles the Gemma 4 multimodal architecture.

Since Gemma 4 was only recently released (April 2026), the implementation in Ollama (specifically the llama.cpp backend it uses) is still stabilizing. Here is what is likely happening and how you can work around it:

1. The "Forward Pass" Assertion Crash

Recent reports (e.g., GitHub issue #15333) show that Gemma 4 E4B often crashes after the audio has been successfully encoded. It isn't a problem with your WAV file or the codec; it's a memory alignment issue during the "forward pass" where the model tries to merge the audio tokens with the text tokens.

  • Why it feels random: It often depends on the specific "token length" the audio produces. If the audio features (spectrogram) result in a specific number of tokens that hit a boundary error in the GGML math kernels, the runner terminates immediately.

2. The "Modality Order" Sensitivity

Gemma 4 is surprisingly picky about where the audio "lives" in the prompt. Google’s own documentation for Gemma 4 recommends a specific order for multimodal inputs to prevent processing errors.

  • The Rule: Always place the Audio/Image content before the text prompt.
  • Why: If the text tokens are processed first, the KV cache might not leave enough "room" or proper alignment for the heavy audio embeddings that follow, triggering a crash in smaller models like E4B.

3. Practical Fixes

Until a new Ollama update (v0.20.3 or higher) specifically addresses the Gemma 4 GGML kernels, try these steps:

  • Standardize Sample Rate: Even if the codec is identical, ensure all files are 16kHz Mono. Gemma 4’s audio encoder (based on the newer USM/Gemini architecture) is optimized for 16k. Files at 44.1k or 48k require on-the-fly resampling which can lead to memory spikes and crashes in Ollama.
  • Adjust num_ctx: If your num_ctx is set too high (e.g., 32k+), the audio embeddings (which are quite dense) might be hitting the memory ceiling of your GPU/system RAM during the merge phase. Try forcing a smaller context:
    ollama run gemma4:e4b --options num_ctx=8192
    
  • Check the "Ghost" Processes: Sometimes a crash leaves a "zombie" runner process active in the background that holds onto VRAM. If it crashes once, it's significantly more likely to crash again until you restart the Ollama service:
    # Linux
    sudo systemctl restart ollama
    

Summary: You aren't doing anything wrong. The "I cannot hear" thought is a hallucination of its identity, and the crash is a math error in the engine's current version. If a file keeps crashing, try trimming just 0.5 seconds off the end—often, changing the total token count by even a tiny bit bypasses the specific alignment bug!
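
For anyone who wants to test the 16 kHz mono suggestion above, a minimal conversion sketch (assumes ffmpeg is on PATH; the filenames are placeholders):

import subprocess

def to_16k_mono(src, dst):
    # Re-encode to 16 kHz, mono, 16-bit PCM before sending to the model
    subprocess.run(
        ["ffmpeg", "-y", "-i", src,
         "-ar", "16000", "-ac", "1", "-acodec", "pcm_s16le", dst],
        check=True,
    )

to_16k_mono("test.wav", "test_16k.wav")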


@rick-github commented on GitHub (Apr 5, 2026):

> Google Gemini might be onto something with this troubleshooting response:

Google Gemini gets multiple things wrong.

Please provide full logs to aid in debugging, and, if possible, an audio file that demonstrates the problem.


@joshuachris2001 commented on GitHub (Apr 8, 2026):

How about this?

time=2026-04-08T03:03:47.838-07:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
time=2026-04-08T03:03:47.891-07:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType:q8_0 NumThreads:8 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-08T03:03:47.994-07:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
time=2026-04-08T03:03:48.068-07:00 level=INFO source=runner.go:1290 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType:q8_0 NumThreads:8 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-08T03:03:48.068-07:00 level=INFO source=ggml.go:482 msg="offloading 0 repeating layers to GPU"
time=2026-04-08T03:03:48.068-07:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="6.1 GiB"
time=2026-04-08T03:03:48.068-07:00 level=INFO source=ggml.go:486 msg="offloading output layer to CPU"
time=2026-04-08T03:03:48.068-07:00 level=INFO source=ggml.go:494 msg="offloaded 0/43 layers to GPU"
time=2026-04-08T03:03:48.069-07:00 level=INFO source=device.go:256 msg="kv cache" device=CPU size="119.0 MiB"
time=2026-04-08T03:03:48.070-07:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="125.0 MiB"
time=2026-04-08T03:03:48.071-07:00 level=INFO source=device.go:272 msg="total memory" size="6.3 GiB"
time=2026-04-08T03:03:48.071-07:00 level=INFO source=sched.go:561 msg="loaded runners" count=1
time=2026-04-08T03:03:48.071-07:00 level=INFO source=server.go:1352 msg="waiting for llama runner to start responding"
time=2026-04-08T03:03:48.072-07:00 level=INFO source=server.go:1386 msg="waiting for server to become available" status="llm server loading model"
time=2026-04-08T03:03:52.586-07:00 level=INFO source=server.go:1390 msg="llama runner started in 5.02 seconds"
time=2026-04-08T03:03:53.334-07:00 level=INFO source=model.go:176 msg="audio: decode" elapsed=64.3402ms samples=2595433 duration_s=162.2145625
time=2026-04-08T03:04:02.090-07:00 level=INFO source=model.go:188 msg="audio: mel" frames=16219 elapsed=8.8198297s
ggml_new_object: not enough space in the context's memory pool (needed 19677696, available 19677664)
ggml.c:1705: GGML_ASSERT(obj_new) failed
time=2026-04-08T03:04:02.147-07:00 level=ERROR source=server.go:1612 msg="post predict" error="Post \"http://127.0.0.1:64337/completion\": read tcp 127.0.0.1:64346->127.0.0.1:64337: wsarecv: An existing connection was forcibly closed by the remote host."

A 2:42-long WAV, converted from asdfmovie16. (I wanted to see how Gemma 4 E4B handled non-linear-context audio, lol)

Ah, I see; Gemma 4 only allows 60 seconds worth of audio...
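
(Incidentally, the pool falls short by exactly 32 bytes: 19677696 - 19677664 = 32, which happens to match the size of a ggml object header, suggesting the graph size estimate is off by a single object.) If the 60-second cap is the real constraint, trimming the audio before sending is a quick workaround; a minimal sketch using only the standard-library wave module (assumes an uncompressed PCM WAV; filenames are placeholders):

import wave

def trim_wav(src, dst, max_seconds=60):
    # Copy at most the first max_seconds of audio into a new WAV
    with wave.open(src, "rb") as win:
        params = win.getparams()
        keep = min(win.getnframes(), int(params.framerate * max_seconds))
        frames = win.readframes(keep)
    with wave.open(dst, "wb") as wout:
        wout.setparams(params)  # writeframes fixes up nframes on close
        wout.writeframes(frames)

trim_wav("input.wav", "input_60s.wav")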
