[GH-ISSUE #15369] Bug: granite-4.0-1b-GGUF:Q4_K_M crashes with assertion failure in llama_sampler_dist_apply #9834

Open
opened 2026-04-12 22:42:01 -05:00 by GiteaMirror · 7 comments

Originally created by @kndtran on GitHub (Apr 6, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15369

What is the issue?

cc: @gabe-l-hart

Description

Loading hf.co/ibm-granite/granite-4.0-1b-GGUF:Q4_K_M from HuggingFace crashes immediately on the first inference call with an assertion failure in llama-sampling.cpp. The model loads successfully (all layers offloaded), but the sampler aborts during the first token generation.

The BF16 version from the Ollama library (granite4:1b) works fine. Other quantizations of the same GGUF repo have not been tested.

Reproduction

ollama run hf.co/ibm-granite/granite-4.0-1b-GGUF:Q4_K_M "Hello"

Expected: Model generates a response.
Actual: Error: 500 Internal Server Error: model runner has unexpectedly stopped

Environment

  • Ollama versions tested: 0.18.3, 0.20.0, 0.20.2 (latest as of 2026-04-06; all crash)
  • OS: macOS 26.3.1 (arm64)
  • Hardware: Apple M1 Max, 64 GB unified memory
  • Model: hf.co/ibm-granite/granite-4.0-1b-GGUF:Q4_K_M from https://huggingface.co/ibm-granite/granite-4.0-1b-GGUF

Crash Log

From ~/.ollama/logs/server.log:

load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 40 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 41/41 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   160.78 MiB
load_tensors: Metal_Mapped model buffer size =   972.82 MiB
llama_context: constructing llama_context
llama_context: n_ctx         = 131072
llama_context: n_batch       = 512
llama_context: flash_attn    = auto
llama_kv_cache: size = 10240.00 MiB (131072 cells, 40 layers, 1/1 seqs)
llama_context: Flash Attention was auto, set to enabled
time=2026-04-04T08:50:42.148-07:00 level=INFO source=server.go:1390 msg="llama runner started in 1.42 seconds"
Assertion failed: (found), function llama_sampler_dist_apply, file llama-sampling.cpp, line 660.
SIGABRT: abort
PC=0x196b8d5b0 m=7 sigcode=0
signal arrived during cgo execution

The Go stack trace shows the crash originates in:

github.com/ollama/ollama/llama._Cfunc_common_sampler_csample(0x1056e1110, 0x730c9db00, 0x1e)
    _cgo_gotypes.go:425

Notes

  • The model loads and the runner starts successfully. The crash occurs on the first sampling call, not during model loading.
  • The assertion (found) in llama_sampler_dist_apply (llama-sampling.cpp:660) suggests the sampler cannot find an expected token in the probability distribution.
  • granite4:1b from the Ollama library (BF16, 1.6B params, same architecture) works correctly.
  • Other Granite 4.0 GGUF models from HuggingFace work fine: granite-4.0-micro-GGUF (3.4B) and granite-4.0-350m-GGUF (0.4B), all quantizations (Q4_K_M, Q5_K_M, Q8_0, F16).
  • The issue is specific to the granite-4.0-1b-GGUF GGUF file, not Ollama version or hardware.
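
For anyone trying to capture more context around an abort like this, one option (a sketch; OLLAMA_DEBUG is Ollama's documented switch for verbose server logging) is to run the server with debug logging and reproduce the crash from a second terminal:

# Terminal 1: run the server with verbose logging, keeping a copy for triage
OLLAMA_DEBUG=1 ollama serve 2>&1 | tee /tmp/ollama-debug.log

# Terminal 2: trigger the crash
ollama run hf.co/ibm-granite/granite-4.0-1b-GGUF:Q4_K_M "Hello"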


GiteaMirror added the bug label 2026-04-12 22:42:01 -05:00

@gabe-l-hart commented on GitHub (Apr 6, 2026):

@kndtran Thanks for raising this. The assertion you're seeing (Assertion failed: (found), function llama_sampler_dist_apply, file llama-sampling.cpp, line 660.) fires when the logits end up containing NaNs or Infs, which happens when the underlying model precision overflows. We've seen this a lot with these small Granite models in F16. It's an intermittent problem since it's data-dependent, and it also depends on which specific tensors are kept in which precision (note: Q4_K_M keeps some tensors in "native" format, which defaults to F16).
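
As a quick illustration of the range issue (a sketch using numpy's float16 as a stand-in for GGML's F16; F16's largest finite value is 65504, while BF16 keeps F32's much wider exponent range):

# An F16 overflow turns a large intermediate value into Inf, and a
# downstream op such as Inf - Inf then yields NaN, which poisons the logits.
python3 -c '
import numpy as np
a = np.float16(60000) * np.float16(2)      # exceeds 65504 -> inf
print(a)                                   # inf
print(a - a)                               # nan (inf - inf is undefined)
print(np.float32(60000) * np.float32(2))   # 120000.0, fine in F32/BF16 range
'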

The other thing likely contributing here (and the likely reason the official version does not fail) is that a model run directly from HF will not have the right chat template and will therefore produce more out-of-distribution activations. Ollama uses Go templates, not Jinja2, so when pulling from HF directly it tries to map to an appropriate Go template. AFAIK we've never added the official Granite 4 Go templates to the set of defaults (something we should probably do), so it's likely picking up the wrong Go template and generating unexpected token sequences.


@kndtran commented on GitHub (Apr 6, 2026):

I tested the two potential issues:

  1. Wrong chat template
    1. I used the template from the official model in Ollama: ollama show granite4:1b --template 2>&1 > /tmp/granite4_template.txt
    2. It still crashes with the correct Go chat template.
  2. Precision overflow
    1. I tested these quantizations from HF: Q4_K_M, Q8_0, BF16, and F16.
    2. Only BF16 works; the others all crash in the same way.

So it seems to be an F16 overflow issue, not a chat template issue.
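
For reference, a sweep like the one above can be scripted roughly as follows (a sketch; the tag names are the quantizations mentioned in this thread):

# Try each quantization tag once and note which ones abort
for q in Q4_K_M Q8_0 F16 BF16; do
  echo "=== $q ==="
  ollama run "hf.co/ibm-granite/granite-4.0-1b-GGUF:$q" "Hello" || echo "crashed: $q"
done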


@gabe-l-hart commented on GitHub (Apr 6, 2026):

@kndtran ok great, thanks for validating. Unfortunately, this isn't a bug we can tackle here since it has to do with precision issues in ggml ops on a per-backend basis.

It looks like you're hitting this on Metal. IIRC, you can also mitigate the overflow by keeping some layers off the Metal kernel (/set parameter num_gpu X where X is less than the total). Are you able to test with different num_gpu values easily?
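
For reference, testing that suggestion from the interactive REPL might look like this (a sketch; /set parameter num_gpu is the REPL command mentioned above, and 35 is just an example value below the 41-layer total):

ollama run hf.co/ibm-granite/granite-4.0-1b-GGUF:Q4_K_M
# then, at the REPL prompt:
#   >>> /set parameter num_gpu 35
#   >>> Hello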

The other option for us (IBM) to pursue would be to rebuild the quantized versions off of the BF16 GGUF rather than the F16 GGUF. Cc @mrutkows


@kndtran commented on GitHub (Apr 7, 2026):

Well, that parameter is poorly named. TIL it's the number of layers offloaded to the GPU.

I did a few quick runs:

  • num_gpu=35 works,
  • num_gpu=36 crashes,
  • num_gpu=0 works, and
  • num_gpu=41 (the default) crashes.

This workaround seems okay: I'm getting about 77 tok/s (Q4_K_M with the workaround) vs. 76 tok/s for BF16 fully offloaded to the GPU. Of course, I would prefer a quantized version that doesn't need a workaround.

Please let me know if you decide to rebuild the quantized versions off BF16; I will need to rerun my experiments. For now I will use the workaround, since a fix in ggml seems far away given that it's a precision issue.
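
One way to persist this workaround so every run inherits it would be a small Modelfile (a sketch; the model name granite4-1b-q4-fix is just an example, and PARAMETER num_gpu is Ollama's documented way to pin the layer-offload count):

# Bake the reduced GPU offload into a local model
cat > Modelfile <<'EOF'
FROM hf.co/ibm-granite/granite-4.0-1b-GGUF:Q4_K_M
PARAMETER num_gpu 35
EOF
ollama create granite4-1b-q4-fix -f Modelfile
ollama run granite4-1b-q4-fix "Hello"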


@mrutkows commented on GitHub (Apr 9, 2026):

> Please let me know if you decide to rebuild the quantized versions off BF16; I will need to rerun my experiments. For now I will use the workaround, since a fix in ggml seems far away given that it's a precision issue.

Hi @kndtran, thanks so much for your feedback and for raising this issue, which we have unfortunately seen ourselves on various hardware (and with various OS-native math library versions).

Gabe and I have discussed switching to BF16 and avoiding F16 for some time, but this would require refreshing all quants derived from the base BF16 conversion, which would have wide-scale impact on everything downstream of the source HF GGUF repo where we publish the models. That said, we are putting this high on our radar so we can agree on a plan that everyone can support.


@planetf1 commented on GitHub (Apr 9, 2026):

Just seen the same issue running with phi:2.7b (albeit a very old model).


@gabe-l-hart commented on GitHub (Apr 10, 2026):

@planetf1 Interesting! Looking at that model (https://ollama.com/library/phi:2.7b/blobs/04778965089b), I see that the base precision for the unquantized tensors is F32, not F16, so I suspect it's a different underlying issue resulting in the same assertion error. That assertion error comes up any time a NaN or an Inf shows up in the GGML math, resulting in output logits that fail in sampling.
