[GH-ISSUE #13572] Ollama embedding model bge-m3 produces NaN output in some seemingly unrelated cases #34697

Closed
opened 2026-04-22 18:27:41 -05:00 by GiteaMirror · 17 comments

Originally created by @JuergenMS on GitHub (Dec 26, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/13572

What is the issue?

The following command produces NaN:
curl http://localhost:11434/api/embed -d '{"model":"bge-m3","input":"Titel: Geschichte der Quantentheorie -- Thema: Wissenschaft- und Technikgeschichte - Physik -- Anmerkungen: Darstellung der Strukturen der Quantentheorie von Plank bis zu den Anfängen der Quantenfeldtheorie"}'
{"error":"failed to encode response: json: unsupported value: NaN"}

Removing the last word "Quantentheorie" or replacing it with "A" returns an embedding.
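
The error text has the same shape as the message Go's encoding/json package produces for a NaN float ("json: unsupported value: NaN"). JSON has no NaN literal, so a response containing one cannot be serialized at all. A rough Python analogue follows; Python only rejects NaN when `allow_nan=False` is passed, and `encode_embedding` is a hypothetical helper, so this illustrates the constraint rather than Ollama's actual code path:

```python
import json
import math


def encode_embedding(values):
    """Serialize an embedding as strict JSON, or return an error payload."""
    try:
        # allow_nan=False makes json.dumps refuse NaN and +/-Infinity.
        return json.dumps({"embeddings": [values]}, allow_nan=False)
    except ValueError as exc:
        return json.dumps({"error": f"failed to encode response: {exc}"})


ok = encode_embedding([0.25, -0.5])
bad = encode_embedding([0.25, math.nan])
print(ok)
print(bad)
```

A single NaN component is enough to turn the whole response into an error, which matches the all-or-nothing behavior reported here.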

This behavior occurs with 76 texts out of 1217.
I cannot detect any pattern. I checked for special characters and reinstalled the model, with no effect. The 76 cases are in either English or German and have different lengths.
I attached the commands and the debug logs for the success case and the NaN case.
Ollama version is 0.13.5; bge-m3 is latest.
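
To find which inputs in a corpus fail, a small scan along these lines might help. It is a sketch, not the reporter's script: `post_embed` and `find_failures` are hypothetical names, the URL and model are taken from the command above, and it assumes that a failing request comes back with a JSON body containing an "error" key.

```python
import json
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embed"  # endpoint from the report


def post_embed(text, model="bge-m3", url=OLLAMA_URL):
    """POST one text to /api/embed and return the decoded JSON body."""
    payload = json.dumps({"model": model, "input": text}).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(request) as response:
            return json.loads(response.read())
    except urllib.error.HTTPError as err:
        # Assumption: error responses carry a JSON body like {"error": "..."}.
        raw = err.read().decode("utf-8", errors="replace")
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            return {"error": raw}


def find_failures(texts, post=post_embed):
    """Return (index, text, error) for every input that yields an error."""
    failures = []
    for index, text in enumerate(texts):
        body = post(text)
        if "error" in body:
            failures.append((index, text, body["error"]))
    return failures
```

Calling `find_failures(texts)` against a running server returns the failing inputs with their positions; the `post` parameter exists so the loop can be exercised without a server.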

Ollama-OK.txt: https://github.com/user-attachments/files/24350523/Ollama-OK.txt
Ollama-Nan.txt: https://github.com/user-attachments/files/24350522/Ollama-Nan.txt

Relevant log output


OS

No response

GPU

No response

CPU

No response

Ollama version

No response

GiteaMirror added the bug label 2026-04-22 18:27:41 -05:00

@JuergenMS commented on GitHub (Dec 26, 2025):

btw: other models like nomic-embed-text:latest work but produce lower quality results since they do not support German very well.


@anumukul commented on GitHub (Dec 26, 2025):

Hi, I’d like to take this issue and can deliver a fix within 24 hours.
I’ve worked on similar projects before and have relevant experience, so I should be able to handle this efficiently.


@JuergenMS commented on GitHub (Dec 27, 2025):

Thank you very much for your help!!


@JuergenMS commented on GitHub (Jan 18, 2026):

I retested this bug with the same data and config using Ollama 0.14.3-rc0. I still get the same NaN results.


@krusmir commented on GitHub (Jan 20, 2026):

Root Cause Analysis

I traced this bug to the F32→F16 cast in llama.cpp/src/llama-graph.cpp (lines 1431-1437).

When flash attention is enabled for embedding models (which don't use KV cache), K and V tensors are F32 and get cast to F16 before ggml_flash_attn_ext:

// this can happen when KV cache is not used (e.g. an embedding model with non-causal attn)
if (k->type == GGML_TYPE_F32) {
    k = ggml_cast(ctx0, k, GGML_TYPE_F16);
}
if (v->type == GGML_TYPE_F32) {
    v = ggml_cast(ctx0, v, GGML_TYPE_F16);
}

The largest finite F16 value is 65504. Values that exceed this during the cast overflow to Inf, which then propagates through softmax and becomes NaN.
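
The overflow-to-NaN chain can be reproduced in a few lines. This is a toy model under stated assumptions, not the llama.cpp kernel: `cast_to_f16` and `softmax` are illustrative helpers, the attention scores are reduced to scalars, and the cast is modeled as saturating to infinity (Python's struct module raises on out-of-range half-precision values, so the sketch converts that error to Inf).

```python
import math
import struct

F16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def cast_to_f16(x):
    """Round a float to half precision; out-of-range values become +/-Inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the F16 range; the cast is
        # modeled here as saturating to infinity instead.
        return math.copysign(math.inf, x)


def softmax(scores):
    """Softmax with the usual max-subtraction for numerical stability."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


k_f32 = [70000.0, 1.5]                   # one value above F16_MAX
k_f16 = [cast_to_f16(v) for v in k_f32]  # first entry overflows to Inf
scores = [1.0 * k for k in k_f16]        # toy "q * k" with q = 1.0
probs = softmax(scores)                  # Inf - Inf inside softmax gives NaN
print(k_f16, probs)
```

Because the maximum score is Inf, the stabilizing subtraction computes Inf - Inf, and the resulting NaN contaminates the normalizing sum, so every output probability becomes NaN, not just the one that overflowed.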

Proposed Fix

Skip flash attention when K is F32 (embedding models without KV cache):

const bool use_flash_attn = cparams.flash_attn && kq_b == nullptr && k->type != GGML_TYPE_F32;

This makes embedding models use the standard attention path, which keeps F32 precision throughout.

Workaround

Setting OLLAMA_FLASH_ATTENTION=false resolves the issue by disabling flash attention entirely.


@JuergenMS commented on GitHub (Jan 20, 2026):

I run Ollama as a systemd service and set "Environment=OLLAMA_FLASH_ATTENTION=0" in the override config. I still get the same error.


@krusmir commented on GitHub (Jan 20, 2026):

Confirmed: Bug introduced in v0.13.5

I tested several Ollama versions (v0.13.3, v0.13.4, v0.13.5, and up to v0.14.0); the bug does not occur in versions before v0.13.5. It was introduced in v0.13.5, when "bert" was added to the architectures that auto-enable flash attention.

Version comparison:

| Version | bert in FlashAttention() | Bug Present |
|---------|--------------------------|-------------|
| v0.13.3 | No | No |
| v0.13.4 | No | No |
| v0.13.5 | Yes | Yes |
| v0.14.0 | Yes | Yes |
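
A quick way to apply the table above to an installed version string is sketched below. The helper names are hypothetical, pre-release suffixes such as "-rc0" are ignored, and the check only encodes the lower bound from the table; it says nothing about whether later releases change the behavior.

```python
import re

FIRST_AFFECTED = (0, 13, 5)  # per the table above; no upper bound is implied


def parse_version(text):
    """Extract (major, minor, patch) from strings like 'v0.14.3-rc0'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def bert_flash_attention_auto_enabled(version_text):
    """True if the version is at or after the first one listed as affected."""
    return parse_version(version_text) >= FIRST_AFFECTED


for candidate in ("v0.13.4", "v0.13.5", "0.14.3-rc0"):
    print(candidate, bert_flash_attention_auto_enabled(candidate))
```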

I tested both texts (yours and one I additionally found) on v0.13.5 and confirmed the NaN error:

# English text
curl -X POST "http://localhost:11434/api/embed" \
  -H "Content-Type: application/json" \
  -d '{"model":"bge-m3","input":"To train models for this task, there are currently two large datasets available to the community,"}'
# Returns: {"error":"failed to encode response: json: unsupported value: NaN"}

# German text (your example)
curl -X POST "http://localhost:11434/api/embed" \
  -H "Content-Type: application/json" \
  -d '{"model":"bge-m3","input":"Titel: Geschichte der Quantentheorie -- Thema: Wissenschaft- und Technikgeschichte - Physik -- Anmerkungen: Darstellung der Strukturen der Quantentheorie von Plank bis zu den Anfängen der Quantenfeldtheorie"}'
# Returns: {"error":"failed to encode response: json: unsupported value: NaN"}

For your issue with OLLAMA_FLASH_ATTENTION=0 not working:

  1. Verify the env var is actually loaded:

    systemctl show ollama --property=Environment

    or

    sudo cat /proc/$(pgrep -o ollama)/environ | tr '\0' '\n' | grep FLASH

  2. Try these alternatives:

    • Use false instead of 0: OLLAMA_FLASH_ATTENTION=false
    • Make sure you ran systemctl daemon-reload after editing the override, then restarted the service

  3. What GPU are you using? The bug may have GPU-specific behavior.

  4. Alternative workaround: Downgrade to v0.13.4, where bert had not yet been added to the flash attention list.
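
The environment check from step 1 can also be done programmatically. The sketch below only decodes the NUL-separated format of /proc/<pid>/environ; `parse_environ` and `read_process_env` are hypothetical helper names, and reading another user's process environment typically requires root.

```python
def parse_environ(raw):
    """Decode the NUL-separated bytes of /proc/<pid>/environ into a dict."""
    env = {}
    for entry in raw.split(b"\0"):
        if b"=" not in entry:
            continue  # skip the empty trailing entry and malformed items
        key, _, value = entry.partition(b"=")
        env[key.decode("utf-8", "replace")] = value.decode("utf-8", "replace")
    return env


def read_process_env(pid):
    """Read and decode the environment of a running process (Linux only)."""
    with open(f"/proc/{pid}/environ", "rb") as handle:
        return parse_environ(handle.read())


# Sample bytes in the same layout the kernel exposes.
sample = b"PATH=/usr/bin\0OLLAMA_FLASH_ATTENTION=false\0"
print(parse_environ(sample).get("OLLAMA_FLASH_ATTENTION", "<unset>"))
```

If the variable is reported as unset for the server process, the systemd override was not picked up, regardless of which value was written in it.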


Root Cause (for reference)

In v0.13.5, "bert" was added to FlashAttention() in fs/ggml/ggml.go:

func (f GGML) FlashAttention() bool {
    return slices.Contains([]string{
        "bert",  // <-- Added in v0.13.5, causes the bug for bge-m3
        "gemma3",
        ...
    }, f.KV().String("general.architecture"))
}

This auto-enables flash attention for BERT-based models like bge-m3, triggering the F32→F16 overflow issue I described in my previous comment.


@sleepyddl commented on GitHub (Jan 21, 2026):

With OLLAMA_FLASH_ATTENTION=false, it works on my computer. My version is 0.14.2.


@morioka commented on GitHub (Jan 21, 2026):

With OLLAMA_FLASH_ATTENTION=false, jeffh/intfloat-multilingual-e5-large:f16 works well on 0.14.3. Thank you.


@JuergenMS commented on GitHub (Jan 21, 2026):

It is also working on my machine. Thank you!


@danielguerra69 commented on GitHub (Jan 27, 2026):

When I use OLLAMA_FLASH_ATTENTION=false, Ollama uses more memory for non-embedding models.


@danielguerra69 commented on GitHub (Jan 27, 2026):

OK, for me it was a memory issue. I was running qwen3-embedding:8b-fp16 on 2×16 GB GPUs; it uses 17 GB, which should be no problem on a 32 GB system. Then I went back to a smaller model, embeddinggemma:300m, and had no problem. Now I use qwen3-embedding:4b-fp16, which uses 9.6 GB, so it fits on one card. That solved the problem for me without setting the FLASH_ATTENTION environment variable (as I said, that setting increases memory use by about 30% on most regular models).


@shihkauskas commented on GitHub (Mar 3, 2026):

OLLAMA_FLASH_ATTENTION=false

works with Ollama version 0.17.0.


@telunyang commented on GitHub (Mar 9, 2026):

In my case, after setting OLLAMA_FLASH_ATTENTION=false, gpt-oss:20b shows 100% CPU in the PROCESSOR column. Has anyone else encountered this issue?


@JuergenMS commented on GitHub (Mar 15, 2026):

After I upgraded to 0.18.0, I get the error again (even with FLASH_ATTENTION=false). Since I could not downgrade to 0.14.3-rc1 again, I tried 0.14.3 and even 0.13.4 (where bert had not yet been included).
I am rather lost and am considering moving away from Ollama altogether. Does anybody know what else I could try?


@lukestanley commented on GitHub (Mar 18, 2026):

@JuergenMS I had the same problem with an older version. After upgrading to 0.18.1, the problem was solved in all the tests I ran, without turning off Flash Attention.


@JuergenMS commented on GitHub (Mar 18, 2026):

@lukestanley Thank you for your comment.
0.18.1 now works with flash attention set to false; however, it still produces the error when it is set to true.


Reference: github-starred/ollama#34697