[GH-ISSUE #15582] bge-m3 returns HTTP 500 with json: unsupported value: NaN when embedding certain markdown files. #72008

Open
opened 2026-05-05 03:17:55 -05:00 by GiteaMirror · 4 comments

Originally created by @TadMSTR on GitHub (Apr 14, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15582

What is the issue?

bge-m3 returns HTTP 500 with json: unsupported value: NaN when embedding certain markdown files.

The file is valid UTF-8 markdown (~1.9 KB) with standard YAML frontmatter. Other files in the same corpus embed without issue. The same file consistently triggers the error across restarts — it's deterministic, not a transient failure.

The error suggests the model produced a NaN value in the embedding vector that Ollama's JSON serializer cannot encode. This is model-level numerical instability for specific inputs, not a client or file issue.
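
For context on the error string itself: strict JSON has no NaN literal, so any conformant serializer must reject it. A minimal illustration in Python (the server is Go, whose encoding/json produces the exact message above, but the behavior is the same):

```python
import json, math

# Strict JSON cannot represent NaN. Python's encoder only emits it via the
# non-standard allow_nan=True default; with allow_nan=False it fails the
# same way Go's encoding/json does when Ollama serializes the vector.
try:
    json.dumps({"embedding": [0.12, math.nan, 0.34]}, allow_nan=False)
except ValueError as e:
    print(e)  # Out of range float values are not JSON compliant
```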

Error:
ollama._types.ResponseError: failed to encode response: json: unsupported value: NaN (status code: 500)

Reproducer:

```python
import ollama

client = ollama.Client(host="http://your-ollama-host")
with open("triggering_file.md", encoding="utf-8") as f:
    text = f.read()
result = client.embed(model="bge-m3", input=[text])  # raises ollama.ResponseError
```

Expected: a valid embedding vector, or a descriptive error identifying the problematic input.
Actual: HTTP 500; the calling process/watcher crashes.
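
For anyone scripting around this in the meantime, a variant of the reproducer that catches the failure instead of crashing (a sketch; ollama.ResponseError is the exception shown in the traceback above):

```python
import ollama

client = ollama.Client(host="http://your-ollama-host")
with open("triggering_file.md", encoding="utf-8") as f:
    text = f.read()

try:
    result = client.embed(model="bge-m3", input=[text])
    print(f"ok: {len(result['embeddings'][0])} dimensions")
except ollama.ResponseError as e:
    # On affected inputs this prints the HTTP 500 described above
    # instead of killing the watcher process.
    print(f"embed failed: {e.error} (status {e.status_code})")
```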

Relevant log output

```
failed to encode response: json: unsupported value: NaN (status code: 500)
```

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.20.2

GiteaMirror added the bug label 2026-05-05 03:17:55 -05:00

@dominicx commented on GitHub (Apr 16, 2026):

also in 0.20.3


@PureBlissAK commented on GitHub (Apr 18, 2026):

🤖 Automated Triage & Analysis Report

Issue: #15582
Analyzed: 2026-04-18T18:19:27.909844

Analysis

  • Type: unknown
  • Severity: medium
  • Components: unknown

Implementation Plan

  • Effort: medium
  • Steps:

This issue has been triaged and marked for implementation.


@jaredtrobinson-dotcom commented on GitHub (May 4, 2026):

Adding a diagnostic data point that I haven't seen in the thread: this is GPU-specific. Forcing CPU via num_gpu: 0 makes the failure go away on the same texts.

Repro

Beast: i7-12700K + RTX 4070 SUPER 12GB, ollama 0.23.0, NVIDIA driver via WSL2 / Windows host.

```python
import requests

OLLAMA = "http://192.168.68.50:11434"

# Five sentences that deterministically NaN on bge-m3 with default (GPU) options:
TEXTS = [
    "Clawman uses Letta with persistent memory blocks and routes inference through Pi LiteLLM.",
    "The Eiffel Tower in Paris was completed in 1889 and stands 330 meters tall.",
    "Docker containers share the host kernel but isolate filesystems and processes.",
    "The speed of light in a vacuum is approximately 299,792,458 meters per second.",
    "Tokyo is the largest metropolitan area in the world by population.",
]

def probe(text, opts=None):
    p = {"model": "bge-m3:latest", "prompt": text}
    if opts: p["options"] = opts
    r = requests.post(f"{OLLAMA}/api/embeddings", json=p, timeout=60)
    return f"HTTP {r.status_code}"

for t in TEXTS:
    gpu = probe(t, {})                  # default
    cpu = probe(t, {"num_gpu": 0})      # force CPU
    print(f"  GPU={gpu}  CPU={cpu}  | {t[:60]}")
```

Output:

```
  GPU=HTTP 500  CPU=HTTP 200  | Clawman uses Letta with persistent memory blocks ...
  GPU=HTTP 500  CPU=HTTP 200  | The Eiffel Tower in Paris was completed in 1889 ...
  GPU=HTTP 500  CPU=HTTP 200  | Docker containers share the host kernel but iso ...
  GPU=HTTP 500  CPU=HTTP 200  | The speed of light in a vacuum is approximately ...
  GPU=HTTP 500  CPU=HTTP 200  | Tokyo is the largest metropolitan area in the wo ...
```

Bisection

| Variable changed | Result | Conclusion |
|---|---|---|
| Same text 5× (deterministic check) | All 500 | not transient/race |
| /api/embeddings vs /api/embed | both 500 | not endpoint-specific |
| 2249-char benign repeat ("The quick brown fox …" × 50) | OK | not a long-sequence overflow |
| nomic-embed-text:latest on the same failing texts | clean | ollama runtime is fine |
| qwen3-embedding:8b on the same failing texts | clean | ollama runtime is fine |
| hf.co/gpustack/bge-m3-GGUF:Q8_0 on the same failing texts | identical NaN | not F16-weight-quant |
| flash_attn: true / false | both NaN | not flash-attn-related |
| num_thread, num_ctx variations | all NaN | not thread/context-related |
| num_gpu: 0 | clean 6/6 | GPU CUDA path is the trigger |
| Clean A/B with forced reload between calls | GPU 5/6 NaN, CPU 6/6 OK | confirms |

So:

  • Not the model — same weights produce valid embeddings on CPU.
  • Not the quant — Q8_0 from gpustack/bge-m3-GGUF fails identically to the default F16, and CPU works in either quant.
  • Not the ollama runtime — nomic-embed-text and qwen3-embedding:8b on the same daemon, same hardware, same texts all clean.
  • Yes the bge-m3 CUDA inference path. Most likely an F16 attention-softmax overflow → inf → inf−inf → NaN that the CPU path handles fine; it hits on certain attention patterns in bge-m3's 24-layer XLM-RoBERTa-derived encoder. A minimal sketch of the suspected overflow follows below.
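
A minimal NumPy sketch of that suspected failure mode (illustrative only; this is ordinary fp16 softmax overflow, not a dump of the actual CUDA kernel):

```python
import numpy as np

# In float16, exp(x) overflows to inf once x > ln(65504) ≈ 11.09.
# A naive softmax then divides inf by inf and yields NaN; the standard
# max-subtracted form is finite on the same logits.
logits = np.array([11.0, 12.0, 10.0], dtype=np.float16)

e = np.exp(logits)                  # exp(12) -> inf in fp16
print(e / e.sum())                  # [ 0. nan  0.]  <- NaN poisons the output

e = np.exp(logits - logits.max())   # shift so the largest exponent is 0
print(e / e.sum())                  # [~0.245 ~0.665 ~0.090], all finite
```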

Workaround for users hitting this

Bake num_gpu: 0 into a Modelfile alias and route embedding traffic to the alias:

```
POST /api/create
{
  "model": "bge-m3-cpu",
  "from": "bge-m3:latest",
  "parameters": { "num_gpu": 0 }
}
```

CPU bge-m3 throughput on this hardware (warm, single-thread default): 185 ms/chunk vs 109 ms/chunk on GPU when GPU works. 1.7× slower, but reliable. Acceptable for save-time embedding workloads (Letta archival memory, RAG indexing, etc.).

Possibly related

  • #14657 (bge-m3 only returns NaN on bitcoin whitepaper, other docs) — same model, different content, almost certainly the same root cause.
  • #14739 (server: handle NaN values in embedding responses) — defense-in-depth: the runtime should sanitize NaN into a defined error rather than 500-ing the JSON encode; a sketch of that check follows this list.
  • #12921 (qwen3-embedding:8b-fp16 embedding failure) — different model, same class of GPU embedding NaN. Worth checking whether they share a kernel.
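
What the #14739 fix could look like, sketched in Python for readability (the real check would live in the Go server just before JSON encoding):

```python
import math

def sanitize_embeddings(batch):
    # Defense-in-depth: surface a descriptive, catchable error instead of
    # letting the JSON encoder choke on NaN/inf at response time.
    for i, vec in enumerate(batch):
        if not all(math.isfinite(x) for x in vec):
            raise ValueError(
                f"embedding {i} contains non-finite values "
                f"(model produced NaN/inf for this input)"
            )
    return batch
```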

If a maintainer wants more bisection — e.g., a specific attention-layer dump or whether --batch-size 1 vs higher matters — happy to run more probes. The reproducer is fast (each call is sub-second on a warm model).


@seppel123 commented on GitHub (May 5, 2026):

Will this be fixed soon, or do I have to change the embedding model?

Ollama embed error (500): failed to encode response: json: unsupported value: NaN

Ollama Version: 0.22.0
