[GH-ISSUE #15609] BERT-derived embedding models produce incorrect embeddings for non-ASCII text (strip_accents preprocessing dropped in gguf conversion) #56475

Open
opened 2026-04-29 10:52:32 -05:00 by GiteaMirror · 3 comments
Owner

Originally created by @EmmaLeonhart on GitHub (Apr 15, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15609

What is the issue?

All three BERT-derived embedding models I tested in Ollama produce incorrect embeddings for text containing combining diacritics, while the same models via the HuggingFace transformers pipeline behave correctly. The defect is not model-specific — it appears to be a systematic loss of BasicTokenizer's strip_accents preprocessing somewhere in the HF→gguf conversion path (likely in llama.cpp's convert_hf_to_gguf.py, surfaced to end-users via Ollama).

Unrelated words containing diacritics collapse to cosine similarity ≈ 1.0 because they all tokenize to [CLS] [UNK] [SEP]. Same-word ASCII-vs-diacritic pairs (e.g. Hokkaidō ↔ Hokkaido) no longer cluster.

This affects every production RAG/search/clustering deployment using these models on any multilingual or Unicode-containing corpus.

Affected models (verified)

| Model | Same-word diac↔ASCII (should be high) | Unrelated diac↔diac (should be low) | ASCII control | Failure mode |
|---|---|---|---|---|
| mxbai-embed-large | 0.511 | **0.904** | 0.513 | Full UNK collapse |
| nomic-embed-text | 0.888 | **0.992** | 0.426 | Diacritic attractor |
| all-minilm | 0.240 | **0.875** | 0.214 | Broken ASCII equivalence + attractor |

The "Unrelated diac↔diac" column is the headline: unrelated words like Hokkaidō and Éire are embedded as near-identical vectors.

Reproduction

Minimal repro (Python, requires ollama package and the models pulled locally):

```python
import ollama, numpy as np

def embed(model, text):
    r = ollama.embeddings(model=model, prompt=text)
    v = np.array(r["embedding"])
    return v / np.linalg.norm(v)

def cos(a, b): return float(a @ b)

for model in ["mxbai-embed-large", "nomic-embed-text", "all-minilm"]:
    h_diac  = embed(model, "Hokkaidō")
    h_ascii = embed(model, "Hokkaido")
    eire    = embed(model, "Éire")
    print(f"{model}: Hokkaidō↔Hokkaido={cos(h_diac, h_ascii):.3f}  "
          f"Hokkaidō↔Éire={cos(h_diac, eire):.3f}")
```

Expected: same-word pair ≈ 1.0, unrelated pair ≈ low.
Actual: unrelated pair ≈ 0.9–1.0.
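
For contrast, the same strings through the upstream HF checkpoint behave as expected. This is a minimal comparison sketch, not part of the original repro; it assumes sentence-transformers is installed and that mixedbread-ai/mxbai-embed-large-v1 is the upstream checkpoint for mxbai-embed-large.

```python
# Hedged comparison sketch: same similarity check via the HF pipeline,
# where BasicTokenizer's strip_accents preprocessing is applied.
from sentence_transformers import SentenceTransformer

st = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
vecs = st.encode(["Hokkaidō", "Hokkaido", "Éire"], normalize_embeddings=True)
print("Hokkaidō↔Hokkaido:", float(vecs[0] @ vecs[1]))  # close to 1.0 here
print("Hokkaidō↔Éire:    ", float(vecs[0] @ vecs[2]))  # clearly lower here
```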

Full verification pipeline and 147k-pair empirical analysis:

- Repro script: https://github.com/emmaleonhart/latent-space-cartography/blob/main/scripts/verify_tokenizer_divergence.py
- Writeup with full tables + mechanism walkthrough: https://emmaleonhart.github.io/latent-space-cartography/

Likely root cause

The upstream HuggingFace tokenizers for these models use BasicTokenizer with do_lower_case=True, which (by default) applies NFD normalization and strips combining diacritical marks before WordPiece. Verified directly against mixedbread-ai/mxbai-embed-large-v1:

```
Hokkaidō  ->  ['hokkaido']   # identical to Hokkaido
Éire      ->  ['e', '##ire'] # identical to Eire
Zürich    ->  ['zurich']     # identical to Zurich
```
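
The output above can be reproduced directly; a minimal sketch, assuming the transformers package is available locally:

```python
# Reproduces the HF-side tokenization shown above (assumes `transformers` is installed).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mixedbread-ai/mxbai-embed-large-v1")
for word in ["Hokkaidō", "Hokkaido", "Éire", "Eire", "Zürich", "Zurich"]:
    print(f"{word:10s} -> {tok.tokenize(word)}")
```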

The gguf-converted tokenizer used by Ollama does not appear to apply this preprocessing. Raw diacritics hit WordPiece, miss the vocab, and fall back to [UNK]. Because [UNK] is a fixed token, every short diacritic-only string that can't be WordPiece-split produces [CLS] [UNK] [SEP], which embeds to a single attractor point.
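
For reference, the preprocessing that goes missing is small and self-contained. A minimal stdlib sketch of what BasicTokenizer-style accent stripping does (NFD-normalize, then drop combining marks); this is an illustration of the behavior, not the converter's code:

```python
# Minimal sketch of BasicTokenizer-style accent stripping: NFD-normalize,
# then drop combining marks (Unicode category "Mn"). Standard library only.
import unicodedata

def strip_accents(text: str) -> str:
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_accents("Hokkaidō"), strip_accents("Éire"), strip_accents("Zürich"))
# -> Hokkaido Eire Zurich
```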

The fix is almost certainly in llama.cpp/convert_hf_to_gguf.py — preserving the BasicTokenizer.strip_accents behavior through conversion. I'd expect it to require writing a flag into the gguf metadata and honoring it at tokenization time, but I haven't traced the conversion code in detail and the maintainers will know better.
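
One way to sanity-check this hypothesis without tracing the converter is to list which tokenizer-related keys actually land in a converted file. A hedged sketch using the gguf package from llama.cpp's gguf-py; the file path is a placeholder:

```python
# Hedged sketch: list tokenizer-related metadata keys in a converted gguf file.
# Assumes the `gguf` Python package (llama.cpp's gguf-py); path is a placeholder.
from gguf import GGUFReader

reader = GGUFReader("mxbai-embed-large-v1.gguf")  # placeholder path
for name in reader.fields:
    if name.startswith("tokenizer."):
        print(name)
```

If nothing in that list encodes the accent-stripping/normalization behavior, the runtime has no way to reproduce BasicTokenizer's preprocessing, which matches the observed failure.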

Scope

Not tested exhaustively, but the mechanism implies any BERT-derived embedding model converted via this pipeline is affected. The three models above cover the most common production choices for local RAG.

Relevant log output


OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

0.17.1

GiteaMirror added the bug label 2026-04-29 10:52:32 -05:00

@Mati0kez commented on GitHub (Apr 16, 2026):

The root cause seems to be in the HF→gguf conversion path — BasicTokenizer.strip_accents preprocessing is being dropped. Looking into whether this is fixable in llama.cpp's convert_hf_to_gguf.py.


@MasterOfFeelingFish commented on GitHub (Apr 16, 2026):

Looking into this — the strip_accents preprocessing dropout in gguf conversion is likely fixable. Happy to submit a fix.


@PureBlissAK commented on GitHub (Apr 18, 2026):

🤖 Automated Triage & Analysis Report

Issue: #15609
Analyzed: 2026-04-18T18:19:23.517307

Analysis

  • Type: unknown
  • Severity: medium
  • Components: unknown

Implementation Plan

  • Effort: medium
  • Steps:

This issue has been triaged and marked for implementation.


Reference: github-starred/ollama#56475