[PR #15627] fix: preserve strip_accents preprocessing in BERT tokenizer conversion #25728

Open
opened 2026-04-19 18:24:42 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/15627
Author: @MasterOfFeelingFish
Created: 4/16/2026
Status: 🔄 Open

Base: main ← Head: clawoss/fix/bert-strip-accents


📝 Commits (1)

  • bc4fa62 fix: preserve strip_accents preprocessing in BERT tokenizer conversion

📊 Changes

7 files changed (+43 additions, -9 deletions)

View changed files

📝 convert/convert_bert.go (+1 -0)
📝 convert/convert_nomicbert.go (+1 -0)
📝 convert/tokenizer.go (+11 -2)
📝 model/models/bert/embed.go (+1 -1)
📝 model/models/nomicbert/model.go (+1 -0)
📝 tokenizer/wordpiece.go (+26 -5)
📝 tokenizer/wordpiece_test.go (+2 -1)

📄 Description

Summary

BERT-derived embedding models produce incorrect embeddings for non-ASCII text because the strip_accents preprocessing step is dropped during GGUF conversion.

Root Cause

The HuggingFace BasicTokenizer used by BERT models applies NFD normalization and strips combining diacritical marks (accents) before WordPiece tokenization. This preprocessing is essential: without it, words like Hokkaidō and Éire fail to match their ASCII equivalents (Hokkaido and Eire), because the retained diacritics cause the words to tokenize to [UNK].

Changes

  1. convert/tokenizer.go: Read the strip_accents field from tokenizer_config.json during tokenizer parsing
  2. convert/convert_bert.go & convert/convert_nomicbert.go: Store strip_accents in GGUF metadata
  3. tokenizer/wordpiece.go:
    • Add stripAccents field to WordPiece struct
    • Add stripAccents() helper function to remove combining diacritical marks (U+0300-U+036F)
    • Apply accent stripping in Encode() when stripAccents is enabled
  4. model/models/bert/embed.go & model/models/nomicbert/model.go: Pass strip_accents config to NewWordPiece()

Impact

This fix ensures that models like mxbai-embed-large, nomic-embed-text, and all-minilm correctly handle non-ASCII text, matching the behavior of the original HuggingFace models.

Fixes #15609


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-19 18:24:43 -05:00
Reference: github-starred/ollama#25728