[GH-ISSUE #15620] feat: add XLM-R embedding support + SentencePiece Unigram tokenizer + embedding model fixes #72026

Open
opened 2026-05-05 03:21:38 -05:00 by GiteaMirror · 1 comment

Originally created by @CrispStrobe on GitHub (Apr 16, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15620

Problem

Ollama currently supports BERT and nomic-bert for embedding models, but several popular embedding model families are missing or have compatibility issues:

  1. XLM-R models (multilingual-e5-small, arctic-embed-l-v2, PIXIE-Rune) cannot be loaded — there's no xlmr architecture.
  2. SentencePiece Unigram tokenizer is missing — the existing SentencePiece tokenizer uses BPE-style pairwise merge, which produces incorrect tokenization for Unigram models. This affects multilingual models.
  3. BERT GELU variant — BERT uses exact GELU (erf-based) but the current implementation uses the tanh approximation, causing cosine similarity to drop from 1.0 to ~0.996 vs HuggingFace.
  4. Nil-pointer crashes — Qwen3 models without QK-norm tensors (e.g. Jina v5) crash; Gemma3 embed models without Dense projection layers (e.g. Harrier-270M) crash.

Proposed Changes

1. Bugfixes (backwards compatible, no new dependencies)

  • model/models/qwen3/model.go: nil-guard for optional QK-norm (fixes crash with Jina v5 Nano and other Qwen-based models without QK-norm)
  • model/models/gemma3/embed.go: nil-guard for optional Dense projection layers (fixes crash with Harrier-270M)
  • model/models/bert/embed.go: use GELU_ERF instead of GELU (matches HuggingFace's exact GELU, improves cosine similarity from 0.996 to 1.000)
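The nil-guard fixes follow one pattern: a tensor that is optional in the checkpoint decodes to a nil pointer, so the forward pass must skip the op when it is absent. A hypothetical minimal sketch (simplified stand-in types, not Ollama's actual ones):

```go
package main

import "fmt"

// RMSNorm stands in for an optional normalization layer; a nil pointer means
// the GGUF shipped no weight tensor for it.
type RMSNorm struct{ weight []float32 }

func (n *RMSNorm) Forward(x []float32) []float32 {
	// normalization elided; identity for the sketch
	return x
}

type Attention struct {
	QNorm *RMSNorm // optional: some Qwen3 variants (e.g. Jina v5) omit QK-norm
}

func (a *Attention) project(q []float32) []float32 {
	// Guard: dereferencing the missing layer's weights would panic,
	// so apply QK-norm only when the tensor was actually loaded.
	if a.QNorm != nil {
		q = a.QNorm.Forward(q)
	}
	return q
}

func main() {
	a := &Attention{} // checkpoint without QK-norm tensors
	fmt.Println(a.project([]float32{1, 2, 3}))
}
```

The same guard shape applies to Gemma3's optional Dense projection layers.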

2. SentencePiece Unigram tokenizer

  • New tokenizer/sentencepiece_unigram.go: Viterbi DP tokenizer for SentencePiece Unigram models. The existing SentencePiece tokenizer uses greedy pairwise merge which is correct for BPE but wrong for Unigram models.
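The core of a Unigram tokenizer is a Viterbi dynamic program over the input: best[i] is the highest total log-probability of any segmentation of the first i bytes, and backpointers recover the pieces. A minimal sketch under simplifying assumptions (ASCII input, no normalization, byte fallback, or special tokens, and a made-up toy vocabulary):

```go
package main

import (
	"fmt"
	"math"
)

// unigramTokenize finds the max-score segmentation of text under a Unigram
// model: best[i] = max over j<i of best[j] + logprob(text[j:i]).
// vocab maps piece -> log-probability.
func unigramTokenize(text string, vocab map[string]float64) []string {
	n := len(text)
	best := make([]float64, n+1)
	back := make([]int, n+1)
	for i := 1; i <= n; i++ {
		best[i] = math.Inf(-1)
		for j := 0; j < i; j++ {
			score, ok := vocab[text[j:i]]
			if !ok {
				continue
			}
			if s := best[j] + score; s > best[i] {
				best[i], back[i] = s, j
			}
		}
	}
	// Walk the backpointers to reconstruct the winning pieces in order.
	var pieces []string
	for i := n; i > 0; i = back[i] {
		pieces = append([]string{text[back[i]:i]}, pieces...)
	}
	return pieces
}

func main() {
	vocab := map[string]float64{
		"un": -3, "believ": -5, "able": -4,
		// single-character fallback pieces with low probability
		"u": -10, "n": -10, "b": -10, "e": -10,
		"l": -10, "i": -10, "v": -10, "a": -10,
	}
	fmt.Println(unigramTokenize("unbelievable", vocab)) // [un believ able]
}
```

Greedy BPE-style pairwise merging can commit to a locally good merge that blocks the globally best split; the DP above considers all segmentations, which is why a separate tokenizer is needed for Unigram models.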

3. XLM-R embedding architecture

  • New model/models/xlmr/embed.go: XLM-RoBERTa encoder (like BERT but without type embeddings, with position offset, SentencePiece tokenizer)
  • model/models/bert/embed.go: extend tokenizer support to also accept "llama" (SentencePiece) and "gpt2" (BPE) in addition to "bert" (WordPiece)
  • model/models/models.go: register xlmr
  • fs/ggml/ggml.go: add xlmr to OllamaEngineRequired
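The "position offset" mentioned above refers to the RoBERTa-style convention XLM-R inherits: position ids start at pad_token_id + 1 (2 for XLM-R, whose pad id is 1), rather than at 0 as in BERT. A hypothetical helper illustrating just the offset:

```go
package main

import "fmt"

// positionIDs returns RoBERTa-style position ids for a sequence: the first
// real token sits at padID+1, since ids 0..padID are reserved.
// (Illustrative helper, not Ollama's actual code.)
func positionIDs(seqLen, padID int32) []int32 {
	ids := make([]int32, seqLen)
	for i := range ids {
		ids[i] = padID + 1 + int32(i)
	}
	return ids
}

func main() {
	fmt.Println(positionIDs(4, 1)) // XLM-R: positions 2..5 for a 4-token input
}
```

Using 0-based positions with XLM-R weights would look up the wrong rows of the position-embedding table, which is why the architecture cannot simply reuse the BERT forward pass.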

Testing

Verified against HuggingFace sentence-transformers with 13 embedding models:

| Model | Arch | Q8_0 cos vs HF |
|-------|------|----------------|
| all-MiniLM-L6-v2 | BERT | 0.9998 |
| gte-small | BERT | 0.9999 |
| arctic-embed-xs | BERT | 0.9999 |
| multilingual-e5-small | BERT+SP | 0.9999 |
| arctic-embed-l-v2 | XLM-R | loads, L2-norm=1.0 |
| PIXIE-Rune-v1 | XLM-R | cross-lingual OK |
| harrier-270m | Gemma3 | loads, L2-norm=1.0 |
| jina-v5-nano | Qwen3 | loads, L2-norm=1.0 |
| + 5 more Qwen3 models | Qwen3 | all pass |

All GGUFs available at huggingface.co/cstr.

Implementation

Branch: https://github.com/CrispStrobe/ollama/tree/feat/xlmr-embedding (7 files changed, 461 insertions, 4 deletions)

Happy to split into separate PRs (bugfixes / tokenizer / architecture) if preferred.


@PureBlissAK commented on GitHub (Apr 18, 2026):

🤖 Automated Triage & Analysis Report

Issue: #15620
Analyzed: 2026-04-18T18:19:43.238656

Analysis

  • Type: unknown
  • Severity: medium
  • Components: unknown

Implementation Plan

  • Effort: medium
  • Steps:

This issue has been triaged and marked for implementation.


Reference: github-starred/ollama#72026