[PR #15232] [MERGED] tokenizer: add byte fallback for SentencePiece BPE encoding #77383

Closed
opened 2026-05-05 10:03:31 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/15232
Author: @dhiltgen
Created: 4/2/2026
Status: Merged
Merged: 4/2/2026
Merged by: @dhiltgen

Base: main ← Head: bpe-byte-fallback


📝 Commits (2)

  • cc44f2d tokenizer: add byte fallback for SentencePiece BPE encoding
  • 0d569d2 tokenizer fixes

📊 Changes

3 files changed (+513 additions, -49 deletions)

➕ model/models/gemma4/tokenizer_reference_test.go (+341 -0)
📝 tokenizer/bytepairencoding.go (+55 -14)
📝 tokenizer/bytepairencoding_test.go (+117 -35)

📄 Description

When BPE merging produces tokens not in the vocabulary, fall back to encoding each UTF-8 byte as <0xHH> byte tokens instead of silently dropping the character. Also teach Decode to convert <0xHH> tokens back to raw bytes.
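The behaviour described above can be illustrated with a minimal Go sketch. It is not the code from tokenizer/bytepairencoding.go: the function names `encodeWithByteFallback` and `decodeToken`, the `map[string]int32` vocabulary, and the uppercase two-digit hex spelling of the byte tokens are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// encodeWithByteFallback is a hypothetical helper, not the PR's actual code.
// It returns the ID of piece when the vocabulary contains it; otherwise it
// emits one <0xHH> token ID per UTF-8 byte of piece.
func encodeWithByteFallback(piece string, vocab map[string]int32) []int32 {
	if id, ok := vocab[piece]; ok {
		return []int32{id}
	}
	var ids []int32
	// Indexing a Go string yields raw bytes, so a multi-byte rune
	// produces several byte tokens.
	for i := 0; i < len(piece); i++ {
		tok := fmt.Sprintf("<0x%02X>", piece[i]) // assumed spelling of byte tokens
		if id, ok := vocab[tok]; ok {
			ids = append(ids, id)
		}
		// If the byte token is also missing, this sketch emits nothing for it.
	}
	return ids
}

// decodeToken is a hypothetical helper that maps a <0xHH> token back to its
// raw byte and returns the text of any other token unchanged.
func decodeToken(tok string) []byte {
	if len(tok) == 6 && strings.HasPrefix(tok, "<0x") && strings.HasSuffix(tok, ">") {
		if b, err := strconv.ParseUint(tok[3:5], 16, 8); err == nil {
			return []byte{byte(b)}
		}
	}
	return []byte(tok)
}
```

In this sketch, a vocabulary that lacks "é" but contains `<0xC3>` and `<0xA9>` encodes "é" as those two byte tokens, and concatenating the output of `decodeToken` for them restores the original character.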

Fixes #15229, fixes #15231


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-05-05 10:03:31 -05:00

Reference: github-starred/ollama#77383