[PR #15162] [MERGED] tokenizer: add SentencePiece-style BPE support #25597

Closed
opened 2026-04-19 18:18:02 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/15162
Author: @dhiltgen
Created: 3/31/2026
Status: Merged
Merged: 4/1/2026
Merged by: @dhiltgen

Base: main ← Head: tokenizer_bpe


📝 Commits (2)

  • 0ec04e0 tokenizer: add SentencePiece-style BPE support
  • b837857 review comments

📊 Changes

2 files changed (+241 additions, -17 deletions)


📝 tokenizer/bytepairencoding.go (+61 -17)
📝 tokenizer/bytepairencoding_test.go (+180 -0)

📄 Description

Add WithSentencePieceNormalizer option to BytePairEncoding for models that use BPE with SentencePiece-style space markers (space to/from U+2581).

NewBytePairEncoding is unchanged; the new NewBytePairEncodingWithOptions constructor accepts BPEOption functions. Decoding handles the reverse mapping of U+2581 back to spaces.
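
The option pattern and the space mapping described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `toyBPE`, `newToyBPE`, `normalize`, and `denormalize` are hypothetical names, and the signatures given here for `BPEOption` and `WithSentencePieceNormalizer` are assumptions, since the description only names them. The sketch shows only the space to/from U+2581 mapping and leaves out merges and vocabulary lookup entirely.

```go
package main

import (
	"fmt"
	"strings"
)

// spaceMarker is U+2581 (LOWER ONE EIGHTH BLOCK), the character
// SentencePiece-style vocabularies use in place of a space.
const spaceMarker = "\u2581"

// toyBPE is a hypothetical stand-in for the real BytePairEncoding type.
type toyBPE struct {
	sentencePiece bool // when true, spaces are mapped to/from U+2581
}

// BPEOption borrows its name from the PR description; the signature
// used here is an assumption made for this sketch.
type BPEOption func(*toyBPE)

// WithSentencePieceNormalizer enables the space marker mapping.
// Whether the real option takes arguments is not stated in the PR text.
func WithSentencePieceNormalizer() BPEOption {
	return func(b *toyBPE) { b.sentencePiece = true }
}

// newToyBPE plays the role of a "with options" constructor:
// calling it with no options leaves the default behaviour untouched.
func newToyBPE(opts ...BPEOption) *toyBPE {
	b := &toyBPE{}
	for _, opt := range opts {
		opt(b)
	}
	return b
}

// normalize rewrites spaces to U+2581 before encoding (if enabled).
func (b *toyBPE) normalize(s string) string {
	if !b.sentencePiece {
		return s
	}
	return strings.ReplaceAll(s, " ", spaceMarker)
}

// denormalize applies the reverse mapping after decoding (if enabled).
func (b *toyBPE) denormalize(s string) string {
	if !b.sentencePiece {
		return s
	}
	return strings.ReplaceAll(s, spaceMarker, " ")
}

func main() {
	bpe := newToyBPE(WithSentencePieceNormalizer())
	marked := bpe.normalize("hello world")
	fmt.Println(marked)                  // the space is now U+2581
	fmt.Println(bpe.denormalize(marked)) // round-trips to the input
}
```

Keeping the existing constructor as-is and routing new behaviour through variadic options means callers that never pass an option see no change, which matches the statement that NewBytePairEncoding is unchanged.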


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-19 18:18:02 -05:00
Reference: github-starred/ollama#25597