[PR #14327] [MERGED] consolidate the tokenizer #40495

Closed
opened 2026-04-23 01:23:12 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/14327
Author: @pdevine
Created: 2/19/2026
Status: Merged
Merged: 2/19/2026
Merged by: @pdevine

Base: main ← Head: pdevine/tokenizer-consolidation


📝 Commits (1)

  • 4500d33 consolidate the tokenizer

📊 Changes

18 files changed (+1807 additions, -16 deletions)


📝 x/mlxrunner/model/base/base.go (+1 -1)
📝 x/mlxrunner/pipeline.go (+1 -10)
📝 x/mlxrunner/runner.go (+1 -1)
➕ x/mlxrunner/utf8_buffer.go (+47 -0)
➕ x/mlxrunner/utf8_buffer_test.go (+46 -0)
📝 x/models/gemma3/gemma3.go (+1 -1)
📝 x/models/glm4_moe_lite/glm4_moe_lite.go (+1 -1)
📝 x/models/llama/llama.go (+1 -1)
📝 x/models/qwen3/qwen3.go (+1 -1)
➕ x/tokenizer/tokenizer.go (+108 -0)
➕ x/tokenizer/tokenizer_benchmark_test.go (+251 -0)
➕ x/tokenizer/tokenizer_bpe.go (+175 -0)
➕ x/tokenizer/tokenizer_correctness_test.go (+137 -0)
➕ x/tokenizer/tokenizer_decode.go (+56 -0)
➕ x/tokenizer/tokenizer_encode.go (+289 -0)
➕ x/tokenizer/tokenizer_ggml_parity_test.go (+207 -0)
➕ x/tokenizer/tokenizer_load.go (+458 -0)
➕ x/tokenizer/tokenizer_load_test.go (+26 -0)

📄 Description

This change adds a new x/tokenizer package. The PR includes:

  • New BPE and SentencePiece tokenizers
  • Removal of the dependency on the imagegen tokenizers
  • Fixes to multibyte decoding in the pipeline
  • Various correctness and benchmark tests
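
For readers unfamiliar with BPE, the core merge step can be sketched as follows. This is a toy illustration, not the code in x/tokenizer/tokenizer_bpe.go: the function name `bpeEncode`, the `ranks` map, and the space-joined pair key are all assumptions made for the example, and production BPE tokenizers usually work on pretokenized byte sequences rather than on runes.

```go
package main

import (
	"fmt"
	"strings"
)

// bpeEncode is a hypothetical, simplified BPE encoder. It splits word into
// single-rune symbols, then repeatedly merges the adjacent pair with the
// lowest rank until no known pair remains.
// ranks maps "left right" (space-joined, an assumption of this sketch) to
// merge priority; lower values merge first.
func bpeEncode(word string, ranks map[string]int) []string {
	symbols := strings.Split(word, "") // one element per UTF-8 sequence
	for len(symbols) > 1 {
		best, bestRank := -1, 0
		for i := 0; i < len(symbols)-1; i++ {
			r, ok := ranks[symbols[i]+" "+symbols[i+1]]
			if ok && (best == -1 || r < bestRank) {
				best, bestRank = i, r
			}
		}
		if best == -1 {
			break // no adjacent pair has a merge rule
		}
		merged := symbols[best] + symbols[best+1]
		// Drop the right-hand symbol, then store the merged symbol in place.
		symbols = append(symbols[:best+1], symbols[best+2:]...)
		symbols[best] = merged
	}
	return symbols
}

func main() {
	// Made-up merge table for illustration only.
	ranks := map[string]int{"l o": 0, "lo w": 1, "e r": 2}
	fmt.Println(bpeEncode("lower", ranks))
}
```

With the made-up merge table above, "lower" is merged step by step into the two symbols "low" and "er". A real tokenizer would then map each symbol to a vocabulary ID.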

Not included in this PR is the WordPiece tokenizer for BERT models, which will be added when we add embedding models. The imagegen tokenizers themselves will be removed in a follow-up PR.
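
The multibyte decoding fix mentioned above concerns streamed output: a token boundary can fall in the middle of a UTF-8 sequence, so emitting each decoded chunk immediately can produce invalid text. The sketch below shows one way to hold back an incomplete trailing sequence. It is not the contents of x/mlxrunner/utf8_buffer.go; the type `utf8Buffer` and its methods are hypothetical names, and only the standard `unicode/utf8` package is used.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// utf8Buffer is a hypothetical helper that withholds trailing bytes which
// do not yet form a complete UTF-8 sequence.
type utf8Buffer struct {
	pending []byte
}

// Write appends chunk and returns the longest prefix that can be emitted
// without splitting a multibyte rune at the end.
func (b *utf8Buffer) Write(chunk []byte) string {
	b.pending = append(b.pending, chunk...)
	n := completePrefixLen(b.pending)
	out := string(b.pending[:n])
	b.pending = append(b.pending[:0], b.pending[n:]...)
	return out
}

// Flush returns whatever is still buffered, complete or not.
func (b *utf8Buffer) Flush() string {
	out := string(b.pending)
	b.pending = b.pending[:0]
	return out
}

// completePrefixLen scans back over at most utf8.UTFMax bytes to find the
// last rune-start byte. If the rune starting there is incomplete, everything
// before it is safe to emit; otherwise the whole slice is.
func completePrefixLen(p []byte) int {
	for i := len(p) - 1; i >= 0 && i >= len(p)-utf8.UTFMax; i-- {
		if utf8.RuneStart(p[i]) {
			if utf8.FullRune(p[i:]) {
				return len(p)
			}
			return i
		}
	}
	return len(p) // only stray continuation bytes; pass them through
}

func main() {
	var buf utf8Buffer
	// "€" is the three bytes E2 82 AC; here it is split across three chunks.
	chunks := [][]byte{{'a', 0xE2}, {0x82}, {0xAC, 'b'}}
	for _, c := range chunks {
		fmt.Printf("%q\n", buf.Write(c))
	}
	fmt.Printf("%q\n", buf.Flush())
}
```

In this example the first call emits only "a", the second emits nothing, and the third emits "€b" once the sequence is complete. Calling `Flush` at end of stream ensures that truncated input is still surfaced rather than silently dropped.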


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-23 01:23:12 -05:00
Reference: github-starred/ollama#40495