[PR #12325] [MERGED] multi-regexp pretokenizer #45033

Closed
opened 2026-04-25 00:43:28 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/12325
Author: @mxyng
Created: 9/18/2025
Status: Merged
Merged: 9/23/2025
Merged by: @mxyng

Base: main ← Head: mxyng/pretokenizer


📝 Commits (1)

  • 8eaf3e1 multi-regexp pretokenizer (https://github.com/ollama/ollama/commit/8eaf3e1ad6eebb2214a06ca2fc7cb9e8e03d70ee)

📊 Changes

12 files changed (+124 additions, -34 deletions)

View changed files

📝 model/bytepairencoding.go (+45 -9)
📝 model/bytepairencoding_test.go (+39 -1)
📝 model/models/gptoss/model.go (+9 -11)
📝 model/models/llama/model.go (+24 -4)
📝 model/models/llama4/model.go (+1 -2)
📝 model/models/mistral3/model.go (+1 -1)
📝 model/models/mllama/model.go (+1 -1)
📝 model/models/qwen2/model.go (+1 -1)
📝 model/models/qwen25vl/model.go (+1 -1)
📝 model/models/qwen3/embed.go (+1 -1)
📝 model/models/qwen3/model.go (+1 -1)
📝 sample/samplers_test.go (+0 -1)

📄 Description

This change modifies the BPE tokenizer to accept multiple regular expressions, which is crucial for models such as DeepSeek.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-25 00:43:28 -05:00
Reference: github-starred/ollama#45033