[PR #14218] tokenizer: merge pre-tok REs to prevent multiple matches #40448

Open
opened 2026-04-23 01:20:31 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/14218
Author: @rick-github
Created: 2/12/2026
Status: 🔄 Open

Base: main ← Head: bpe


📝 Commits (1)

  • 5c927a7 bpe: merge pre-tok REs to prevent multiple matches

📊 Changes

1 file changed (+2 additions, -0 deletions)

📝 tokenizer/bytepairencoding.go (+2 -0)

📄 Description

NewBytePairEncoding() takes a list of REs for pre-tokenization. If the REs are not mutually exclusive, the same text can be matched more than once, which causes token duplication. Models affected:

deepseek-v3.1:671b
deepseek-r1:671b
cogito-2.1:671b
r1-1776:671b
deepseek-ocr

If ported to the ollama engine, the following models would also be affected:

granite-code
granite3-dense
granite3-moe
granite3-guardian
granite3.1-dense
granite3.1-moe
granite-3.2
granite3.3

Test:

$ ollama run deepseek-v3.1:671b '"1 2 3 4" repeat the previous string'
Thinking...
First, the user said: "1" then "1 2" then "1 "1 2 3" but that seems messy. Let me read carefully.

The user said: "1" then "1 2" then "1 "1 2 3" but that doesn't make sense. Perhaps it's a pattern.

Perhaps it's a sequence of strings. Let me see the user's message: "1" then "1 2" then "1 "1 2 3"
but that seems off. Maybe it's "1" then "1 2" then "1 2 3" but the user wrote "1 "1 2 3" which might
be a typo.
...

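This PR merges the list into a single RE with alternation, so that each piece of input is matched only once. There are two alternative approaches: modify the call to NewBytePairEncoding() in the model.go file of each affected model to pass a pre-concatenated string, or modify the parsing in split (https://github.com/ollama/ollama/blob/379fd64fa837e51eddeda6c32f5c812d912a6751/tokenizer/bytepairencoding.go#L50) to exit early on a match.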

Fixes: #13388


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-23 01:20:31 -05:00
Reference: github-starred/ollama#40448