[PR #13091] [MERGED] fix(tokenizer): add special tokens to empty inputs #24612

Closed
opened 2026-04-19 17:41:06 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/13091
Author: @mxyng
Created: 11/14/2025
Status: Merged
Merged: 11/18/2025
Merged by: @mxyng

Base: main ← Head: mxyng/fix-special-tokens


📝 Commits (1)

  • 6ab0b73 fix(tokenizer): add special tokens to empty inputs

📊 Changes

5 files changed (+98 additions, -7 deletions)

View changed files

📝 model/bytepairencoding.go (+1 -1)
📝 model/sentencepiece.go (+1 -1)
📝 model/vocabulary.go (+2 -2)
📝 model/vocabulary_test.go (+93 -2)
📝 model/wordpiece.go (+1 -1)

📄 Description

this change allows special tokens to be added to empty input sequences. specifically, this addresses the case where an image is provided to a model without a template.
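
a minimal sketch of the behavior described above, not the code from this PR. the `vocab` type, its fields, the toy `encodeText` encoder, and the `encodeGuarded` / `encodeFixed` names are all hypothetical and exist only for illustration. it assumes the earlier behavior amounted to skipping special tokens whenever encoding produced no tokens:

```go
package main

import "fmt"

// vocab is a hypothetical stand-in for a tokenizer vocabulary.
// The field names are illustrative and not taken from the PR.
type vocab struct {
	bos, eos       int32
	addBOS, addEOS bool
}

// addSpecials wraps ids with the configured BOS and EOS tokens.
func (v vocab) addSpecials(ids []int32) []int32 {
	out := make([]int32, 0, len(ids)+2)
	if v.addBOS {
		out = append(out, v.bos)
	}
	out = append(out, ids...)
	if v.addEOS {
		out = append(out, v.eos)
	}
	return out
}

// encodeText is a toy encoder: one token per byte, offset by 3 so that
// byte tokens never collide with the special token ids used in main.
func encodeText(s string) []int32 {
	ids := make([]int32, 0, len(s))
	for i := 0; i < len(s); i++ {
		ids = append(ids, int32(s[i])+3)
	}
	return ids
}

// encodeGuarded models the assumed earlier behavior: special tokens are
// added only when the input produced at least one token.
func (v vocab) encodeGuarded(s string, addSpecial bool) []int32 {
	ids := encodeText(s)
	if addSpecial && len(ids) > 0 {
		ids = v.addSpecials(ids)
	}
	return ids
}

// encodeFixed models the behavior this PR describes: special tokens are
// added whenever requested, including for empty input.
func (v vocab) encodeFixed(s string, addSpecial bool) []int32 {
	ids := encodeText(s)
	if addSpecial {
		ids = v.addSpecials(ids)
	}
	return ids
}

func main() {
	v := vocab{bos: 1, eos: 2, addBOS: true}
	fmt.Println(v.encodeGuarded("", true)) // prints []
	fmt.Println(v.encodeFixed("", true))   // prints [1]
}
```

with the guard in place, an empty input yields no tokens at all, so a sequence that starts with an image would never receive its leading special token.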

previously, the runner's preprocessing would regexp-split the input into an empty part with an extracted image tag, and the rest of the input. since the empty part produces zero tokens during encoding, special tokens were incorrectly skipped.
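
to see how an empty part can arise, here is a small sketch of splitting a prompt around an image tag. the `[img-N]` tag format, the `imageTag` pattern, and the `splitParts` helper are assumptions made for this example; the description does not state the exact pattern the runner uses:

```go
package main

import (
	"fmt"
	"regexp"
)

// imageTag is a hypothetical tag pattern, used only for this illustration.
var imageTag = regexp.MustCompile(`\[img-\d+\]`)

// splitParts returns the text segments around each image tag,
// keeping empty segments.
func splitParts(prompt string) []string {
	return imageTag.Split(prompt, -1)
}

func main() {
	// With no template, the image tag sits at the very start of the
	// prompt, so the segment before it is the empty string.
	parts := splitParts("[img-0]describe this image")
	fmt.Printf("%q\n", parts) // prints ["" "describe this image"]
}
```

the leading empty segment is the one that encodes to zero tokens, which is where the special tokens would otherwise be attached.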


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-19 17:41:06 -05:00
Reference: github-starred/ollama#24612