[PR #10081] [MERGED] model: fix tokenization issues with spm tokenizer #59840

Closed
opened 2026-04-29 14:46:07 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/10081
Author: @jmorganca
Created: 4/2/2025
Status: Merged
Merged: 4/2/2025
Merged by: @jmorganca

Base: main ← Head: jmorganca/spm


📝 Commits (10+)

📊 Changes

5 files changed (+171 additions, -109 deletions)

View changed files

📝 model/models/gemma2/model.go (+0 -1)
📝 model/models/gemma3/model.go (+0 -1)
📝 model/models/gemma3/model_text.go (+0 -1)
📝 model/process_text_spm.go (+114 -103)
📝 model/process_text_spm_test.go (+57 -3)

📄 Description

This PR fixes inconsistencies in the SPM tokenizer for Gemma 3.

Note: this fixes tokenization of certain UTF-8 characters (e.g. certain Korean characters), but it does not yet fix detokenizing them.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

### 📝 Commits (10+)

- [`5ae3cf1`](https://github.com/ollama/ollama/commit/5ae3cf1a26867acb92549ed7d7742d3880ba7b9d) model: fix issues with spm tokenizer
- [`9bb6f22`](https://github.com/ollama/ollama/commit/9bb6f22ecacf04643fe177a42679240cdf2adf12) updates to lower changes
- [`c104d7f`](https://github.com/ollama/ollama/commit/c104d7fb679f5c08be6a9b29bc6eb789cf866510) revert differences
- [`9864032`](https://github.com/ollama/ollama/commit/9864032deffba943b53085fef7885dc4f4d3ad66) move code
- [`2b03c22`](https://github.com/ollama/ollama/commit/2b03c22978875bc95364503d5cd14d232351ccac) less diff
- [`5f1b3db`](https://github.com/ollama/ollama/commit/5f1b3db37dc831cce7e842fffd8dc957b33546b5) lower diff
- [`fa14508`](https://github.com/ollama/ollama/commit/fa145083615b545d1fc5eae6f92f49501d811f63) less diff
- [`e192573`](https://github.com/ollama/ollama/commit/e192573cb890cd952abd5e1e8918524ba9db61bc) progress
- [`493c882`](https://github.com/ollama/ollama/commit/493c88211e5251a912ac552657bc19bee4c94ea0) wip
- [`0ebcce2`](https://github.com/ollama/ollama/commit/0ebcce250fdbd4d59387bb51aa8215cf1db4f17d) less diff
GiteaMirror added the pull-request label 2026-04-29 14:46:07 -05:00
Reference: github-starred/ollama#59840