[PR #12417] [MERGED] parsers: fix unicode handling for qwen3-coder #19089

Closed
opened 2026-04-16 06:56:28 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/12417
Author: @drifkin
Created: 9/25/2025
Status: Merged
Merged: 9/25/2025
Merged by: @drifkin

Base: main ← Head: drifkin/qwen3-coder-unicode


📝 Commits (1)

  • 05ba4ca parsers: fix unicode handling for qwen3-coder

📊 Changes

2 files changed (+231 additions, -4 deletions)

📝 model/parsers/qwen3coder.go (+14 -4)
📝 model/parsers/qwen3coder_test.go (+217 -0)

📄 Description

When trimming whitespace at the end of every chunk, we were iterating backwards over the string byte-by-byte instead of rune-by-rune.

As an example of how this can cause corruption, suppose we have the multi-byte character ✅ ("\u2705"), which is represented in UTF-8 as the three bytes 0xE2 0x9C 0x85. Read on its own as a code point, 0x85 is NEL (U+0085), which passes unicode.IsSpace(). Because we were iterating byte-by-byte, we mistakenly sliced in the middle of the rune, removing 0x85 and leaving 0xE2 0x9C, which, beyond being the incorrect place to slice, is not even valid UTF-8.
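
The failure mode described above can be reproduced with a small standalone program. This is only a sketch: `byteWiseTrailingWhitespaceLen` is a hypothetical reconstruction of a byte-by-byte backward scan, not the code that was in `qwen3coder.go`.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// byteWiseTrailingWhitespaceLen is a hypothetical reconstruction of the
// buggy approach: it walks backwards one byte at a time and treats each
// byte as if it were a complete code point.
func byteWiseTrailingWhitespaceLen(s string) int {
	n := 0
	for i := len(s) - 1; i >= 0; i-- {
		// rune(s[i]) turns the byte 0x85 into U+0085 (NEL),
		// which unicode.IsSpace reports as whitespace.
		if !unicode.IsSpace(rune(s[i])) {
			break
		}
		n++
	}
	return n
}

func main() {
	s := "done \u2705" // ends with the bytes 0xE2 0x9C 0x85
	n := byteWiseTrailingWhitespaceLen(s)
	trimmed := s[:len(s)-n]

	fmt.Println(n)                         // 1: only the final byte 0x85 is counted
	fmt.Printf("% x\n", trimmed)           // 64 6f 6e 65 20 e2 9c
	fmt.Println(utf8.ValidString(trimmed)) // false
}
```

The scan stops after one byte because 0x9C, read as U+009C, is not whitespace, so the slice ends with a truncated two-byte prefix of the original rune.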

trailingWhitespaceLen() was modified to count from the end in a rune-aware way. Tests with various multibyte unicode characters were also added.
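
One way to count trailing whitespace in a rune-aware way is to decode from the end with `utf8.DecodeLastRuneInString`. The sketch below illustrates that technique under stated assumptions; `runeAwareTrailingWhitespaceLen` is a hypothetical name, and the merged `trailingWhitespaceLen()` may be written differently.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// runeAwareTrailingWhitespaceLen returns the number of bytes occupied by
// trailing whitespace in s, decoding whole runes from the end.
// Hypothetical helper for illustration only.
func runeAwareTrailingWhitespaceLen(s string) int {
	total := 0
	remaining := s
	for len(remaining) > 0 {
		// Decode the last complete rune and its width in bytes.
		// Invalid UTF-8 yields (utf8.RuneError, 1), which is not
		// whitespace, so the loop stops there.
		r, size := utf8.DecodeLastRuneInString(remaining)
		if !unicode.IsSpace(r) {
			break
		}
		total += size
		remaining = remaining[:len(remaining)-size]
	}
	return total
}

func main() {
	fmt.Println(runeAwareTrailingWhitespaceLen("done \u2705"))    // 0: the check mark is left intact
	fmt.Println(runeAwareTrailingWhitespaceLen("done \u2705 \n")) // 2: the space and the newline
}
```

Returning a byte count rather than a rune count keeps the result directly usable as a slice offset, as in `s[:len(s)-n]`.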

Fixes: #12414


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-16 06:56:28 -05:00
Reference: github-starred/ollama#19089