[PR #13500] fix: preserve partial UTF-8 bytes in logprobs API response #40111

Open
opened 2026-04-23 01:05:45 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/13500
Author: @nathannewyen
Created: 12/16/2025
Status: 🔄 Open

Base: main ← Head: fix/logprobs-utf8-bytes


📝 Commits (1)

  • 8dac29b fix: preserve partial UTF-8 bytes in logprobs API response

📊 Changes

6 files changed (+297 additions, -2 deletions)

📝 llm/server.go (+5 -0)
📝 runner/common/logprob.go (+4 -0)
📝 runner/common/logprob_test.go (+135 -0)
📝 server/logprob.go (+20 -2)
➕ server/logprob_test.go (+127 -0)
📝 server/routes_generate_test.go (+6 -0)

📄 Description

Summary

  • Fix issue where partial UTF-8 tokens returned wrong bytes in logprobs API
  • Store raw bytes before JSON encoding to preserve them through the transfer
  • Add tests verifying the fix

Problem

When logprobs are requested, tokens that hold only part of a UTF-8 sequence (for example, a single byte of the emoji 😊) return [239, 191, 189], the UTF-8 encoding of the replacement character U+FFFD, instead of the actual bytes, such as [240].
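
To make "partial UTF-8 token" concrete, here is a minimal sketch that splits 😊 into one-byte pieces and checks each piece with the standard library. The helper names `perByteTokens` and `countInvalid` are hypothetical and exist only for illustration; a real tokenizer may split the sequence differently.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// perByteTokens (hypothetical helper) treats every byte of s as its own
// token, imitating a tokenizer that emits a multi-byte character piecewise.
func perByteTokens(s string) [][]byte {
	raw := []byte(s)
	tokens := make([][]byte, len(raw))
	for i := range raw {
		tokens[i] = raw[i : i+1]
	}
	return tokens
}

// countInvalid (hypothetical helper) counts the tokens that are not valid
// UTF-8 on their own.
func countInvalid(tokens [][]byte) int {
	n := 0
	for _, tok := range tokens {
		if !utf8.Valid(tok) {
			n++
		}
	}
	return n
}

func main() {
	tokens := perByteTokens("😊") // U+1F60A encodes as F0 9F 98 8A
	for _, tok := range tokens {
		fmt.Printf("%v valid=%t\n", tok, utf8.Valid(tok))
	}
	// Output:
	// [240] valid=false
	// [159] valid=false
	// [152] valid=false
	// [138] valid=false
	fmt.Println(countInvalid(tokens)) // 4
}
```

None of the four pieces is valid UTF-8 in isolation, which is why a string-based representation cannot carry them unchanged.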

Root Cause

Token strings are corrupted during JSON marshaling between the runner and the server: invalid UTF-8 is replaced with U+FFFD.
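
The corruption can be reproduced in isolation with Go's `encoding/json`, which coerces string values to valid UTF-8 when marshaling. The sketch below uses a hypothetical helper, `roundTrip`, and a plain string rather than the actual runner and server types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// roundTrip (hypothetical helper) marshals s to JSON, unmarshals it again,
// and returns the bytes of the string that comes back.
func roundTrip(s string) ([]byte, error) {
	data, err := json.Marshal(s)
	if err != nil {
		return nil, err
	}
	var out string
	if err := json.Unmarshal(data, &out); err != nil {
		return nil, err
	}
	return []byte(out), nil
}

func main() {
	// 0xF0 is the first byte of the four-byte encoding of 😊 and is not
	// valid UTF-8 on its own.
	partial := string([]byte{0xF0})

	got, err := roundTrip(partial)
	if err != nil {
		panic(err)
	}
	fmt.Println(got) // [239 191 189], the encoding of U+FFFD
}
```

The original byte 240 is gone after the round trip, so nothing downstream can recover it from the string alone.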

Solution

  1. Added Bytes []byte field to llm.TokenLogprob
  2. Populate bytes in CalculateLogprobs() before JSON encoding
  3. Use stored bytes in toAPILogprobs() instead of converting from corrupted string
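
The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `tokenLogprob` is a hypothetical stand-in for `llm.TokenLogprob` (only the `Bytes []byte` field comes from the description; the other fields and JSON tags are guesses), and `newTokenLogprob`, `transfer`, and `apiBytes` are made-up helpers standing in for `CalculateLogprobs()`, the runner-to-server hop, and `toAPILogprobs()`. The fallback to the string when no bytes are stored is also an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// tokenLogprob is a hypothetical stand-in for llm.TokenLogprob.
// Only Bytes is taken from the PR description; the rest is assumed.
type tokenLogprob struct {
	Token   string  `json:"token"`
	Logprob float64 `json:"logprob"`
	Bytes   []byte  `json:"bytes,omitempty"`
}

// newTokenLogprob copies the raw token bytes before any JSON encoding
// happens (step 2).
func newTokenLogprob(raw []byte, logprob float64) tokenLogprob {
	return tokenLogprob{
		Token:   string(raw),
		Logprob: logprob,
		Bytes:   append([]byte(nil), raw...),
	}
}

// transfer imitates the runner-to-server hop with a JSON round trip.
// encoding/json writes []byte as base64, so Bytes survives unchanged.
func transfer(t tokenLogprob) (tokenLogprob, error) {
	var out tokenLogprob
	data, err := json.Marshal(t)
	if err != nil {
		return out, err
	}
	err = json.Unmarshal(data, &out)
	return out, err
}

// apiBytes prefers the stored bytes (step 3) and falls back to the
// string only when none were stored (assumed behavior).
func apiBytes(t tokenLogprob) []int {
	src := t.Bytes
	if len(src) == 0 {
		src = []byte(t.Token)
	}
	out := make([]int, len(src))
	for i, b := range src {
		out[i] = int(b)
	}
	return out
}

func main() {
	received, err := transfer(newTokenLogprob([]byte{0xF0}, -0.25))
	if err != nil {
		panic(err)
	}
	fmt.Println([]byte(received.Token)) // [239 191 189]: the string is still corrupted
	fmt.Println(apiBytes(received))     // [240]: the stored bytes are intact
}
```

Carrying the bytes in a `[]byte` field works because `encoding/json` base64-encodes byte slices instead of treating them as text, so no UTF-8 coercion is applied to them.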

Test plan

  • Added TestCalculateLogprobsPartialUTF8Bytes - verifies raw bytes preserved
  • Added TestLogprobBytesJSONRoundTrip - verifies JSON round-trip preservation
  • Added TestToAPILogprobsPreservesBytes - verifies API conversion uses stored bytes

Fixes #13497


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-23 01:05:45 -05:00

Reference: github-starred/ollama#40111