[GH-ISSUE #15229] gemma4:31b drops all Unicode diacritics (accented characters) in output #56251

Closed
opened 2026-04-29 10:29:01 -05:00 by GiteaMirror · 1 comment
Owner

Originally created by @hydropix on GitHub (Apr 2, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15229

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

gemma4:31b strips all multi-byte UTF-8 characters (accents, diacritics) from both the content and thinking fields. The output is otherwise intelligible, but every accented character is silently removed.

Example: asking for a short French sentence with accents produces:

"L' a mang g  for."

instead of something like "L'élève a mangé gâteau en forêt." — every é, è, ê, ù, ç, à is dropped.

This does not happen with gemma3:27b on the same Ollama instance, which correctly outputs accented characters.

Steps to reproduce

# Non-streaming — accents missing
curl -s http://localhost:11434/api/chat -d '{
  "model": "gemma4:31b",
  "messages": [{"role":"user","content":"Écris une phrase courte en français avec des accents (é, è, ê, ù, ç, à)."}],
  "stream": false
}'
# Response: {"message":{"content":"L' a mang g  for.", ...}}

# Same prompt with gemma3:27b — accents present
curl -s http://localhost:11434/api/chat -d '{
  "model": "gemma3:27b",
  "messages": [{"role":"user","content":"Écris une phrase courte en français avec des accents (é, è, ê, ù, ç, à)."}],
  "stream": false
}'
# Response: {"message":{"content":"L'élu a dû gérer l'événement malgré..."}}

The thinking field is also affected — diacritics are missing there too.

Disabling thinking with "think": false does not fix the issue.
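To make the repro checkable without eyeballing output, here is a small stdlib-only helper (a sketch, not part of Ollama) that reports whether a response string contains any character carrying a diacritic, by looking for combining marks after NFD decomposition. Pairing it with the responses from the curl calls above should return True for gemma3:27b and False for the buggy gemma4:31b output.

```python
import unicodedata

def has_diacritics(text: str) -> bool:
    """True if any character in `text` decomposes to a combining mark (diacritic)."""
    decomposed = unicodedata.normalize("NFD", text)
    return any(unicodedata.category(ch) == "Mn" for ch in decomposed)

# Usage with the API responses above (response parsed from JSON):
#   has_diacritics(response["message"]["content"])  # expect True for French text
```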

Environment

  • Ollama version: 0.20.0-rc0
  • OS: Windows (remote server)
  • Model: gemma4:31b (also tested gemma4:latest)
  • Comparison: gemma3:27b works correctly on the same instance

Expected behavior

Accented characters (é, è, ê, ù, ç, à, etc.) should be preserved in the output, as they are with gemma3.

Additional context

Likely related to the tokenizer/detokenizer for the new Gemma4 architecture introduced in PR #15214. The issue affects all output fields (content and thinking), suggesting it happens at the token decoding stage.
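One way a detokenizer can produce exactly this symptom: accented characters are multi-byte in UTF-8 (é is 0xC3 0xA9), and byte-level tokenizers can split a code point across token boundaries. If each token's bytes are decoded independently with invalid sequences silently discarded, every split character vanishes while ASCII survives. The sketch below illustrates this failure mode with hypothetical token byte pieces; it is not Ollama's actual code, just a demonstration of the class of bug suggested above.

```python
import codecs

# Hypothetical token pieces where boundaries split UTF-8 code points
# (0xC3 0xA9 = "é", 0xC3 0xA8 = "è"); not real Gemma tokenizer output.
pieces = [b"L'\xc3", b"\xa9l\xc3", b"\xa8ve a mang\xc3", b"\xa9"]

# Buggy: decode each piece independently, discarding invalid bytes.
# Every code point split across a boundary is silently dropped.
naive = "".join(p.decode("utf-8", errors="ignore") for p in pieces)

# Correct: buffer partial byte sequences across pieces with an
# incremental decoder, so split code points are reassembled.
dec = codecs.getincrementaldecoder("utf-8")()
buffered = "".join(dec.decode(p) for p in pieces) + dec.decode(b"", final=True)

print(naive)     # L'lve a mang
print(buffered)  # L'élève a mangé
```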


@szmarczak commented on GitHub (Apr 2, 2026):

Not all characters are dropped: https://github.com/ollama/ollama/issues/15231


Reference: github-starred/ollama#56251