[GH-ISSUE #15234] gemma4:e4b drops accented/Unicode characters, producing garbled French text #56255

Open
opened 2026-04-29 10:29:16 -05:00 by GiteaMirror · 10 comments

Originally created by @vltrajvlien on GitHub (Apr 2, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15234

What is the issue?

gemma4:e4b silently drops accented and multi-byte UTF-8 characters from its output, producing garbled text in French. Words lose their accented characters entirely (not just the diacritics — the whole character is removed), making the output largely incoherent.

Prompt: "écris un texte en français"

Actual output (excerpt):

Puisque vous n'avez pas spifi sujet, je vais un texte qui que le temps, la beaut et la n s'arr.

l'art de l'arr. Cet art n'exige ni ticket de transport, ni itin pr ; il ne demande qu'une pause, un simple moment o l'on cesse de courir pour commencer .

le grlement lointain du caf bout, le chant hitant d'un oiseau qui teste sa modie, le bruissement l d'une feuille qui danse au gr'une brise ti.

Expected output should have proper accented characters:

  • "spifi" → "spécifié"
  • "beaut" → "beauté"
  • "s'arr" → "s'arrête"
  • "grlement" → "grondement" (or "grésillement")
  • "modie" → "mélodie"
  • etc.

Every é, è, ê, ë, à, ù, ç, ô, î, and similar character is stripped from the output.

Screenshot

gemma4-unicode-bug

(Ollama macOS app, gemma4:e4b, French text generation with widespread character drops)

Steps to reproduce

curl -s http://localhost:11434/api/chat -d '{
  "model": "gemma4:e4b",
  "messages": [{"role":"user","content":"Écris un texte en français"}],
  "stream": false
}'
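As a quick sanity check on the repro above, the reply text can be scanned for non-ASCII code points: normal French prose contains many, while an affected build returns almost none. The helper below is a hypothetical sketch (not part of the original report); it uses a garbled excerpt from this issue and a corrected version of the same sentence as inputs.

```python
# Sketch: flag accent-stripping by counting non-ASCII code points in a
# reply. Normal French prose contains many; an affected build's output
# contains (almost) none. Feed it the "content" field of /api/chat.
def non_ascii_count(text: str) -> int:
    return sum(1 for ch in text if ord(ch) > 127)

garbled = "Puisque vous n'avez pas spifi sujet, je vais un texte."
healthy = "Puisque vous n'avez pas spécifié de sujet, je vais écrire un texte."
print(non_ascii_count(garbled))   # 0
print(non_ascii_count(healthy))   # 3
```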

OS

macOS (Apple M4 Max, 64 GB)

Ollama version

0.20.0-rc0

GPU

Apple M4 Max (integrated)

Related issues

  • #15229 — same bug on gemma4:31b (drops Unicode diacritics)
  • #15231 — same bug on gemma4:31b with Polish characters

This appears to be a tokenizer/detokenizer issue in the Gemma 4 architecture affecting all model sizes, not just 31b.


@martin-rizzo commented on GitHub (Apr 2, 2026):

The same issue occurs with the gemma4:e2b model when generating text in Spanish. Accented characters are also dropped.


@sebastian-bort-leki-pl commented on GitHub (Apr 2, 2026):

In Polish also.


@TeoDragan commented on GitHub (Apr 2, 2026):

I can reproduce a similar issue in Romanian.

Environment:

  • model: gemma4:26b
  • Ollama: 0.20.0-rc1
  • macOS
  • Ollama desktop app

Observed behavior:
Romanian diacritics are corrupted in the output. Examples:

  • Dacă becomes Dac�
  • există becomes exist�
  • țări becomes corrupted
  • școli becomes corrupted
  • pregătit becomes corrupted

The semantic content is still mostly correct, but Unicode / accented characters are not rendered properly.

This suggests the problem is not limited to French and also affects Romanian.
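For what it's worth, the two symptom flavors in this thread match the two standard error policies for decoding invalid UTF-8: dropping the character outright (the French reports above) versus substituting U+FFFD "�" (the Romanian examples). A small Python illustration of that difference, using a word from this comment:

```python
# Cut the trailing byte of 'ă' (0xC4 0x83) to simulate a truncated
# multi-byte sequence, then decode with each error policy.
word = "Dacă"
broken = word.encode("utf-8")[:-1]               # b'Dac\xc4'
print(broken.decode("utf-8", errors="ignore"))   # 'Dac'  (French-style drop)
print(broken.decode("utf-8", errors="replace"))  # 'Dac�' (Romanian-style U+FFFD)
```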


@Naimad1CZ commented on GitHub (Apr 2, 2026):

This problem also affects gemma4:26b-a4b-it-q8_0 and gemma4:31b-it-q8_0 in Czech.
The text often gets so corrupted that not only am I unsure what was originally meant, I'm also unsure whether the answer would make sense even if the Unicode characters were rendered correctly.

Some examples (I'm running a private benchmark with some questions; I usually show only parts of each answer):
gemma4:26b-a4b-it-q8_0:

Odůvodnn
Otka „Co dl?“ se tkn vztahů a rodinn
lenů pacienta. Informace o tom, co dlaj blac jsou sou
osobn kontextu a znalostn o jeho rodin, nikoliv obecn aktu zpr nebo dat ze senzorů.

Should be something like:

Odůvodnění
Otázka „Co dělá?“ se týká vztahů a rodinn
lenů (???) pacienta. Informace o tom, co děla j blac jsou sou (???)
osobního kontextu a znalostní o jeho rodině, nikoliv obecných aktuálních zpráv nebo dat ze senzorů.

Just WTF:

Otka „Co dl?“ se tkn vztahů a rodinn přn pacienta. Informace o tom, k je jeho syn, co dl kdy ho naposledy navšt, nejsou obecn znou skute
nostB), ani technick ajem z okolC). Tato informace je uložena v osobnostn (knowledge base), kter specifick o rodin a život pacienta.

acient se pt konkrnaci tkaj jeho soci okolsoused Tato informace nenn faktem (B), nen ze senzorů v byt (C) a nen pouze bžn bavorsk rozhovor (D). lo na souseda je specifickick, kter mla b uložena v jeho osobnostn (knowledge base) spolu s kontakty na rodinu a bl

gemma4:31b-it-q8_0:

Kategorie: A) (volostn s praktick informacemi o pacientovi, jeho rodin, denn rozvrhu a zař ve kter žije).

Should be:

Kategorie: A) (volostn (???) s praktickými informacemi o pacientovi, jeho rodině, denním rozvrhu a zařízení ve kterém žije).

Just WTF:

Sprnifikace je: A)

**Zdůvodnn Otka se tkn, kter pacientem interakci. Informace o nštv, person nebo rodinn přn v r denn rozvrhu jsou uloženy v osobnostn pacienta.

as obvykle nachnebo zda mlovanou nštvu), patř znalostn s praktick informacemi o pacientovi a jeho rozvrhu.

Ollama downloaded from https://github.com/ollama/ollama/releases/download/v0.20.0-rc1/ollama-linux-amd64.tar.zst


@oceancholic commented on GitHub (Apr 2, 2026):

Same with Turkish characters. Strangely, the model itself processes the prompt correctly and, as far as I can tell, also responds correctly, but Ollama itself drops the characters (not only a single character but sometimes the whole word after the special character; emojis work, though).


@szmarczak commented on GitHub (Apr 2, 2026):

A fix was published minutes ago: https://github.com/ollama/ollama/releases/tag/v0.20.0

It's been fixed (at least for Polish characters).


@emby commented on GitHub (Apr 2, 2026):

I am experiencing this exact same issue with Norwegian characters on v0.20.0-rc1. When running gemma4:26b and gemma4:31b via the Ollama CLI, special characters like æ, ø, and å (as well as others like ä and é) completely vanish from the output stream.

For context:

  • Environment: macOS (M3 Max)
  • Terminals tested: Ghostty (running Bash)
  • Symptom: The letters are just swallowed entirely. For example, the word "snøen" becomes "snen", and "frå" becomes "fr".

Since the text generates perfectly when testing models via API or in Android Studio, this seems to be strictly isolated to how the Ollama CLI handles UTF-8 token decoding or stream rendering for certain models. Hope this helps narrow it down!


@szmarczak commented on GitHub (Apr 2, 2026):

@emby I cannot reproduce on latest pre-release:

>>> Say "æøåäé" and nothing else.
Thinking...
*   Input: 'Say "æøåäé" and nothing else.'
    *   Constraint: Output only "æøåäé".

    *   The user wants specific characters: æ, ø, å, ä, é.
    *   The user explicitly stated "and nothing else".

    *   Result: æøåäé
...done thinking.

æøåäé

@emby commented on GitHub (Apr 2, 2026):

No, it works on 0.20.0. My comment was for v0.20.0-rc1, sorry for not writing that. Thanks!


@M4T3CITO commented on GitHub (Apr 6, 2026):

The problem still persists; in Spanish, it responds with unintelligible texts.
