[GH-ISSUE #15502] gemma4:31b repetition loop during constrained JSON generation with free-text string fields #71968

Open
opened 2026-05-05 03:10:36 -05:00 by GiteaMirror · 6 comments

Originally created by @rnh0 on GitHub (Apr 11, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15502

## Summary

`gemma4:31b` enters word-repetition loops when generating long free-text strings inside a JSON schema constraint (`format=`). A word doubles, then collapses into a single repeated token that fills the remaining `num_predict` budget, leaving the JSON unterminated. The bug rate is **60-100%** depending on the prompt, across 39 trials.

**This is NOT the `<unused>` token / GEMV buffer overlap bug** (llama.cpp#21321 / #21566). The repeated tokens are normal English words, and zero `<unused>` tokens were observed in any trial.

## Observed behavior

Actual output from the minimal repro below (`seed=0`):

```json
{
  "description": "The horizon is a bleeding, vibrant, orange-gold single own line where the sky meets
the ocean, as the sun dips low, casting a long, shimmering path of amber own own own own own own own
own own own own own own own own own own own own own own own own own own own own own own own own own
```

(300 chars, unterminated JSON — the `"own"` token repeats until `num_predict` is exhausted.)

The cascade pattern:

1. Normal generation starts fine inside a JSON string value
2. A token intrudes: `"amber own"`
3. Collapses into a single repeated token: `"own own own own own..."`
4. Fills remaining `num_predict` budget (8192 tokens)
5. JSON left unterminated -> parse error

## Root cause isolation

We ran 39 trials across 13 test configurations, varying one condition at a time. Three conditions are **all required** to trigger the bug:

| # | Test | Rep Bugs | JSON Fail | What it proves |
|---|------|----------|-----------|----------------|
| 1 | 4 different prompts + schema + free-text | 8/12 | 10/12 | Not prompt-specific (60-100% rate) |
| 2 | **no format=** (free generation) | **0/3** | 0/3 | **format= IS required** |
| 3 | schema + **no free-text fields** | **0/3** | 0/3 | **Free-text strings in JSON trigger it** |
| 4 | schema + free-text + **think=False** | **0/3** | 3/3 | No repetition, but JSON broken (#15260) |
| 5a | **gemma4:26b** (MoE) + schema + free-text | **0/3** | 3/3 | **Dense (31b) only**, MoE has different JSON issues |
| 5b | **gemma3:27b** + schema + free-text | **0/3** | 0/3 | **gemma4-specific regression** |
| 6a | repeat_penalty=**1.0** | 2/3 | 2/3 | Penalty has no effect |
| 6b | repeat_penalty=**1.15** | 2/3 | 2/3 | Same seeds fail regardless |
| 6c | repeat_penalty=**1.5** | 2/3 | 2/3 | Cannot suppress the cascade |

### The three necessary conditions

1. **`gemma4:31b` (Dense)** — gemma4:26b (MoE) and gemma3:27b do not exhibit this bug
2. **`format=` with a JSON schema** — removing the grammar constraint eliminates the bug entirely
3. **Free-text string fields in the schema** (e.g., `"description": {"type": "string"}` requesting multi-sentence output) — a simple schema with only arrays and enums is clean

Vision input is **not** required — text-only prompts reproduce at the same rate.

*Note: The test matrix was collected using a longer prompt with vision input variants. The same bug reproduces with the simplified text-only prompt shown below. The minimal repro was verified independently.*
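For contrast with condition 3, a schema of the kind that tested clean — only enums and arrays, no free-text string fields — might look like this (an illustrative sketch; the field names are invented and this is not one of the actual test schemas):

```python
# Every string value is enum-constrained, so the model never generates
# sustained free text inside a JSON string — the reported clean case.
SAFE_SCHEMA = {
    "type": "object",
    "required": ["time_of_day", "weather", "tags"],
    "properties": {
        "time_of_day": {
            "type": "string",
            "enum": ["dawn", "noon", "sunset", "night"],
        },
        "weather": {
            "type": "string",
            "enum": ["clear", "cloudy", "stormy"],
        },
        "tags": {
            "type": "array",
            "items": {"type": "string", "enum": ["beach", "ocean", "sky", "sand"]},
        },
    },
}
```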

## Minimal reproduction

No images, no dependencies beyond the `ollama` Python package. Seeds 0 and 84 hit the repetition loop; seed 42 produces malformed JSON of a different kind. All 3/3 seeds produce broken output.

```python
import ollama

SCHEMA = {
    "type": "object",
    "required": ["description", "analysis", "tags"],
    "properties": {
        "description": {
            "type": "string",
            "description": "At least 3 detailed sentences.",
        },
        "analysis": {
            "type": "string",
            "description": "Several paragraphs of analysis.",
        },
        "tags": {"type": "array", "items": {"type": "string"}},
    },
}

PROMPT = (
    "Describe a beach scene at sunset in detail. "
    "Write at least 3 full sentences for description "
    "and several paragraphs for analysis."
)

response = ollama.chat(
    model="gemma4:31b",
    messages=[{"role": "user", "content": PROMPT}],
    format=SCHEMA,
    options={
        "num_ctx": 32768,
        "num_predict": 8192,
        "repeat_penalty": 1.15,
        "repeat_last_n": 256,
        "seed": 0,
    },
)
content = response.message.content
print(f"Length: {len(content)} chars")
print(f"Tail: ...{content[-200:]}")
```
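For running many seeds, the failure modes can be bucketed mechanically. A sketch of the classifier logic we use to tally trials (the 10-repeat threshold is an arbitrary choice of ours, not part of any spec):

```python
import json
import re

def classify_output(content: str) -> str:
    """Bucket a model response: 'repetition_loop', 'valid', or 'malformed'."""
    # A single word repeated 10+ times in a row marks the degenerate loop.
    if re.search(r"\b(\w+)(?:\s+\1){9,}\b", content):
        return "repetition_loop"
    try:
        json.loads(content)
        return "valid"
    except json.JSONDecodeError:
        return "malformed"

# Hypothetical usage with the repro above:
# for seed in (0, 42, 84):
#     response = ollama.chat(..., options={..., "seed": seed})
#     print(seed, classify_output(response.message.content))
```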

### Expected output

Valid JSON (~500-2000 chars) with `description`, `analysis`, and `tags` fields, properly terminated.

### Actual output

Unterminated JSON (300 chars with `seed=0`, up to ~33,000 chars with other seeds). The `description` or `analysis` field enters a repetition loop partway through and fills the remaining token budget.

## Expected vs actual behavior

| | Expected | Actual |
|---|----------|--------|
| **Output** | Valid, terminated JSON with multi-sentence free-text fields | Word repetition loop fills num_predict budget, JSON unterminated |
| **repeat_penalty** | Higher values should suppress repetition | No effect at any tested value (1.0, 1.15, 1.5) — same seeds fail identically |
| **Grammar constraint** | Should enforce valid JSON structure | Grammar allows the repeated word because it's a valid string character sequence |

## System info

| Component | Value |
|-----------|-------|
| GPU | NVIDIA GeForce RTX 5090 (32 GB VRAM) |
| Driver (running kernel module) | 580.126.16 |
| CUDA Version | 13.0 |
| Ollama | **0.20.5** |
| OS | Ubuntu 24.04.3 LTS |
| Kernel | 6.17.0-14-generic x86_64 |
| CPU | AMD Ryzen 7 9800X3D |
| Model | gemma4:31b (SHA `6316f0629137`, 19 GB) |

## Additional context

### Why repeat_penalty has no effect

We tested `repeat_penalty` at 1.0, 1.15, and 1.5 — identical seeds fail identically at all values. Our hypothesis: the grammar constraint limits token choices at each step, and inside a JSON string value any valid string content (including word repetition) is allowed. If the model's logit distribution degenerates to strongly favor a single token, the grammar has no mechanism to reject it, regardless of penalty strength.
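The interaction can be illustrated with a toy greedy sampler (a hand-rolled sketch with made-up logit values — not Ollama's actual sampling code):

```python
import math

def sample_greedy(logits, allowed, recent, penalty):
    """Greedy pick from grammar-masked, repeat-penalized logits."""
    best_tok, best_score = None, -math.inf
    for tok, logit in logits.items():
        if tok not in allowed:           # grammar mask: token must be legal here
            continue
        if tok in recent:                # llama.cpp-style repeat penalty
            logit = logit / penalty if logit > 0 else logit * penalty
        if logit > best_score:
            best_tok, best_score = tok, logit
    return best_tok

# Degenerate distribution: "own" dominates all alternatives.
logits = {"own": 12.0, "waves": 2.0, "sand": 1.9, '"': 1.5}
# Inside a JSON string the grammar allows any string content, including
# the closing quote — so "own" is never masked out.
allowed = {"own", "waves", "sand", '"'}
recent = ["own"] * 64  # "own" already repeated many times

for p in (1.0, 1.15, 1.5):
    print(p, sample_greedy(logits, allowed, recent, p))  # "own" every time
```

Dividing the dominant logit by even 1.5 (12.0 → 8.0) still leaves it far above the alternatives, so the argmax never changes — consistent with identical seeds failing at every penalty value.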

### Interaction with ollama#15260

Test 4 (`think=False` + `format=`) produced 0/3 repetition bugs but 3/3 JSON failures (output was plain text, not JSON). This confirms #15260: when thinking is disabled, the format constraint is never applied because the end-of-thinking token never fires. This accidentally "fixes" the repetition bug by removing the grammar constraint entirely — but breaks structured output.

### gemma4:26b (MoE) behavior

The MoE variant produced 0/3 repetition bugs but 3/3 JSON failures of a different kind (malformed JSON, not repetition loops). The MoE model has separate structured output issues that may be related to #15428.

### Not the GEMV buffer overlap bug

Zero `<unused>` tokens were observed across all 39 trials. The repeated tokens are normal English words (`"beach"`, `"own"`, `"same"`, `"companion"`, `"fatigue,en"`). This is a distinct bug from the CUDA GEMV fusion buffer overlap fixed in llama.cpp#21566 / b8702.


---

*Tested on 2026-04-11. 39 trials across 13 test configurations.*

*The test matrix and isolation methodology were designed with assistance from Claude Code (Anthropic). All tests were run locally on the hardware described above. Results are deterministic and independently reproducible.*

## Related issues

- ollama/ollama#15260 — `think=false` breaks `format=` (format constraint silently ignored)
- ollama/ollama#15386 — Structured output contradicts model's own thinking (constrained decoding vs thinking tension)
- ollama/ollama#15350 — Flash Attention hangs on gemma4:31b Dense (different bug, same model)
- ollama/ollama#15428 — gemma4:26b empty response with long system prompts
- ggml-org/llama.cpp#21321 — Gemma 4 `<unused24>` tokens (GEMV buffer overlap — different root cause)
- ggml-org/llama.cpp#21566 — Fix for GEMV buffer overlap (does NOT fix this bug)

@rick-github commented on GitHub (Apr 11, 2026):

```diff
--- 15502.py.orig	2026-04-11 17:11:08.099297197 +0200
+++ 15502.py	2026-04-11 17:09:47.205267951 +0200
@@ -1,4 +1,5 @@
 import ollama
+import json
 
 SCHEMA = {
     "type": "object",
@@ -20,6 +21,7 @@
     "Describe a beach scene at sunset in detail. "
     "Write at least 3 full sentences for description "
     "and several paragraphs for analysis."
+    "\nReturn a JSON structure with this schema: " + json.dumps(SCHEMA)
 )
 
 response = ollama.chat(
```
The model doesn't know that the tokens it's going to generate are going to be constrained, so the probability distribution is not prepared for the first token to be restricted to `{`, which is a fairly low-probability token in an output talking about a beach. By telling the model the expected output format, it can better prepare.

```json
{
  "description": "The horizon is painted in vibrant hues of molten gold, deep violet, and burnt orange as the sun dips slowly below the waterline. Gentle, rhythmic waves lap against the powdery white sand, leaving behind a shimmering mirror of wet shoreline that reflects the kaleidoscope of the sky. A few scattered seashells and a lone, weathered driftwood log lie nestled in the tide's reach, while the salty breeze carries the distant, melodic cry of a departing seagull.",
  "analysis": "The scene described is a classic study in chromatic contrast and atmospheric peace. By utilizing a color palette of 'molten gold' and 'deep violet,' the description evokes a sense of luxury and transition, marking the boundary between the luminosity of day and the mystery of night. The focus on rhythmic waves and the 'shimmering mirror' emphasizes a theme of duality—the physical world meeting its own reflection—which suggests a moment of introspection and stillness.\n\nFrom a structural standpoint, the sensory details are carefully curated to move from the macroscopic to the microscopic. The narrative begins with the vastness of the horizon, narrows down to the interaction of water and sand, and finally rests on small, tangible objects like seashells and driftwood. This narrowing focus mimics the act of observing a a peaceful environment, grounding the viewer in the a localized reality while maintaining an awareness of the cosmic scale of the sunset.\n\nFurthermore, the auditory elements, such as the 'melodic cry' of the seagull and the 'rhythmic' sound of the waves, serve to fill the silence without disrupting it. These sounds provide a rhythmic cadence to the scene, reinforcing the idea that nature operates on a predictable, soothing cycle. The overall effect is one of transcendental tranquility, where the observer is invited to pause and experience the ephemeral beauty of a fleeting moment.",
  "tags": [
    "nature",
    "sunset",
    "beach",
    "atmospheric",
    "descriptive writing"
  ]
}
```

@rnh0 commented on GitHub (Apr 11, 2026):

Thanks for the quick reply and suggestion! Confirmed — including the schema in the prompt does significantly improve the toy example from the original report (0/5 repetition loops, 5/5 valid JSON vs. 2/3 and 0/3 without it).

However, stress testing shows the improvement doesn't hold for more demanding use cases.

For each test below, the prompt includes `"\nReturn a JSON structure with this schema: " + json.dumps(SCHEMA)` as you suggested, and `format=SCHEMA` is set. 10 seeds per test, same options as the original report.

| Test | Repetition loops | Valid JSON |
|------|-----------------|------------|
| Short output (original repro) | 0/5 | 5/5 |
| **1000+ words requested** | **10/10** | 0/10 |
| Complex nested schema (5 fields, nested objects, enums) | 7/8 | 1/8 |
| 6 paragraph-length free-text fields | 3/10 | 7/10 |
| Vision input + schema in prompt | 3/10 | 4/10 |
| Minimal hint (`"Respond in JSON."` instead of full schema) | 4/10 | 5/10 |

The workaround helps the model prepare for JSON output during thinking, which is enough for short responses. But the underlying degeneration still occurs during sustained free-text generation inside JSON strings — the model's logit distribution still collapses into single-token repetition regardless of prompt priming.

The word doubling that appears even in successful outputs (`"a a"`, `"the the"`, `"sapphire sapphire"`) seems to be the precursor. For short outputs, generation ends before it cascades. For longer outputs, it inevitably does.
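A quick check for this precursor (a small sketch; it matches exact, case-sensitive doublings separated by a single space):

```python
import re

# A word followed immediately by an identical copy of itself, e.g. "a a".
DOUBLED = re.compile(r"\b(\w+) \1\b")

def doubled_words(text: str) -> list[str]:
    """Return words that appear immediately doubled in the text."""
    return [m.group(1) for m in DOUBLED.finditer(text)]

print(doubled_words("mimics the act of observing a a peaceful environment"))  # ['a']
```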

Stress test script: https://gist.github.com/rnh0/18a7f25c70da00c8e47235e849bc5798

Note: the test crashed partway through when ollama hung during model unload ("Stopping..." state). The complex schema result is 8 trials instead of 10.


@rnh0 commented on GitHub (Apr 11, 2026):

**Correction:** Our original report stated that gemma4:26b (MoE) did not exhibit the repetition bug (0/3 in our initial test). With more trials this turns out to be wrong.

Expanded testing with 10 seeds per test:

| Test | gemma4:26b (MoE) Rep / Valid | gemma4:31b (Dense) Rep / Valid |
|------|------------------------------|-------------------------------|
| Short output | 4/10 / 1/10 | 7/10 / 1/10 |
| 1000+ words | 5/10 / 1/10 | (not re-run) |
| Complex nested schema | 4/10 / 0/10 | (not re-run) |
| 6 free-text fields | 4/10 / 0/10 | (not re-run) |

Both model variants are affected. The 31b Dense has a higher repetition rate on short outputs (~70% vs ~40%), but the 26b MoE has equally poor JSON validity (0-1/10). The repeated tokens in 26b are also more exotic: `"$\text{}$"`, `"visually-cent,"`, `"sing_er,"` — suggesting more severe token-level corruption.

This makes it less likely to be architecture-specific (Dense vs MoE) and more likely to be a gemma4-generation issue interacting with grammar-constrained sampling.


@rnh0 commented on GitHub (Apr 11, 2026):

Filed a companion report on the Gemma side: google-deepmind/gemma#622 — covering the model-level token repetition tendency that underlies this bug.


@PureBlissAK commented on GitHub (Apr 18, 2026):

🤖 Automated Triage & Analysis Report

Issue: #15502
Analyzed: 2026-04-18T18:21:25.644373

Analysis

  • Type: unknown
  • Severity: medium
  • Components: unknown

Implementation Plan

  • Effort: medium
  • Steps:

This issue has been triaged and marked for implementation.


@rnh0 commented on GitHub (Apr 19, 2026):

Update with cross-runtime evidence. Ran the same prompt + JSON-schema repro on alternative runtimes, 10 seeds each with the same unsloth GGUFs Ollama uses:

| Runtime / config | Rep | Valid JSON |
|---|---|---|
| **Ollama** + Dense 31B GGUF + `format=schema` (this issue) | 10/10 | 0/10 |
| **Ollama** + MoE 26B GGUF + `format=schema` | 5/10 | 1/10 |
| **llama.cpp-server** + Dense 31B **same GGUF** + `response_format=json_schema` | 0/10 | **10/10** |
| **llama.cpp-server** + MoE 26B **same GGUF** + `response_format=json_schema` | 0/10 | **10/10** |
| **vLLM** + Dense 31B AWQ-4bit + `response_format=json_schema` (default xgrammar) | 0/10 word-loops | 0/10 (whitespace-pad loop) |
| **vLLM** same + `StructuredOutputsConfig(disable_any_whitespace=true)` | 0/10 | **9/10** |

Same GGUF file that fails 10/10 on Ollama runs 10/10 clean on llama.cpp. The bug is in Ollama's structured-output / grammar path, not in the ggml weights or the GGUF tokenizer.

A related symptom reproduces on vLLM's xgrammar backend (a whitespace-pad degenerate loop rather than a word loop — the same "low-entropy trap inside the grammar's allowed language" class). vLLM fixes it with `disable_any_whitespace=true`, which forbids arbitrary whitespace between JSON tokens in the grammar.
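In grammar terms, the whitespace trap comes down to a single rule. A GBNF-style sketch (assuming a llama.cpp-like JSON grammar with the usual `ws` helper rule; not Ollama's actual grammar source):

```
# Permissive: optional whitespace between every JSON token — an endless
# run of spaces/newlines is itself a valid "sentence" of the grammar,
# so a degenerate sampler can loop on padding forever.
ws ::= [ \t\n]*

# Restricted (disable_any_whitespace-style): no inter-token padding, so
# the whitespace loop is outside the grammar's language entirely.
# (The word-loop *inside* string values is unaffected either way.)
ws ::= ""
```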

If Ollama's new GGML engine has an analogous knob in its grammar sampler, that would likely fix this. Filed on vLLM side: vllm-project/vllm#40080 (comment with full matrix). Also cross-linked on google-deepmind/gemma#622.

Repro scripts + raw per-seed outputs: https://gist.github.com/rnh0/e02a668c875af46eb5cb46ab0c77132b


---

*Matrix designed with assistance from Claude Code (Anthropic). All tests run locally, deterministic, independently reproducible.*

Reference: github-starred/ollama#71968