[GH-ISSUE #15502] gemma4:31b repetition loop during constrained JSON generation with free-text string fields #71968

Open
opened 2026-05-05 03:10:36 -05:00 by GiteaMirror · 6 comments

Originally created by @rnh0 on GitHub (Apr 11, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15502

## Summary

`gemma4:31b` enters word-repetition loops when generating long free-text strings inside a JSON schema constraint (`format=`). A word doubles, then collapses into a single repeated token that fills the remaining `num_predict` budget, leaving the JSON unterminated. The bug rate is **60-100%** depending on the prompt, across 39 trials.

**This is NOT the `<unused>` token / GEMV buffer overlap bug** (llama.cpp#21321 / #21566). The repeated tokens are normal English words, and zero `<unused>` tokens were observed in any trial.

## Observed behavior

Actual output from the minimal repro below (`seed=0`):

```json
{
  "description": "The horizon is a bleeding, vibrant, orange-gold single own line where the sky meets
the ocean, as the sun dips low, casting a long, shimmering path of amber own own own own own own own
own own own own own own own own own own own own own own own own own own own own own own own own own
```

(300 chars, unterminated JSON — the `"own"` token repeats until `num_predict` is exhausted.)

The cascade pattern:

1. Normal generation starts fine inside a JSON string value
2. A token intrudes: `"amber own"`
3. Collapses into a single repeated token: `"own own own own own..."`
4. Fills remaining `num_predict` budget (8192 tokens)
5. JSON left unterminated -> parse error

## Root cause isolation

We ran 39 trials across 13 test configurations, varying one condition at a time. Three conditions are **all required** to trigger the bug:

| # | Test | Rep Bugs | JSON Fail | What it proves |
|---|------|----------|-----------|----------------|
| 1 | 4 different prompts + schema + free-text | 8/12 | 10/12 | Not prompt-specific (60-100% rate) |
| 2 | **no format=** (free generation) | **0/3** | 0/3 | **format= IS required** |
| 3 | schema + **no free-text fields** | **0/3** | 0/3 | **Free-text strings in JSON trigger it** |
| 4 | schema + free-text + **think=False** | **0/3** | 3/3 | No repetition, but JSON broken (#15260) |
| 5a | **gemma4:26b** (MoE) + schema + free-text | **0/3** | 3/3 | **Dense (31b) only**, MoE has different JSON issues |
| 5b | **gemma3:27b** + schema + free-text | **0/3** | 0/3 | **gemma4-specific regression** |
| 6a | repeat_penalty=**1.0** | 2/3 | 2/3 | Penalty has no effect |
| 6b | repeat_penalty=**1.15** | 2/3 | 2/3 | Same seeds fail regardless |
| 6c | repeat_penalty=**1.5** | 2/3 | 2/3 | Cannot suppress the cascade |

### The three necessary conditions

1. **`gemma4:31b` (Dense)** — gemma4:26b (MoE) and gemma3:27b do not exhibit this bug
2. **`format=` with a JSON schema** — removing the grammar constraint eliminates the bug entirely
3. **Free-text string fields in the schema** (e.g., `"description": {"type": "string"}` requesting multi-sentence output) — a simple schema with only arrays and enums is clean

Vision input is **not** required — text-only prompts reproduce at the same rate.

*Note: The test matrix was collected using a longer prompt with vision input variants. The same bug reproduces with the simplified text-only prompt shown below. The minimal repro was verified independently.*
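For contrast with condition 3, a schema of the kind that tested clean — only enums and arrays, no free-text string fields — might look like this (an illustrative sketch; the field names are invented and this is not one of the actual test schemas):

```python
# Every string value is enum-constrained, so the model never generates
# sustained free text inside a JSON string — the reported clean case.
SAFE_SCHEMA = {
    "type": "object",
    "required": ["time_of_day", "weather", "tags"],
    "properties": {
        "time_of_day": {
            "type": "string",
            "enum": ["dawn", "noon", "sunset", "night"],
        },
        "weather": {
            "type": "string",
            "enum": ["clear", "cloudy", "stormy"],
        },
        "tags": {
            "type": "array",
            "items": {"type": "string", "enum": ["beach", "ocean", "sky", "sand"]},
        },
    },
}
```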

## Minimal reproduction

No images, no dependencies beyond the `ollama` Python package. Seeds 0 and 84 hit the repetition loop; seed 42 produces malformed JSON of a different kind. All 3/3 seeds produce broken output.

```python
import ollama

SCHEMA = {
    "type": "object",
    "required": ["description", "analysis", "tags"],
    "properties": {
        "description": {
            "type": "string",
            "description": "At least 3 detailed sentences.",
        },
        "analysis": {
            "type": "string",
            "description": "Several paragraphs of analysis.",
        },
        "tags": {"type": "array", "items": {"type": "string"}},
    },
}

PROMPT = (
    "Describe a beach scene at sunset in detail. "
    "Write at least 3 full sentences for description "
    "and several paragraphs for analysis."
)

response = ollama.chat(
    model="gemma4:31b",
    messages=[{"role": "user", "content": PROMPT}],
    format=SCHEMA,
    options={
        "num_ctx": 32768,
        "num_predict": 8192,
        "repeat_penalty": 1.15,
        "repeat_last_n": 256,
        "seed": 0,
    },
)
content = response.message.content
print(f"Length: {len(content)} chars")
print(f"Tail: ...{content[-200:]}")
```
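For running many seeds, the failure modes can be bucketed mechanically. A sketch of the classifier logic we use to tally trials (the 10-repeat threshold is an arbitrary choice of ours, not part of any spec):

```python
import json
import re

def classify_output(content: str) -> str:
    """Bucket a model response: 'repetition_loop', 'valid', or 'malformed'."""
    # A single word repeated 10+ times in a row marks the degenerate loop.
    if re.search(r"\b(\w+)(?:\s+\1){9,}\b", content):
        return "repetition_loop"
    try:
        json.loads(content)
        return "valid"
    except json.JSONDecodeError:
        return "malformed"

# Hypothetical usage with the repro above:
# for seed in (0, 42, 84):
#     response = ollama.chat(..., options={..., "seed": seed})
#     print(seed, classify_output(response.message.content))
```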

### Expected output

Valid JSON (~500-2000 chars) with `description`, `analysis`, and `tags` fields, properly terminated.

### Actual output

Unterminated JSON (300 chars with `seed=0`, up to ~33,000 chars with other seeds). The `description` or `analysis` field enters a repetition loop partway through and fills the remaining token budget.

## Expected vs actual behavior

| | Expected | Actual |
|---|----------|--------|
| **Output** | Valid, terminated JSON with multi-sentence free-text fields | Word repetition loop fills num_predict budget, JSON unterminated |
| **repeat_penalty** | Higher values should suppress repetition | No effect at any tested value (1.0, 1.15, 1.5) — same seeds fail identically |
| **Grammar constraint** | Should enforce valid JSON structure | Grammar allows the repeated word because it's a valid string character sequence |

## System info

| Component | Value |
|-----------|-------|
| GPU | NVIDIA GeForce RTX 5090 (32 GB VRAM) |
| Driver (running kernel module) | 580.126.16 |
| CUDA Version | 13.0 |
| Ollama | **0.20.5** |
| OS | Ubuntu 24.04.3 LTS |
| Kernel | 6.17.0-14-generic x86_64 |
| CPU | AMD Ryzen 7 9800X3D |
| Model | gemma4:31b (SHA `6316f0629137`, 19 GB) |

## Additional context

### Why repeat_penalty has no effect

We tested `repeat_penalty` at 1.0, 1.15, and 1.5 — identical seeds fail identically at all values. Our hypothesis: the grammar constraint limits token choices at each step, and inside a JSON string value any valid string content (including word repetition) is allowed. If the model's logit distribution degenerates to strongly favor a single token, the grammar has no mechanism to reject it, regardless of penalty strength.
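The interaction can be illustrated with a toy greedy sampler (a hand-rolled sketch with made-up logit values — not Ollama's actual sampling code):

```python
import math

def sample_greedy(logits, allowed, recent, penalty):
    """Greedy pick from grammar-masked, repeat-penalized logits."""
    best_tok, best_score = None, -math.inf
    for tok, logit in logits.items():
        if tok not in allowed:           # grammar mask: token must be legal here
            continue
        if tok in recent:                # llama.cpp-style repeat penalty
            logit = logit / penalty if logit > 0 else logit * penalty
        if logit > best_score:
            best_tok, best_score = tok, logit
    return best_tok

# Degenerate distribution: "own" dominates all alternatives.
logits = {"own": 12.0, "waves": 2.0, "sand": 1.9, '"': 1.5}
# Inside a JSON string the grammar allows any string content, including
# the closing quote — so "own" is never masked out.
allowed = {"own", "waves", "sand", '"'}
recent = ["own"] * 64  # "own" already repeated many times

for p in (1.0, 1.15, 1.5):
    print(p, sample_greedy(logits, allowed, recent, p))  # "own" every time
```

Dividing the dominant logit by even 1.5 (12.0 → 8.0) still leaves it far above the alternatives, so the argmax never changes — consistent with identical seeds failing at every penalty value.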

### Interaction with ollama#15260

Test 4 (`think=False` + `format=`) produced 0/3 repetition bugs but 3/3 JSON failures (output was plain text, not JSON). This confirms #15260: when thinking is disabled, the format constraint is never applied because the end-of-thinking token never fires. This accidentally "fixes" the repetition bug by removing the grammar constraint entirely — but breaks structured output.

### gemma4:26b (MoE) behavior

The MoE variant produced 0/3 repetition bugs but 3/3 JSON failures of a different kind (malformed JSON, not repetition loops). The MoE model has separate structured output issues that may be related to #15428.

### Not the GEMV buffer overlap bug

Zero `<unused>` tokens were observed across all 39 trials. The repeated tokens are normal English words (`"beach"`, `"own"`, `"same"`, `"companion"`, `"fatigue,en"`). This is a distinct bug from the CUDA GEMV fusion buffer overlap fixed in llama.cpp#21566 / b8702.


---

*Tested on 2026-04-11. 39 trials across 13 test configurations.*

*The test matrix and isolation methodology were designed with assistance from Claude Code (Anthropic). All tests were run locally on the hardware described above. Results are deterministic and independently reproducible.*

## Related issues

- ollama/ollama#15260 — `think=false` breaks `format=` (format constraint silently ignored)
- ollama/ollama#15386 — Structured output contradicts model's own thinking (constrained decoding vs thinking tension)
- ollama/ollama#15350 — Flash Attention hangs on gemma4:31b Dense (different bug, same model)
- ollama/ollama#15428 — gemma4:26b empty response with long system prompts
- ggml-org/llama.cpp#21321 — Gemma 4 `<unused24>` tokens (GEMV buffer overlap — different root cause)
- ggml-org/llama.cpp#21566 — Fix for GEMV buffer overlap (does NOT fix this bug)

@rick-github commented on GitHub (Apr 11, 2026):

```diff
--- 15502.py.orig	2026-04-11 17:11:08.099297197 +0200
+++ 15502.py	2026-04-11 17:09:47.205267951 +0200
@@ -1,4 +1,5 @@
 import ollama
+import json
 
 SCHEMA = {
     "type": "object",
@@ -20,6 +21,7 @@
     "Describe a beach scene at sunset in detail. "
     "Write at least 3 full sentences for description "
     "and several paragraphs for analysis."
+    "\nReturn a JSON structure with this schema: " + json.dumps(SCHEMA)
 )
 
 response = ollama.chat(
```
The model doesn't know that the tokens it's going to generate are going to be constrained, so the probability distribution is not prepared for the first token to be restricted to `{`, which is a fairly low-probability token in an output talking about a beach. By telling the model the expected output format, it can better prepare.

```json
{
  "description": "The horizon is painted in vibrant hues of molten gold, deep violet, and burnt orange as the sun dips slowly below the waterline. Gentle, rhythmic waves lap against the powdery white sand, leaving behind a shimmering mirror of wet shoreline that reflects the kaleidoscope of the sky. A few scattered seashells and a lone, weathered driftwood log lie nestled in the tide's reach, while the salty breeze carries the distant, melodic cry of a departing seagull.",
  "analysis": "The scene described is a classic study in chromatic contrast and atmospheric peace. By utilizing a color palette of 'molten gold' and 'deep violet,' the description evokes a sense of luxury and transition, marking the boundary between the luminosity of day and the mystery of night. The focus on rhythmic waves and the 'shimmering mirror' emphasizes a theme of duality—the physical world meeting its own reflection—which suggests a moment of introspection and stillness.\n\nFrom a structural standpoint, the sensory details are carefully curated to move from the macroscopic to the microscopic. The narrative begins with the vastness of the horizon, narrows down to the interaction of water and sand, and finally rests on small, tangible objects like seashells and driftwood. This narrowing focus mimics the act of observing a a peaceful environment, grounding the viewer in the a localized reality while maintaining an awareness of the cosmic scale of the sunset.\n\nFurthermore, the auditory elements, such as the 'melodic cry' of the seagull and the 'rhythmic' sound of the waves, serve to fill the silence without disrupting it. These sounds provide a rhythmic cadence to the scene, reinforcing the idea that nature operates on a predictable, soothing cycle. The overall effect is one of transcendental tranquility, where the observer is invited to pause and experience the ephemeral beauty of a fleeting moment.",
  "tags": [
    "nature",
    "sunset",
    "beach",
    "atmospheric",
    "descriptive writing"
  ]
}
```

@rnh0 commented on GitHub (Apr 11, 2026):

Thanks for the quick reply and suggestion! Confirmed — including the schema in the prompt does significantly improve the toy example from the original report (0/5 repetition loops, 5/5 valid JSON vs. 2/3 and 0/3 without it).

However, stress testing shows the improvement doesn't hold for more demanding use cases.

For each test below, the prompt includes `"\nReturn a JSON structure with this schema: " + json.dumps(SCHEMA)` as you suggested, and `format=SCHEMA` is set. 10 seeds per test, same options as the original report.

| Test | Repetition loops | Valid JSON |
|------|-----------------|------------|
| Short output (original repro) | 0/5 | 5/5 |
| **1000+ words requested** | **10/10** | 0/10 |
| Complex nested schema (5 fields, nested objects, enums) | 7/8 | 1/8 |
| 6 paragraph-length free-text fields | 3/10 | 7/10 |
| Vision input + schema in prompt | 3/10 | 4/10 |
| Minimal hint (`"Respond in JSON."` instead of full schema) | 4/10 | 5/10 |

The workaround helps the model prepare for JSON output during thinking, which is enough for short responses. But the underlying degeneration still occurs during sustained free-text generation inside JSON strings — the model's logit distribution still collapses into single-token repetition regardless of prompt priming.

The word doubling that appears even in successful outputs (`"a a"`, `"the the"`, `"sapphire sapphire"`) seems to be the precursor. For short outputs, generation ends before it cascades. For longer outputs, it inevitably does.
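A quick check for this precursor (a small sketch; it matches exact, case-sensitive doublings separated by a single space):

```python
import re

# A word followed immediately by an identical copy of itself, e.g. "a a".
DOUBLED = re.compile(r"\b(\w+) \1\b")

def doubled_words(text: str) -> list[str]:
    """Return words that appear immediately doubled in the text."""
    return [m.group(1) for m in DOUBLED.finditer(text)]

print(doubled_words("mimics the act of observing a a peaceful environment"))  # ['a']
```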

Stress test script: https://gist.github.com/rnh0/18a7f25c70da00c8e47235e849bc5798

Note: the test crashed partway through when ollama hung during model unload ("Stopping..." state). The complex schema result is 8 trials instead of 10.


@rnh0 commented on GitHub (Apr 11, 2026):

**Correction:** Our original report stated that gemma4:26b (MoE) did not exhibit the repetition bug (0/3 in our initial test). With more trials this turns out to be wrong.

Expanded testing with 10 seeds per test:

| Test | gemma4:26b (MoE) Rep / Valid | gemma4:31b (Dense) Rep / Valid |
|------|------------------------------|-------------------------------|
| Short output | 4/10 / 1/10 | 7/10 / 1/10 |
| 1000+ words | 5/10 / 1/10 | (not re-run) |
| Complex nested schema | 4/10 / 0/10 | (not re-run) |
| 6 free-text fields | 4/10 / 0/10 | (not re-run) |

Both model variants are affected. The 31b Dense has a higher repetition rate on short outputs (~70% vs ~40%), but the 26b MoE has equally poor JSON validity (0-1/10). The repeated tokens in 26b are also more exotic: `"$\text{}$"`, `"visually-cent,"`, `"sing_er,"` — suggesting more severe token-level corruption.

This makes it less likely to be architecture-specific (Dense vs MoE) and more likely to be a gemma4-generation issue interacting with grammar-constrained sampling.


@rnh0 commented on GitHub (Apr 11, 2026):

Filed a companion report on the Gemma side: google-deepmind/gemma#622 — covering the model-level token repetition tendency that underlies this bug.


@PureBlissAK commented on GitHub (Apr 18, 2026):

🤖 Automated Triage & Analysis Report

Issue: #15502
Analyzed: 2026-04-18T18:21:25.644373

Analysis

  • Type: unknown
  • Severity: medium
  • Components: unknown

Implementation Plan

  • Effort: medium
  • Steps:

This issue has been triaged and marked for implementation.


@rnh0 commented on GitHub (Apr 19, 2026):

Update with cross-runtime evidence. Ran the same prompt + JSON-schema repro on alternative runtimes, 10 seeds each with the same unsloth GGUFs Ollama uses:

| Runtime / config | Rep | Valid JSON |
|---|---|---|
| **Ollama** + Dense 31B GGUF + `format=schema` (this issue) | 10/10 | 0/10 |
| **Ollama** + MoE 26B GGUF + `format=schema` | 5/10 | 1/10 |
| **llama.cpp-server** + Dense 31B **same GGUF** + `response_format=json_schema` | 0/10 | **10/10** |
| **llama.cpp-server** + MoE 26B **same GGUF** + `response_format=json_schema` | 0/10 | **10/10** |
| **vLLM** + Dense 31B AWQ-4bit + `response_format=json_schema` (default xgrammar) | 0/10 word-loops | 0/10 (whitespace-pad loop) |
| **vLLM** same + `StructuredOutputsConfig(disable_any_whitespace=true)` | 0/10 | **9/10** |

Same GGUF file that fails 10/10 on Ollama runs 10/10 clean on llama.cpp. The bug is in Ollama's structured-output / grammar path, not in the ggml weights or the GGUF tokenizer.

A related symptom reproduces on vLLM's xgrammar backend (a whitespace-pad degenerate loop rather than a word loop — the same "low-entropy trap inside the grammar's allowed language" class). vLLM fixes it with `disable_any_whitespace=true`, which forbids arbitrary whitespace between JSON tokens in the grammar.
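In grammar terms, the whitespace trap comes down to a single rule. A GBNF-style sketch (assuming a llama.cpp-like JSON grammar with the usual `ws` helper rule; not Ollama's actual grammar source):

```
# Permissive: optional whitespace between every JSON token — an endless
# run of spaces/newlines is itself a valid "sentence" of the grammar,
# so a degenerate sampler can loop on padding forever.
ws ::= [ \t\n]*

# Restricted (disable_any_whitespace-style): no inter-token padding, so
# the whitespace loop is outside the grammar's language entirely.
# (The word-loop *inside* string values is unaffected either way.)
ws ::= ""
```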

If Ollama's new GGML engine has an analogous knob in its grammar sampler, that would likely fix this. Filed on vLLM side: vllm-project/vllm#40080 (comment with full matrix). Also cross-linked on google-deepmind/gemma#622.

Repro scripts + raw per-seed outputs: https://gist.github.com/rnh0/e02a668c875af46eb5cb46ab0c77132b


---

*Matrix designed with assistance from Claude Code (Anthropic). All tests run locally, deterministic, independently reproducible.*

Reference: github-starred/ollama#71968