[GH-ISSUE #15626] Expose max_soft_tokens (image token budget) as a runtime parameter for Gemma 4 models #56483

Open
opened 2026-04-29 10:53:27 -05:00 by GiteaMirror · 4 comments

Originally created by @somthing3000 on GitHub (Apr 16, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15626

Gemma 4's vision encoder supports a variable-resolution token budget via `max_soft_tokens`, but this value is currently hardcoded to `280` in `model/models/gemma4/process_image.go` ([see L25–31](https://github.com/ollama/ollama/blob/7d271e6dc9fb114d48b91a1ed2ed3d414178a883/model/models/gemma4/process_image.go#L25-L31)). There is no way to override it at runtime through the API or via the ollama-python library.
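For illustration, here is a minimal sketch of the kind of change being requested. This is not the actual Ollama source: the constant names, the `softTokenBudget` helper, and the `max_soft_tokens` option key are all hypothetical (the 40-token lower bound is taken from a comment later in this thread):

```go
// Sketch only: illustrates the requested change, not the real
// model/models/gemma4/process_image.go.
package gemma4

const (
	defaultMinSoftTokens = 40  // assumed lower bound, per a later comment
	defaultMaxSoftTokens = 280 // the budget currently hardcoded at compile time
)

// softTokenBudget returns the override from request options when present,
// falling back to the compiled-in default otherwise.
func softTokenBudget(opts map[string]any) int {
	if v, ok := opts["max_soft_tokens"].(int); ok && v >= defaultMinSoftTokens {
		return v
	}
	return defaultMaxSoftTokens
}
```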

Google's own documentation for Gemma 4 explicitly recommends tuning this budget for OCR tasks that require higher image resolution:
https://ai.google.dev/gemma/docs/capabilities/vision/image#variable_resolution_token_budget

The default budget of 280 tokens is insufficient for fine-grained visual tasks. In practice, this causes measurable accuracy regressions. A concrete example using license plate recognition with `gemma4-e4b`:

| Configuration | Result | Expected |
|---|---|---|
| Ollama default (280 token budget) | `YRSGNB` | `YRSGNBY` |
| HuggingFace Transformers with `max_soft_tokens=560` | `YRSGNBY` ✓ | `YRSGNBY` |

The same degradation is reproducible with the unquantized model when the budget is left at the default, and it even affects higher-parameter variants such as 27B. Please refer to this discussion: https://github.com/ollama/ollama-python/issues/651.
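If the budget were exposed as a request option, client usage could look like the following sketch. The `max_soft_tokens` option key is the hypothetical parameter this issue requests and is not accepted by Ollama today; the file name and prompt are illustrative:

```go
// Hypothetical usage of the requested runtime parameter via the Go client.
package main

import (
	"context"
	"fmt"
	"log"
	"os"

	"github.com/ollama/ollama/api"
)

func main() {
	img, err := os.ReadFile("plate.jpg") // illustrative input image
	if err != nil {
		log.Fatal(err)
	}
	client, err := api.ClientFromEnvironment()
	if err != nil {
		log.Fatal(err)
	}
	req := &api.GenerateRequest{
		Model:  "gemma4-e4b",
		Prompt: "Transcribe the license plate in this image.",
		Images: []api.ImageData{img},
		// The requested parameter; hypothetical until implemented.
		Options: map[string]any{"max_soft_tokens": 560},
	}
	err = client.Generate(context.Background(), req, func(r api.GenerateResponse) error {
		fmt.Print(r.Response)
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println()
}
```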

GiteaMirror added the feature request label 2026-04-29 10:53:27 -05:00

@seamon67 commented on GitHub (Apr 18, 2026):

This is even mentioned in Ollama's [Gemma 4 Model Page](https://ollama.com/library/gemma4) - at the very end.


@PureBlissAK commented on GitHub (Apr 18, 2026):

🤖 Automated Triage & Analysis Report

Issue: #15626
Analyzed: 2026-04-18T18:19:42.404408

Analysis

  • Type: unknown
  • Severity: medium
  • Components: unknown

Implementation Plan

  • Effort: medium
  • Steps:

This issue has been triaged and marked for implementation.


@seamon67 commented on GitHub (Apr 18, 2026):

I tried creating a custom build with min and max tokens set to 560 and 1120 respectively, and the difference in OCR tasks is huge.


@seamon67 commented on GitHub (Apr 19, 2026):

Update

So I tried a bunch more things:

  • Setting `minTokens := 40` and `maxTokens := 1120` in `process_image.go` works: it gives really good results on some images but crashes on others.
  • Changing the sliding window to `kvcache.NewSWAMemCache(slidingWindowLen, 8192, m.Shift)` or `kvcache.NewSWAMemCache(2048, 8192, m.Shift)` in `model.go` fixes some of the crashes above, but leads to a race condition where it crashes every once in a while; the same image works fine on a retry (see the retry sketch after this list).
  • I tried a bunch of things, but nothing seems to have fixed the race condition.
  • Also, a large batch of images (20-50 in one go) usually leads to a crash that is not related to the race condition.
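Since the crash described above is intermittent and the same image succeeds on retry, one crude client-side mitigation is to re-issue failed generations. This is an editor's sketch against the Go client, not part of the reported patch, and the error handling is deliberately naive:

```go
// Sketch of a client-side workaround for the intermittent crash: retry
// the same request a few times rather than failing on the first error.
package workaround

import (
	"context"
	"fmt"

	"github.com/ollama/ollama/api"
)

// GenerateWithRetry re-issues req up to attempts times and returns the
// accumulated response of the first attempt that succeeds.
func GenerateWithRetry(ctx context.Context, client *api.Client, req *api.GenerateRequest, attempts int) (string, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		var out string
		err := client.Generate(ctx, req, func(r api.GenerateResponse) error {
			out += r.Response
			return nil
		})
		if err == nil {
			return out, nil
		}
		lastErr = err
	}
	return "", fmt.Errorf("generate failed after %d attempts: %w", attempts, lastErr)
}
```

Calling `GenerateWithRetry(ctx, client, req, 3)` in place of a direct `Generate` call papers over the intermittent race at the cost of repeated work; it does nothing for the separate batch-related crash in the last bullet.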