[GH-ISSUE #14957] Format output for qwen 3.5 35b model does not count thinking tokens as eval #9617

Open
opened 2026-04-12 22:31:12 -05:00 by GiteaMirror · 2 comments
Owner

Originally created by @BigArty on GitHub (Mar 19, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14957

What is the issue?

Format output for the qwen 3.5 35b model does not count thinking tokens as eval_count; instead they are added to prompt_eval_count.

These are the request params:

response = client_llm.chat(
    model="qwen3.5:35b",
    messages=messages,
    context_length=40000,
    top_p=0.95,
    top_k=20,
    temperature=1,
    repeat_penalty=1.5,
    max_tokens=6000,
    stream=True,
    format=ProcessedTranscription.model_json_schema(),
)

The problem can be seen by running the same request several times: prompt_eval_count changes between runs even though the prompt is identical.
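A minimal reproduction sketch of that check, assuming the ollama Python client and a locally running server; the prompt and schema below are illustrative stand-ins for the originals, and the model tag is the one from the report:

import ollama

client = ollama.Client()  # assumes the default http://localhost:11434 server

messages = [{"role": "user", "content": "Summarize the meeting transcript as JSON."}]
# Stand-in for ProcessedTranscription.model_json_schema()
schema = {"type": "object", "properties": {"summary": {"type": "string"}}}

for i in range(5):
    resp = client.chat(
        model="qwen3.5:35b",   # model tag from the report
        messages=messages,
        format=schema,
        stream=False,
    )
    # For an identical prompt these counts should be stable across runs;
    # if prompt_eval_count drifts while eval_count stays low, the thinking
    # tokens are being folded into the prompt side.
    print(i, getattr(resp, "prompt_eval_count", None), getattr(resp, "eval_count", None))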

Relevant log output


OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

1.18.1

GiteaMirror added the bug label 2026-04-12 22:31:12 -05:00
Author
Owner

@rick-github commented on GitHub (Mar 19, 2026):

Inference with thinking models and structured output is done in two passes. The first accumulates thinking tokens without applying the structured-output restriction; the second takes the output of the first and runs inference with the restriction applied. The code (https://github.com/ollama/ollama/blob/126d8db7f3ad151ead9fb588f3a56cd5f5c9c13b/server/routes.go#L2374) should save the prompt_eval_count/eval_count fields after the first pass and then add the eval_count of the second pass to the stored value. The stored values are then used to update the usage statistics of the generation request.
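A minimal sketch of that accounting, written in Python purely for illustration (the actual logic is Go in server/routes.go); run_pass, Metrics, and the token counts are hypothetical stand-ins, not Ollama APIs:

from dataclasses import dataclass

@dataclass
class Metrics:
    prompt_eval_count: int
    eval_count: int

def run_pass(prompt_tokens: int, constrained: bool) -> Metrics:
    # Stand-in for a real inference call; the counts are illustrative only.
    generated = 40 if constrained else 120   # final JSON vs. thinking tokens
    return Metrics(prompt_eval_count=prompt_tokens, eval_count=generated)

def generate_with_format(prompt_tokens: int) -> Metrics:
    # Pass 1: unconstrained decoding so the model can emit its thinking tokens.
    first = run_pass(prompt_tokens, constrained=False)
    # Pass 2: rerun with the structured-output restriction; its prompt now also
    # contains the thinking tokens produced by pass 1.
    second = run_pass(prompt_tokens + first.eval_count, constrained=True)
    # Correct accounting: keep the first pass's counts and add only the tokens
    # generated in the second pass, instead of reporting the second pass's
    # inflated prompt_eval_count as-is.
    return Metrics(
        prompt_eval_count=first.prompt_eval_count,
        eval_count=first.eval_count + second.eval_count,
    )

print(generate_with_format(1500))  # Metrics(prompt_eval_count=1500, eval_count=160)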

Author
Owner

@BigArty commented on GitHub (Mar 20, 2026):

The main problem for me is that I can't estimate a reasonable max_tokens limit for the generation. In a run with max_tokens=12000, the highest count I saw across 40 requests was about 3k, so I lowered max_tokens to 6k; on the next run more than 50% of the replies were empty because the model had not finished thinking.
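Until the accounting is fixed, one rough workaround, assuming rick-github's description above holds, is to treat the inflation of prompt_eval_count (relative to a baseline run for the same prompt, e.g. one without format=...) as the thinking-token count and size max_tokens against thinking plus output. A sketch with made-up numbers:

def estimated_generated_tokens(prompt_eval_count: int,
                               eval_count: int,
                               baseline_prompt_tokens: int) -> int:
    # Thinking tokens show up as the inflation of prompt_eval_count over the
    # true prompt size; adding eval_count gives thinking + final output.
    thinking_tokens = max(prompt_eval_count - baseline_prompt_tokens, 0)
    return thinking_tokens + eval_count

# Illustrative numbers only: a 1500-token prompt and a format run reporting
# prompt_eval_count=4200 and eval_count=800 imply roughly 2700 thinking
# tokens plus 800 output tokens.
print(estimated_generated_tokens(4200, 800, 1500))  # -> 3500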

Reference: github-starred/ollama#9617