[GH-ISSUE #11538] Qwen3:14b not using <tool_call> and calling functions with plaintext #7613

Closed
opened 2026-04-12 19:42:22 -05:00 by GiteaMirror · 7 comments

Originally created by @maxjo020418 on GitHub (Jul 26, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11538

Originally assigned to: @jmorganca on GitHub.

What is the issue?

When using `/v1/chat/completions` with Qwen3:14b, after some amount of chat history has accumulated, the model starts making tool calls in plaintext.

I suspect it's related to context-size limits, since it works fine until a certain amount of chat history accumulates, and increasing the context size (~8k) also fixes it (up to a point; after that it does the same thing).

But the raw ChatML capture for the function-call description seems intact, since the system prompt appears to be preserved even when other history is purged:

```
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
```

(plus, it seems to know which functions exist, so it is digesting the `<tool>` info properly?)

These responses are like this:

```
<think> Okay, the user is asking to list all articles again to check. Let me see which function to use. The available functions are index, create, update, and erase. The index function is designed to list all articles, so that's the one I need. The user's request is straightforward—they want to see all articles, so I should call the index function. No parameters are needed for index, so the arguments will be empty. I'll format the selectFunctions call correctly with the index function and an empty arguments object. </think>

{"name": "selectFunctions", "arguments": {"functions": [{"name": "index", "reason": "The user wants to list all articles to check, which requires the 'index' function."}]}}
```

with no `<tool_call>` wrapper.

The attached file is one example of the model constantly producing output like this:
[20250726_223211_323072.json](https://github.com/user-attachments/files/21445891/20250726_223211_323072.json)

Is this just the model hallucinating/forgetting its instructions, or is there more to it? It's pretty problematic because it happens very consistently.

Thank you!

Relevant log output


OS

WSL2

GPU

Nvidia

CPU

AMD

Ollama version

0.9.6

GiteaMirror added the bug label 2026-04-12 19:42:22 -05:00

@jmorganca commented on GitHub (Jul 26, 2025):

@maxjo020418 sorry you hit this issue. It's most likely because the context window isn't long enough, so quality degrades over time. Have you tried setting `OLLAMA_CONTEXT_LENGTH` (via [setting environment variables](https://github.com/ollama/ollama/blob/main/docs/faq.md#setting-environment-variables-on-linux)) or `num_ctx` (if using the API)?
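
For reference, a minimal sketch of both approaches (assuming the native `/api/chat` endpoint for the per-request option; as far as I know, the OpenAI-compatible `/v1` endpoint does not accept `num_ctx` directly):

```shell
# Server-wide default context length, set via environment variable:
OLLAMA_CONTEXT_LENGTH=8192 ollama serve

# Per-request context length, passed in the native API's options field:
curl localhost:11434/api/chat -d '{
  "model": "qwen3:14b",
  "messages": [{"role": "user", "content": "hello"}],
  "options": {"num_ctx": 8192}
}'
```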


@maxjo020418 commented on GitHub (Jul 27, 2025):

Yes, as I've mentioned, I increased the context size, which works for a bit, but then it does the same thing again.

I guess it's a limitation of the model rather than of Ollama?


@rick-github commented on GitHub (Jul 27, 2025):

I created a model with:

```
echo FROM qwen3:14b > Modelfile
echo PARAMETER num_ctx 8192 >> Modelfile
ollama create qwen3-14b-8k
```

and ran the JSON example 100 times with a 100% success rate:

```console
$ for i in {1..100} ; do curl -s localhost:11434/v1/chat/completions -d @20250726_223211_323072.json | jq '.choices[0].message.tool_calls[0].function.name' ; done | sort | uniq -c
    100 "selectFunctions"
```

I think it's possible that your context is not as large as you think. The prompt is 4699 tokens long, so a context buffer of 4096 tokens (the default) will cause the tool definitions and tool-call formatting instructions (which include the information about `<tool_call></tool_call>`) to be lost. However, more recent messages in the message list still tell the model what tools are available and roughly how to format a tool call, so the model does its best, returning a tool call without the `<tool_call>` wrapper.
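
One quick way to sanity-check this (assuming `jq` is installed) is to look at the token usage the OpenAI-compatible endpoint reports back:

```console
$ curl -s localhost:11434/v1/chat/completions -d @20250726_223211_323072.json | jq '.usage'
```

If `prompt_tokens` is pressing up against your configured `num_ctx`, the head of the prompt (where the system message and tool instructions live) has likely been trimmed away.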

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging.


@maxjo020418 commented on GitHub (Jul 28, 2025):

I see, I'll check on that and post an update.
Thank you!


@maxjo020418 commented on GitHub (Jul 29, 2025):

Okay, after some observation, it seems that extra-large chunks of function call/response history and chat history are causing this.

Ollama does try to cut the history to fit the context window, but either function-call history is not being removed aggressively enough (long function-call records fill the history up), or Ollama's trimming is conservative and occasionally leaves the prompt slightly over the context window (especially when the earliest message that wasn't removed is a large chunk).

For now, I might cut down on function-call history and other parts to reduce context usage (I can't increase the context due to VRAM limitations). Maybe in future updates Ollama could have options for cutting off history more aggressively?


@rick-github commented on GitHub (Jul 29, 2025):

The Ollama server will remove enough messages from the message list to make it fit the available context, even if that leaves just one free token. It removes the oldest messages first, while doing its best to preserve the system message. Other than that, it doesn't distinguish between different types of messages. Once inference starts, the buffer will be rotated if it fills up, which results in the tokens at the head of the buffer being lost. This is where the initial system message is, which includes the tool list and instructions. It's up to the client to manage the message list if it wants more selective control over how messages are handled.
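
To illustrate that last point, a rough client-side sketch (assuming `jq`; the file name `request.json` is hypothetical): keep the system message plus only the most recent non-system messages, so the tool list and formatting instructions never fall out of the window:

```shell
# Keep all system messages, plus only the last 8 non-system messages,
# so the tool definitions and formatting instructions stay in context.
jq '.messages = ([.messages[] | select(.role == "system")]
                 + ([.messages[] | select(.role != "system")] | .[-8:]))' \
   request.json > trimmed.json
```

A real client would also want to avoid splitting an assistant tool call from its matching tool response when trimming, since an orphaned half of that exchange can confuse the model.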


@maxjo020418 commented on GitHub (Jul 31, 2025):

Thanks for the reply; I'll close the issue since it seems to be a context-window problem with the model itself rather than with Ollama.
