[GH-ISSUE #13840] Generation stops after tool call with Ollama (GLM-4.7-Flash) #9061

Closed
opened 2026-04-12 21:53:35 -05:00 by GiteaMirror · 13 comments

Originally created by @HuysArthur on GitHub (Jan 22, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/13840

Originally assigned to: @jmorganca on GitHub.

What is the issue?

When running glm-4.7-flash:bf16 locally via Ollama (with context_length: 80000) and connecting it to OpenCode, the model halts generation immediately after outputting any tool call. I have to manually type "continue" every single time to resume.
The same model hosted through vLLM (instead of Ollama) works flawlessly in OpenCode — tool calls are handled normally without interruption.
This seems specific to Ollama's handling of this GLM-4.7 Flash variant during tool use. Has anyone else run into this?

Relevant log output


OS

linux arm

GPU

NVIDIA GH200

CPU

No response

Ollama version

0.14.3

GiteaMirror added the bug label 2026-04-12 21:53:35 -05:00

@rick-github commented on GitHub (Jan 22, 2026):

Sounds like OpenCode is not returning the tool result to ollama. Running the agent loop shows that glm-4.7-flash:bf16 will process tool results.

```console
$ ollama run glm-4.7-flash:bf16 --experimental --experimental-yolo

This experimental version of Ollama has the bash tool enabled.
Models can read files on your computer, or run commands (after you allow them).

warning: yolo mode - all tool approvals will be skipped
>>> what is the time?
Thinking...
The user is asking "what is the time?" - a simple question about the current time. To answer this, I need to use a tool to get the current time from the system. I have the bash tool available which can run system commands. I can use the `date` command to get the current time.
...done thinking.

running: Bash: date
  Thu Jan 22 10:50:28 AM CET 2026

Thinking...
The `date` command returned the current time as "Thu Jan 22 10:50:28 AM CET 2026". This tells me:

- It's Thursday, January 22nd, 2026
- The time is 10:50:28 AM
- It's in Central European Time (CET) timezone

I should provide this information in a clear, readable format for the user.
...done thinking.

The current time is **10:50:28 AM** on Thursday, January 22nd, 2026 (CET).
```

Does OpenCode have logs?

[Server log](https://docs.ollama.com/troubleshooting) from ollama with `OLLAMA_DEBUG=2` may help in debugging. Note that this will include the prompt so be aware of PII.

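For reference, the agent loop described above amounts to the client sending the tool result back to the server in a follow-up request. A minimal sketch against Ollama's /api/chat endpoint (the model tag and bash tool are taken from this thread; the exact tool schema is illustrative, not OpenCode's):

```console
# Request 1: offer a tool. A working client receives a message containing
# tool_calls (instead of content) and must not stop there.
$ curl -s http://localhost:11434/api/chat -d '{
    "model": "glm-4.7-flash:bf16",
    "stream": false,
    "messages": [{"role": "user", "content": "what is the time?"}],
    "tools": [{"type": "function", "function": {
      "name": "bash",
      "description": "Run a shell command",
      "parameters": {"type": "object",
        "properties": {"command": {"type": "string"}},
        "required": ["command"]}}}]
  }'

# Request 2: append the assistant tool call plus a role:"tool" message
# carrying the command output; the model then resumes generating.
$ curl -s http://localhost:11434/api/chat -d '{
    "model": "glm-4.7-flash:bf16",
    "stream": false,
    "messages": [
      {"role": "user", "content": "what is the time?"},
      {"role": "assistant", "tool_calls": [{"function":
        {"name": "bash", "arguments": {"command": "date"}}}]},
      {"role": "tool", "content": "Thu Jan 22 10:50:28 AM CET 2026"}
    ]
  }'
```

If the second request never happens, generation simply ends after the tool call, which matches the behavior reported here.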

@rick-github commented on GitHub (Jan 22, 2026):

Just installed OpenCode and gave it a try, tool calls seem to work fine:

```console
$ opencode run "Create the file 'README' with the contents 'hello world'" --model ollama/glm-4.7-flash-c198k
|  Write    README

File ' README' created successfully.

$ ls -al
total 12
drwxrwxr-x 2 rick rick 4096 Jan 22 18:42 .
drwxrwxr-x 3 rick rick 4096 Jan 22 18:42 ..
-rw-rw-r-- 1 rick rick   11 Jan 22 18:42 README
$ cat README
hello world
```

@sohelzerdoumi commented on GitHub (Jan 22, 2026):

I’m seeing the same issue with Ollama when using tools with glm-4.7-flash:latest.

"glm-4.6 tool call parsing failed" error="failed to parse XML: XML syntax error on line 1: element <tool_call> closed by </arg_key>"

The error only appears once the context grows beyond ~5k tokens and multiple tool calls are chained. With shorter contexts, everything works fine. When it fails, the model outputs malformed XML (e.g. mismatched <tool_call> / <arg_key> tags), which breaks parsing.

My suspicion is that this is related to quantization.


@rick-github commented on GitHub (Jan 22, 2026):

What context size have you set for the model?


@sohelzerdoumi commented on GitHub (Jan 22, 2026):

My context length is 30,000 tokens and I have 40 GB of VRAM, which is enough to fit the model:

`Environment="OLLAMA_CONTEXT_LENGTH=30000"`

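For anyone reproducing this: on a default Linux install, that `Environment=` line goes in a systemd drop-in for the ollama service, along the lines of the Ollama FAQ (a sketch; adjust the value as needed):

```console
$ sudo systemctl edit ollama.service
# in the editor that opens, add:
#   [Service]
#   Environment="OLLAMA_CONTEXT_LENGTH=30000"
$ sudo systemctl daemon-reload
$ sudo systemctl restart ollama
```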

@rick-github commented on GitHub (Jan 22, 2026):

[Server log](https://docs.ollama.com/troubleshooting) from ollama with `OLLAMA_DEBUG=2` may help in debugging. Note that this will include the prompt so be aware of PII.

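A sketch of capturing that log on a systemd install (standard troubleshooting steps; set the variable the same way as `OLLAMA_CONTEXT_LENGTH` above):

```console
$ sudo systemctl edit ollama.service    # add Environment="OLLAMA_DEBUG=2" under [Service]
$ sudo systemctl restart ollama
$ journalctl -u ollama --no-pager -f    # follow the DEBUG/TRACE output
```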

@sohelzerdoumi commented on GitHub (Jan 22, 2026):

I increased max_tokens and context to 40,000 tokens each, and it worked. My bad.

Here is my trace from when it failed.

```
janv. 22 22:13:56 gpu-server ollama[3374697]: time=2026-01-22T22:13:56.576+01:00 level=TRACE source=bytepairencoding.go:270 msg=decoded string=</tool_call> from=[154844]
janv. 22 22:13:56 gpu-server ollama[3374697]: time=2026-01-22T22:13:56.576+01:00 level=TRACE source=runner.go:657 msg="computeBatch: outputs are ready" batchID=6298
janv. 22 22:13:56 gpu-server ollama[3374697]: time=2026-01-22T22:13:56.576+01:00 level=TRACE source=runner.go:652 msg="computeBatch: inputs are ready" batchID=6299
janv. 22 22:13:56 gpu-server ollama[3374697]: time=2026-01-22T22:13:56.576+01:00 level=TRACE source=routes.go:2257 msg="builtin parser input" parser=glm-4.7 content=</tool_call>
janv. 22 22:13:56 gpu-server ollama[3374697]: time=2026-01-22T22:13:56.576+01:00 level=TRACE source=glm46.go:118 msg="glm-4.6 events parsed" events="[{raw:write_file</arg_key><arg_value>[REDACTED]</arg_value><arg_key>content</arg_key><arg_value>[REDACTED]</arg_value>}]" state=4 buffer=""
janv. 22 22:13:56 gpu-server ollama[3374697]: time=2026-01-22T22:13:56.576+01:00 level=WARN source=glm46.go:89 msg="glm-4.6 tool call parsing failed" error="failed to parse XML: XML syntax error on line 1: element <tool_call> closed by </arg_key>"
janv. 22 22:13:56 gpu-server ollama[3374697]: time=2026-01-22T22:13:56.576+01:00 level=TRACE source=runner.go:725 msg="computeBatch: signaling computeStartedCh" batchID=6299
janv. 22 22:13:56 gpu-server ollama[3374697]: time=2026-01-22T22:13:56.576+01:00 level=TRACE source=runner.go:476 msg="forwardBatch compute started, setting up next batch" pendingBatch.id=6299 id=6300
janv. 22 22:13:56 gpu-server ollama[3374697]: time=2026-01-22T22:13:56.576+01:00 level=TRACE source=runner.go:598 msg="forwardBatch iBatch" batchID=6300 seqIdx=0 seq.iBatch=0 i+1=1 len(seq.inputs)=1
```
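
Reading the `events parsed` line in that trace: the raw event begins with `write_file</arg_key>`, i.e. the model emitted a closing `</arg_key>` it never opened, which is exactly the "element <tool_call> closed by </arg_key>" the XML parser rejects. A rough reconstruction of what a well-formed GLM-4.x tool call should look like, inferred from the tags visible in the trace rather than from the glm46.go source (the first argument name is not recoverable from the redacted event):

```
<tool_call>write_file
<arg_key>...</arg_key><arg_value>[REDACTED]</arg_value>
<arg_key>content</arg_key><arg_value>[REDACTED]</arg_value>
</tool_call>
```

The captured output instead jumped straight from the function name to `</arg_key>`, dropping the opening tag of the first argument.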

@rick-github commented on GitHub (Jan 22, 2026):

OpenCode requires a lot of context: just the base-level tools and instructions need ~11k tokens. I asked it to create a golang program to print "hello world" and add tests, and the context grew to 25k tokens.


@HuysArthur commented on GitHub (Jan 23, 2026):

Turns out the issue isn't with Ollama itself.
I was routing OpenCode → Open WebUI API → Ollama.
GLM-4.7-Flash works perfectly when served directly via vLLM, and other models (e.g. gpt-oss) work fine through the same WebUI → Ollama path.
So the problem seems specific to how GLM-4.7-Flash behaves when going through Ollama + WebUI as a middleman.


@HuysArthur commented on GitHub (Jan 23, 2026):

Here is a clear difference in behavior:

```console
user@host dir % opencode run "Create the file 'README' with the contents 'hello world', let me know if it succeeded" --model webui/glm-4.7-flash:bf16-80k
|  Write    Users/user/dir/README
user@host dir % rm README
user@host dir % opencode run "Create the file 'README' with the contents 'hello world', let me know if it succeeded" --model ollama/glm-4.7-flash:bf16-80k

I'll create the README file with the specified content.

|  Write    Users/user/dir/README

README created successfully with content "hello world".
```

I didn't change any of the model configuration in Open WebUI:

[glm-4.7-flash_bf16-80k-1769159953751.json](https://github.com/user-attachments/files/24817410/glm-4.7-flash_bf16-80k-1769159953751.json)


@bitflower commented on GitHub (Jan 25, 2026):

I run this very model with Claude Code using `ollama launch claude`. When I prompt, it seems to lose all context with the next prompt. Version 0.15.0.


@rick-github commented on GitHub (Jan 25, 2026):

Increase the size of the [context buffer](https://github.com/ollama/ollama/blob/main/docs/faq.mdx#how-can-i-specify-the-context-window-size) in the server. The documentation [recommends](https://docs.ollama.com/integrations/claude-code#manual-setup) 64k.

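Concretely, per those docs, that means raising `OLLAMA_CONTEXT_LENGTH` on the server, or setting `num_ctx` per request (a sketch; the 64k figure is from the linked Claude Code integration page):

```console
# server-wide, when running the server by hand:
$ OLLAMA_CONTEXT_LENGTH=64000 ollama serve

# or per request through the API options:
$ curl -s http://localhost:11434/api/generate -d \
    '{"model": "glm-4.7-flash:bf16", "prompt": "hi", "options": {"num_ctx": 64000}}'
```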

@HuysArthur commented on GitHub (Jan 27, 2026):

Found out this is an Open WebUI issue, so it's no longer relevant here.

Reference: github-starred/ollama#9061