[GH-ISSUE #13886] Ollama v0.15.0: glm-4.7-flash:q8_0 stays in a think loop #9086

Open
opened 2026-04-12 21:55:51 -05:00 by GiteaMirror · 17 comments

Originally created by @somera on GitHub (Jan 24, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/13886

What is the issue?

I updated to ollama 0.15.0 and downloaded glm-4.7-flash:q8_0.

And I have problems with this model. I'm testing it with Open WebUI, and sometimes it hangs in a think loop.

[Image](https://github.com/user-attachments/assets/e2dd8335-6a41-4206-9384-d2a6409af14e) [Image](https://github.com/user-attachments/assets/47b9f89f-43c6-40c8-91f3-54ed557f90b5)

I had the same problems with ollama 0.14.3.

After ~5 minutes I restarted the ollama service.

And here is the log:

Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.367+01:00 level=INFO source=server.go:245 msg="enabling flash attention"
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.367+01:00 level=INFO source=server.go:429 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /usr/share/ollama/.ollama/models/blobs/sha256-1bfdff04a01e06051d7dcf5bcd6d7486240e1a92d2ce3325f727a20f2965e68c --port 41179"
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.367+01:00 level=INFO source=sched.go:452 msg="system memory" total="196.6 GiB" free="193.5 GiB" free_swap="0 B"
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.367+01:00 level=INFO source=sched.go:459 msg="gpu memory" id=GPU-bf11243f-f6da-11f0-94c9-4cc7a902f6ae library=CUDA available="43.7 GiB" free="44.1 GiB" minimum="457.0 MiB" overhead="0 B"
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.367+01:00 level=INFO source=server.go:755 msg="loading model" "model layers"=48 requested=-1
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.380+01:00 level=INFO source=runner.go:1405 msg="starting ollama engine"
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.381+01:00 level=INFO source=runner.go:1440 msg="Server listening on 127.0.0.1:41179"
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.390+01:00 level=INFO source=runner.go:1278 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:8192 KvCacheType:q8_0 NumThreads:12 GPULayers:48[ID:GPU-bf11243f-f6da-11f0-94c9-4cc7a902f6ae Layers:48(0..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.439+01:00 level=INFO source=ggml.go:136 msg="" architecture=glm4moelite file_type=Q8_0 name="" description="" num_tensors=844 num_key_values=39
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: ggml_cuda_init: found 1 CUDA devices:
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]:   Device 0: NVIDIA RTXA6000-48C, compute capability 8.6, VMM: no, ID: GPU-bf11243f-f6da-11f0-94c9-4cc7a902f6ae
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.510+01:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.789+01:00 level=INFO source=runner.go:1278 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:8192 KvCacheType:q8_0 NumThreads:12 GPULayers:48[ID:GPU-bf11243f-f6da-11f0-94c9-4cc7a902f6ae Layers:48(0..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.911+01:00 level=INFO source=runner.go:1278 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:8192 KvCacheType:q8_0 NumThreads:12 GPULayers:48[ID:GPU-bf11243f-f6da-11f0-94c9-4cc7a902f6ae Layers:48(0..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.911+01:00 level=INFO source=ggml.go:482 msg="offloading 47 repeating layers to GPU"
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.911+01:00 level=INFO source=ggml.go:489 msg="offloading output layer to GPU"
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.911+01:00 level=INFO source=ggml.go:494 msg="offloaded 48/48 layers to GPU"
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.912+01:00 level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="29.3 GiB"
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.912+01:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="321.4 MiB"
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.912+01:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="424.5 MiB"
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.912+01:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="287.8 MiB"
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.912+01:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="4.0 MiB"
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.912+01:00 level=INFO source=device.go:272 msg="total memory" size="30.3 GiB"
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.912+01:00 level=INFO source=sched.go:526 msg="loaded runners" count=1
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.912+01:00 level=INFO source=server.go:1347 msg="waiting for llama runner to start responding"
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.912+01:00 level=INFO source=server.go:1381 msg="waiting for server to become available" status="llm server loading model"

At the moment I'm removing this model.

Relevant log output


OS

Ubuntu

GPU

Nvidia

CPU

AMD

Ollama version

0.15.0

GiteaMirror added the bug label 2026-04-12 21:55:51 -05:00

@lackroy511 commented on GitHub (Jan 24, 2026):

q4_K_M same


@VitalWind commented on GitHub (Jan 24, 2026):

Yeah, I'm having the same problem with glm-4.7-flash:q4_K_M on an AMD GPU; it happens almost every time.


@rick-github commented on GitHub (Jan 24, 2026):

Increase the size of the context buffer. If the prompt is large (e.g. a code file) and the model does a lot of thinking, it can fill up the context buffer, resulting in a context shift. This can cause the model to lose coherence and start rambling.

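For reference, the context window can be raised per request through the API options or interactively inside `ollama run`; a minimal sketch, assuming the default Ollama endpoint on localhost:11434 and an illustrative value of 32768:

```shell
# Per-request context size via the REST API (the value is illustrative, not a recommendation):
curl http://localhost:11434/api/generate -d '{
  "model": "glm-4.7-flash:q8_0",
  "prompt": "Why is the sky blue?",
  "options": { "num_ctx": 32768 }
}'

# Or interactively inside `ollama run glm-4.7-flash:q8_0`:
#   /set parameter num_ctx 32768
```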

@somera commented on GitHub (Jan 24, 2026):

@rick-github my context was small (=8192).


@rick-github commented on GitHub (Jan 24, 2026):

Set `OLLAMA_DEBUG=1` in the server environment and post the resulting log.

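For a systemd-managed install on Ubuntu (as used here), a minimal sketch of enabling debug logging and following it; the override snippet is an assumption about how the service is configured:

```shell
sudo systemctl edit ollama.service
# In the override file, add:
#   [Service]
#   Environment="OLLAMA_DEBUG=1"
sudo systemctl restart ollama
journalctl -u ollama -f   # follow the debug output
```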

@somera commented on GitHub (Jan 24, 2026):

Same problems with `num_ctx=16384`. I'll try to get the logs.


@somera commented on GitHub (Jan 24, 2026):

[Image](https://github.com/user-attachments/assets/0d804938-ffa7-4bc6-bfa9-0ffe418429b6) [Image](https://github.com/user-attachments/assets/e40deaf8-9321-474b-a752-2309ab54fa0e)

Here are the logs.

[ollama.log](https://github.com/user-attachments/files/24840345/ollama.log)

During the loop in Open WebUI I don't see any DEBUG log entries in syslog for the ollama service.

In the 2nd run I see:

Jan 24 23:14:33 AI-DEV-VM-Neptun ollama[357968]: time=2026-01-24T23:14:33.968+01:00 level=DEBUG source=cache.go:286 msg="context limit hit - shifting" id=0 limit=16384 input=16384 keep=4 discard=8190
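The numbers in that DEBUG line are consistent with the runner discarding half of the tokens that aren't kept when the context fills: (16384 - 4) / 2 = 8190 (inferred from the logged values, not from the source).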

@EmmanuelMr18 commented on GitHub (Jan 24, 2026):

I got the same problem. In my case I'm using cline and opencode, but the model gets stuck in a loop because it cannot do tool calling correctly.

I was using `num_ctx=202752` and `glm-4.7-flash:bf16` (but I also tested the q8 version and got the same result).
[Image](https://github.com/user-attachments/assets/cb331ac1-19fb-4585-aa0d-8146e89cc8b9)


@donatas-xyz commented on GitHub (Jan 24, 2026):

Same with the latest update of glm-4.7-flash:bf16. The first release was working fine with num_ctx=65000, but it was using 124 GB of memory. This new update only uses 67 GB of memory, but either gets stuck in thinking (not in a loop; it just stops thinking) or starts spewing out gibberish.


@somera commented on GitHub (Jan 24, 2026):

I've done some tests and currently the model is unusable.


@UltraRabbit commented on GitHub (Jan 25, 2026):

I tried to pull glm-4.7-flash:latest from ollama's repo. The same thinking loop occurred; however, as stated for 0.15.0, the memory consumption was fixed, with everything offloaded to the GPU. Then I downloaded unsloth/glm-4.7-flash-gguf:Q4_K_M from Hugging Face, imported the .gguf model into ollama with a Modelfile, and tested running it with ollama. The thinking loop didn't occur, but the memory consumption problem came back even with everything offloaded to the GPU: it just held the same amount as the VRAM in system RAM without releasing it. I verified the latest .gguf model with the latest llama.cpp and there was no memory consumption issue and no thinking loop. Two days earlier llama.cpp had the same issue and they fixed it in the latest version. Hoping ollama can sync with the latest llama.cpp and regenerate the model soon.

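For anyone reproducing the GGUF import described above, it typically looks like this sketch; the GGUF file name and the created model name are illustrative:

```shell
# Write a minimal Modelfile pointing at the downloaded GGUF (path is illustrative):
cat > Modelfile <<'EOF'
FROM ./GLM-4.7-Flash-Q4_K_M.gguf
EOF

ollama create glm-4.7-flash-unsloth -f Modelfile
ollama run glm-4.7-flash-unsloth
```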

@youyuzzg commented on GitHub (Jan 25, 2026):

> I tried to pull glm-4.7-flash:latest from ollama's repo. The same thinking loop occurred; however, as stated for 0.15.0, the memory consumption was fixed, with everything offloaded to the GPU. Then I downloaded unsloth/glm-4.7-flash-gguf:Q4_K_M from Hugging Face, imported the .gguf model into ollama with a Modelfile, and tested running it with ollama. The thinking loop didn't occur, but the memory consumption problem came back even with everything offloaded to the GPU: it just held the same amount as the VRAM in system RAM without releasing it. I verified the latest .gguf model with the latest llama.cpp and there was no memory consumption issue and no thinking loop. Two days earlier llama.cpp had the same issue and they fixed it in the latest version. Hoping ollama can sync with the latest llama.cpp and regenerate the model soon.

+1


@VitalWind commented on GitHub (Jan 25, 2026):

I noticed the model was updated again a few hours ago and re-downloaded it. The maximum num_ctx capacity appears to be reduced compared to previous versions. In most cases, the model still gets stuck in an infinite thinking loop or stops during the "thinking" phase.


@EmmanuelMr18 commented on GitHub (Jan 25, 2026):

This problem got fixed with release [v0.15.1](https://github.com/ollama/ollama/releases/tag/v0.15.1). I'm using opencode and the tool calling started to work again (no need to pull a new model, just update ollama).
Just update your ollama version: `curl -fsSL https://ollama.com/install.sh | sh`

Commit that fixed the problem: https://github.com/ollama/ollama/commit/a1ca428c90e6a0d8e5a94be7806f3319ab2cc680

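After running the install script, the active server version can be verified like this, assuming the default port; restarting the service only matters if the previously started server is still running:

```shell
ollama -v                                    # client and server version
curl -s http://localhost:11434/api/version   # e.g. {"version":"0.15.1"}
sudo systemctl restart ollama                # if the old server process is still loaded
```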

@somera commented on GitHub (Jan 25, 2026):

I updated to ollama 0.15.1 (`{"version":"0.15.1"}`) and downloaded the model (`ollama pull glm-4.7-flash:q8_0`).

And the think loop problem still exists for me in Open WebUI.

With `ollama run glm-4.7-flash:q8_0` it worked. The generated code works.

A 2nd try in Open WebUI (new chat) runs without a loop, but the generated code has a problem. The next prompt in the same chat to fix the problem produces the think loop.

Is this a Web UI problem?

This is my test prompt.

You are a senior Python developer.

Write a single-file, self-contained Python 3 program that implements a Tic-Tac-Toe game with a graphical UI using ONLY the standard library (tkinter). Do not use any external dependencies.

Requirements:
1) Board and rules
- Fixed 3×3 grid.
- Two players: "X" and "O", alternating turns.
- Classic win condition: 3 in a row (horizontal, vertical, diagonal).
- IMPORTANT special rule: Each player may have at most 3 symbols on the board at any time.
  - Keep a per-player move history (FIFO order).
  - When a player already has 3 symbols and makes a new move:
    a) Identify that player's oldest symbol (FIFO).
    b) Visually DIM that oldest symbol briefly (e.g., change its text color to gray) while it still remains visible.
    c) After a short delay (e.g., 200–300 ms), remove that oldest symbol from the board (clear the cell).
    d) Only AFTER the old symbol is removed, place the new symbol at the clicked cell.
- Ignore clicks on occupied cells.

2) Game flow
- Once a player wins, highlight the winning line (e.g., green text) and end the game:
  - Do not allow any further moves after a win until the user presses Reset.
- Provide a Reset button that clears the board and starts a new game with X.

3) UI / window behavior (important)
- Use a status line at the top that shows:
  - whose turn it is (or the winner), and
  - the current number of symbols on board for X and O (e.g., X:2/3 O:3/3).
- The window width MUST NOT change ("jitter") when the status text changes.
  Implement this by using a fixed-width status label (monospace font + fixed width) and freezing the window size after initial layout (e.g., update_idletasks + set min/max size).

4) Implementation notes
- Use tkinter Buttons for the 3×3 cells.
- Maintain board state in a 3×3 structure and maintain per-player FIFO history (e.g., collections.deque).
- During the dim/removal animation, temporarily lock input so the user cannot click other cells.
- Keep the code clean, readable, and well-structured (classes/functions).
- Include a main guard: if __name__ == "__main__": ...

Output:
- Return ONLY the complete Python code in one file. No explanations.

How should I set the following parameters for glm-4.7-flash:

  • temperature
  • top-k
  • top-p
  • repeat penalty
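Recommended values for those parameters would come from the model card rather than from Ollama itself, but mechanically they can be set per request or baked into a derived model; a sketch with placeholder values, not recommendations:

```shell
# Per request via the API (values are placeholders):
curl http://localhost:11434/api/generate -d '{
  "model": "glm-4.7-flash:q8_0",
  "prompt": "Hello",
  "options": {
    "temperature": 0.7,
    "top_k": 40,
    "top_p": 0.95,
    "repeat_penalty": 1.1
  }
}'

# Or baked into a derived model via a Modelfile:
#   FROM glm-4.7-flash:q8_0
#   PARAMETER temperature 0.7
#   PARAMETER top_k 40
#   PARAMETER top_p 0.95
#   PARAMETER repeat_penalty 1.1
# then: ollama create glm-4.7-flash-tuned -f Modelfile
```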

@somera commented on GitHub (Jan 27, 2026):

> Set `OLLAMA_DEBUG=1` in the server environment and post the resulting log.

@rick-github have you seen something in the logs?


@somera commented on GitHub (Feb 2, 2026):

@rick-github I updated from 0.15.1 to 0.15.4 and downloaded `glm-4.7-flash:q8_0` again. I don't see any fixes for the issue.

But at the moment I can't reproduce my problems.

Reference: github-starred/ollama#9086