[GH-ISSUE #13886] Ollama v0.15.0: glm-4.7-flash:q8_0 stays in a think loop #9086

Open
opened 2026-04-12 21:55:51 -05:00 by GiteaMirror · 17 comments

Originally created by @somera on GitHub (Jan 24, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/13886

What is the issue?

I updated to ollama 0.15.0 and downloaded glm-4.7-flash:q8_0.

And I have problems with this model. I'm testing it with Open WebUI, and sometimes it hangs in a think loop.

[Image](https://github.com/user-attachments/assets/e2dd8335-6a41-4206-9384-d2a6409af14e) [Image](https://github.com/user-attachments/assets/47b9f89f-43c6-40c8-91f3-54ed557f90b5)

I had the same problems with ollama 0.14.3.

After ~5 minutes I restarted the ollama service.

And here is the log:

Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.367+01:00 level=INFO source=server.go:245 msg="enabling flash attention"
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.367+01:00 level=INFO source=server.go:429 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /usr/share/ollama/.ollama/models/blobs/sha256-1bfdff04a01e06051d7dcf5bcd6d7486240e1a92d2ce3325f727a20f2965e68c --port 41179"
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.367+01:00 level=INFO source=sched.go:452 msg="system memory" total="196.6 GiB" free="193.5 GiB" free_swap="0 B"
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.367+01:00 level=INFO source=sched.go:459 msg="gpu memory" id=GPU-bf11243f-f6da-11f0-94c9-4cc7a902f6ae library=CUDA available="43.7 GiB" free="44.1 GiB" minimum="457.0 MiB" overhead="0 B"
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.367+01:00 level=INFO source=server.go:755 msg="loading model" "model layers"=48 requested=-1
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.380+01:00 level=INFO source=runner.go:1405 msg="starting ollama engine"
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.381+01:00 level=INFO source=runner.go:1440 msg="Server listening on 127.0.0.1:41179"
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.390+01:00 level=INFO source=runner.go:1278 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:8192 KvCacheType:q8_0 NumThreads:12 GPULayers:48[ID:GPU-bf11243f-f6da-11f0-94c9-4cc7a902f6ae Layers:48(0..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.439+01:00 level=INFO source=ggml.go:136 msg="" architecture=glm4moelite file_type=Q8_0 name="" description="" num_tensors=844 num_key_values=39
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: ggml_cuda_init: found 1 CUDA devices:
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]:   Device 0: NVIDIA RTXA6000-48C, compute capability 8.6, VMM: no, ID: GPU-bf11243f-f6da-11f0-94c9-4cc7a902f6ae
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.510+01:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.789+01:00 level=INFO source=runner.go:1278 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:8192 KvCacheType:q8_0 NumThreads:12 GPULayers:48[ID:GPU-bf11243f-f6da-11f0-94c9-4cc7a902f6ae Layers:48(0..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.911+01:00 level=INFO source=runner.go:1278 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:8192 KvCacheType:q8_0 NumThreads:12 GPULayers:48[ID:GPU-bf11243f-f6da-11f0-94c9-4cc7a902f6ae Layers:48(0..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.911+01:00 level=INFO source=ggml.go:482 msg="offloading 47 repeating layers to GPU"
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.911+01:00 level=INFO source=ggml.go:489 msg="offloading output layer to GPU"
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.911+01:00 level=INFO source=ggml.go:494 msg="offloaded 48/48 layers to GPU"
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.912+01:00 level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="29.3 GiB"
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.912+01:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="321.4 MiB"
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.912+01:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="424.5 MiB"
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.912+01:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="287.8 MiB"
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.912+01:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="4.0 MiB"
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.912+01:00 level=INFO source=device.go:272 msg="total memory" size="30.3 GiB"
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.912+01:00 level=INFO source=sched.go:526 msg="loaded runners" count=1
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.912+01:00 level=INFO source=server.go:1347 msg="waiting for llama runner to start responding"
Jan 24 19:38:04 AI-DEV-VM-Neptun ollama[325411]: time=2026-01-24T19:38:04.912+01:00 level=INFO source=server.go:1381 msg="waiting for server to become available" status="llm server loading model"

At the moment I'm removing this model.

Relevant log output


OS

Ubuntu

GPU

Nvidia

CPU

AMD

Ollama version

0.15.0

GiteaMirror added the bug label 2026-04-12 21:55:51 -05:00

@lackroy511 commented on GitHub (Jan 24, 2026):

q4_K_M same


@VitalWind commented on GitHub (Jan 24, 2026):

Yeah, I'm having the same problem with glm-4.7-flash:q4_K_M on an AMD GPU; it happens almost every time.


@rick-github commented on GitHub (Jan 24, 2026):

Increase the size of the context buffer. If the prompt is large (e.g. a code file) and the model does a lot of thinking, it can fill up the context buffer, resulting in a context shift. This can cause the model to lose coherence and start rambling.

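For reference, the context window can be raised per request through the API options or interactively inside `ollama run`; a minimal sketch, assuming the default Ollama endpoint on localhost:11434 and an illustrative value of 32768:

```shell
# Per-request context size via the REST API (the value is illustrative, not a recommendation):
curl http://localhost:11434/api/generate -d '{
  "model": "glm-4.7-flash:q8_0",
  "prompt": "Why is the sky blue?",
  "options": { "num_ctx": 32768 }
}'

# Or interactively inside `ollama run glm-4.7-flash:q8_0`:
#   /set parameter num_ctx 32768
```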

@somera commented on GitHub (Jan 24, 2026):

@rick-github my context was small (=8192).


@rick-github commented on GitHub (Jan 24, 2026):

Set `OLLAMA_DEBUG=1` in the server environment and post the resulting log.

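For a systemd-managed install on Ubuntu (as used here), a minimal sketch of enabling debug logging and following it; the override snippet is an assumption about how the service is configured:

```shell
sudo systemctl edit ollama.service
# In the override file, add:
#   [Service]
#   Environment="OLLAMA_DEBUG=1"
sudo systemctl restart ollama
journalctl -u ollama -f   # follow the debug output
```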

@somera commented on GitHub (Jan 24, 2026):

Same problems with `num_ctx=16384`. I'll try to get the logs.


@somera commented on GitHub (Jan 24, 2026):

[Image](https://github.com/user-attachments/assets/0d804938-ffa7-4bc6-bfa9-0ffe418429b6) [Image](https://github.com/user-attachments/assets/e40deaf8-9321-474b-a752-2309ab54fa0e)

Here are the logs.

[ollama.log](https://github.com/user-attachments/files/24840345/ollama.log)

During the loop in Open WebUI I don't see any DEBUG log entries in syslog for the ollama service.

In the 2nd run I see:

Jan 24 23:14:33 AI-DEV-VM-Neptun ollama[357968]: time=2026-01-24T23:14:33.968+01:00 level=DEBUG source=cache.go:286 msg="context limit hit - shifting" id=0 limit=16384 input=16384 keep=4 discard=8190
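The numbers in that DEBUG line are consistent with the runner discarding half of the tokens that aren't kept when the context fills: (16384 - 4) / 2 = 8190 (inferred from the logged values, not from the source).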

@EmmanuelMr18 commented on GitHub (Jan 24, 2026):

I got the same problem. In my case I'm using cline and opencode, but the model gets stuck in a loop because it cannot do tool calling correctly.

I was using `num_ctx=202752` and `glm-4.7-flash:bf16` (but I also tested the q8 version and got the same result).
[Image](https://github.com/user-attachments/assets/cb331ac1-19fb-4585-aa0d-8146e89cc8b9)


@donatas-xyz commented on GitHub (Jan 24, 2026):

Same with the latest update of glm-4.7-flash:bf16. The first release was working fine with num_ctx=65000, but it was using 124 GB of memory. This new update only uses 67 GB of memory, but either gets stuck in thinking (not in a loop; it just stops thinking) or starts spewing out gibberish.


@somera commented on GitHub (Jan 24, 2026):

I've done some tests and currently the model is unusable.


@UltraRabbit commented on GitHub (Jan 25, 2026):

I tried to pull glm-4.7-flash:latest from ollama's repo. The same thinking loop occurred; however, as stated for 0.15.0, the memory consumption was fixed, with everything offloaded to the GPU. Then I downloaded unsloth/glm-4.7-flash-gguf:Q4_K_M from Hugging Face, imported the .gguf model into ollama with a Modelfile, and tested running it with ollama. The thinking loop didn't occur, but the memory consumption problem came back even with everything offloaded to the GPU: it just held the same amount as the VRAM in system RAM without releasing it. I verified the latest .gguf model with the latest llama.cpp and there was no memory consumption issue and no thinking loop. Two days earlier llama.cpp had the same issue and they fixed it in the latest version. Hoping ollama can sync with the latest llama.cpp and regenerate the model soon.

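For anyone reproducing the GGUF import described above, it typically looks like this sketch; the GGUF file name and the created model name are illustrative:

```shell
# Write a minimal Modelfile pointing at the downloaded GGUF (path is illustrative):
cat > Modelfile <<'EOF'
FROM ./GLM-4.7-Flash-Q4_K_M.gguf
EOF

ollama create glm-4.7-flash-unsloth -f Modelfile
ollama run glm-4.7-flash-unsloth
```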

@youyuzzg commented on GitHub (Jan 25, 2026):

> I tried to pull glm-4.7-flash:latest from ollama's repo. The same thinking loop occurred; however, as stated for 0.15.0, the memory consumption was fixed, with everything offloaded to the GPU. Then I downloaded unsloth/glm-4.7-flash-gguf:Q4_K_M from Hugging Face, imported the .gguf model into ollama with a Modelfile, and tested running it with ollama. The thinking loop didn't occur, but the memory consumption problem came back even with everything offloaded to the GPU: it just held the same amount as the VRAM in system RAM without releasing it. I verified the latest .gguf model with the latest llama.cpp and there was no memory consumption issue and no thinking loop. Two days earlier llama.cpp had the same issue and they fixed it in the latest version. Hoping ollama can sync with the latest llama.cpp and regenerate the model soon.

+1


@VitalWind commented on GitHub (Jan 25, 2026):

I noticed the model was updated again a few hours ago and re-downloaded it. The maximum num_ctx capacity appears to be reduced compared to previous versions. In most cases, the model still gets stuck in an infinite thinking loop or stops during the "thinking" phase.


@EmmanuelMr18 commented on GitHub (Jan 25, 2026):

This problem got fixed with release [v0.15.1](https://github.com/ollama/ollama/releases/tag/v0.15.1). I'm using opencode and the tool calling started to work again (no need to pull a new model, just update ollama).
Just update your ollama version: `curl -fsSL https://ollama.com/install.sh | sh`

Commit that fixed the problem: https://github.com/ollama/ollama/commit/a1ca428c90e6a0d8e5a94be7806f3319ab2cc680

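After running the install script, the active server version can be verified like this, assuming the default port; restarting the service only matters if the previously started server is still running:

```shell
ollama -v                                    # client and server version
curl -s http://localhost:11434/api/version   # e.g. {"version":"0.15.1"}
sudo systemctl restart ollama                # if the old server process is still loaded
```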

@somera commented on GitHub (Jan 25, 2026):

I updated to ollama 0.15.1 (`{"version":"0.15.1"}`) and downloaded the model (`ollama pull glm-4.7-flash:q8_0`).

And the think loop problem still exists for me in Open WebUI.

With `ollama run glm-4.7-flash:q8_0` it worked. The generated code works.

A 2nd try in Open WebUI (new chat) runs without a loop, but the generated code has a problem. The next prompt in the same chat to fix the problem produces the think loop.

Is this a Web UI problem?

This is my test prompt.

You are a senior Python developer.

Write a single-file, self-contained Python 3 program that implements a Tic-Tac-Toe game with a graphical UI using ONLY the standard library (tkinter). Do not use any external dependencies.

Requirements:
1) Board and rules
- Fixed 3×3 grid.
- Two players: "X" and "O", alternating turns.
- Classic win condition: 3 in a row (horizontal, vertical, diagonal).
- IMPORTANT special rule: Each player may have at most 3 symbols on the board at any time.
  - Keep a per-player move history (FIFO order).
  - When a player already has 3 symbols and makes a new move:
    a) Identify that player's oldest symbol (FIFO).
    b) Visually DIM that oldest symbol briefly (e.g., change its text color to gray) while it still remains visible.
    c) After a short delay (e.g., 200–300 ms), remove that oldest symbol from the board (clear the cell).
    d) Only AFTER the old symbol is removed, place the new symbol at the clicked cell.
- Ignore clicks on occupied cells.

2) Game flow
- Once a player wins, highlight the winning line (e.g., green text) and end the game:
  - Do not allow any further moves after a win until the user presses Reset.
- Provide a Reset button that clears the board and starts a new game with X.

3) UI / window behavior (important)
- Use a status line at the top that shows:
  - whose turn it is (or the winner), and
  - the current number of symbols on board for X and O (e.g., X:2/3 O:3/3).
- The window width MUST NOT change ("jitter") when the status text changes.
  Implement this by using a fixed-width status label (monospace font + fixed width) and freezing the window size after initial layout (e.g., update_idletasks + set min/max size).

4) Implementation notes
- Use tkinter Buttons for the 3×3 cells.
- Maintain board state in a 3×3 structure and maintain per-player FIFO history (e.g., collections.deque).
- During the dim/removal animation, temporarily lock input so the user cannot click other cells.
- Keep the code clean, readable, and well-structured (classes/functions).
- Include a main guard: if __name__ == "__main__": ...

Output:
- Return ONLY the complete Python code in one file. No explanations.

How should I set the following parameters for glm-4.7-flash:

  • temperature
  • top-k
  • top-p
  • repeat penalty
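Recommended values for those parameters would come from the model card rather than from Ollama itself, but mechanically they can be set per request or baked into a derived model; a sketch with placeholder values, not recommendations:

```shell
# Per request via the API (values are placeholders):
curl http://localhost:11434/api/generate -d '{
  "model": "glm-4.7-flash:q8_0",
  "prompt": "Hello",
  "options": {
    "temperature": 0.7,
    "top_k": 40,
    "top_p": 0.95,
    "repeat_penalty": 1.1
  }
}'

# Or baked into a derived model via a Modelfile:
#   FROM glm-4.7-flash:q8_0
#   PARAMETER temperature 0.7
#   PARAMETER top_k 40
#   PARAMETER top_p 0.95
#   PARAMETER repeat_penalty 1.1
# then: ollama create glm-4.7-flash-tuned -f Modelfile
```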

@somera commented on GitHub (Jan 27, 2026):

> Set `OLLAMA_DEBUG=1` in the server environment and post the resulting log.

@rick-github have you seen something in the logs?


@somera commented on GitHub (Feb 2, 2026):

@rick-github I updated from 0.15.1 to 0.15.4 and downloaded `glm-4.7-flash:q8_0` again. I don't see any fixes for the issue.

But at the moment I can't reproduce my problems.

Reference: github-starred/ollama#9086