[GH-ISSUE #12069] 1h54m41s running time! #54530

Closed
opened 2026-04-29 06:15:33 -05:00 by GiteaMirror · 1 comment

Originally created by @geogesors on GitHub (Aug 25, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12069

What is the issue?

I'm using RAGFlow to connect to a local Ollama gpt-oss:20b model on an NVIDIA RTX 3090 with 24 GB of VRAM. After running for a long time, the process suddenly hangs, and I have to manually run `ollama stop gpt-oss:20b` before Ollama will start working normally again.

When it hangs, there are no error logs. `ollama ps` shows that the model is still in memory and everything appears normal, but there is absolutely no output. After I stop the model, it reloads and starts producing output again, but the logged request duration shows "1h54m41s", which is exactly the time from when it hung to when I restarted it.
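
One way to tell whether the runner is truly stalled during the apparent hang is to watch whether the GPU is still busy. A minimal sketch, assuming the default systemd service install and the NVIDIA driver's `nvidia-smi`:

```shell
# If the runner is still generating, GPU utilization stays non-zero even
# though the client receives no output; refresh the readout every second.
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1

# Follow the server log live (assumes the systemd service install).
journalctl -u ollama -f
```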

What's going on? Can you help me?

Relevant log output

```shell
8月 25 18:28:47 djjx ollama[1096449]: [GIN] 2025/08/25 - 18:28:47 | 200 |    2.079488ms |       127.0.0.1 | POST     "/api/generate"
8月 25 18:28:48 djjx ollama[1096449]: time=2025-08-25T18:28:48.944+08:00 level=INFO source=server.go:211 msg="enabling flash attention"
8月 25 18:28:48 djjx ollama[1096449]: time=2025-08-25T18:28:48.944+08:00 level=WARN source=server.go:219 msg="kv cache type not supported by model" type=""
8月 25 18:28:48 djjx ollama[1096449]: time=2025-08-25T18:28:48.946+08:00 level=INFO source=server.go:383 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /data/ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 --port 33301"
8月 25 18:28:48 djjx ollama[1096449]: time=2025-08-25T18:28:48.967+08:00 level=INFO source=runner.go:1006 msg="starting ollama engine"
8月 25 18:28:48 djjx ollama[1096449]: time=2025-08-25T18:28:48.968+08:00 level=INFO source=runner.go:1043 msg="Server listening on 127.0.0.1:33301"
8月 25 18:28:49 djjx ollama[1096449]: time=2025-08-25T18:28:49.214+08:00 level=INFO source=server.go:488 msg="system memory" total="125.2 GiB" free="102.0 GiB" free_swap="15.7 GiB"
8月 25 18:28:49 djjx ollama[1096449]: time=2025-08-25T18:28:49.470+08:00 level=INFO source=memory.go:36 msg="new model will fit in available VRAM across minimum required GPUs, loading" model=/data/ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 library=cuda parallel=1 required="15.1 GiB" gpus=1
8月 25 18:28:49 djjx ollama[1096449]: time=2025-08-25T18:28:49.731+08:00 level=INFO source=server.go:531 msg=offload library=cuda layers.requested=-1 layers.model=25 layers.offload=25 layers.split=[25] memory.available="[22.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="15.1 GiB" memory.required.partial="15.1 GiB" memory.required.kv="492.0 MiB" memory.required.allocations="[15.1 GiB]" memory.weights.total="11.7 GiB" memory.weights.repeating="10.7 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="2.0 GiB" memory.graph.partial="2.0 GiB"
8月 25 18:28:49 djjx ollama[1096449]: time=2025-08-25T18:28:49.733+08:00 level=INFO source=runner.go:925 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:16384 KvCacheType: NumThreads:24 GPULayers:25[ID:GPU-8ab954ea-f560-c304-bb45-f6ffddfa6398 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
8月 25 18:28:49 djjx ollama[1096449]: time=2025-08-25T18:28:49.820+08:00 level=INFO source=ggml.go:130 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=315 num_key_values=30
8月 25 18:28:49 djjx ollama[1096449]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
8月 25 18:28:49 djjx ollama[1096449]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
8月 25 18:28:49 djjx ollama[1096449]: ggml_cuda_init: found 1 CUDA devices:
8月 25 18:28:49 djjx ollama[1096449]:   Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, ID: GPU-8ab954ea-f560-c304-bb45-f6ffddfa6398
8月 25 18:28:49 djjx ollama[1096449]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/libggml-cuda.so
8月 25 18:28:49 djjx ollama[1096449]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-sandybridge.so
8月 25 18:28:49 djjx ollama[1096449]: time=2025-08-25T18:28:49.920+08:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
8月 25 18:28:50 djjx ollama[1096449]: time=2025-08-25T18:28:50.164+08:00 level=INFO source=ggml.go:486 msg="offloading 24 repeating layers to GPU"
8月 25 18:28:50 djjx ollama[1096449]: time=2025-08-25T18:28:50.165+08:00 level=INFO source=ggml.go:492 msg="offloading output layer to GPU"
8月 25 18:28:50 djjx ollama[1096449]: time=2025-08-25T18:28:50.165+08:00 level=INFO source=ggml.go:497 msg="offloaded 25/25 layers to GPU"
8月 25 18:28:50 djjx ollama[1096449]: time=2025-08-25T18:28:50.165+08:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA0 size="11.8 GiB"
8月 25 18:28:50 djjx ollama[1096449]: time=2025-08-25T18:28:50.165+08:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="1.1 GiB"
8月 25 18:28:50 djjx ollama[1096449]: time=2025-08-25T18:28:50.165+08:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA0 size="492.0 MiB"
8月 25 18:28:50 djjx ollama[1096449]: time=2025-08-25T18:28:50.165+08:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA0 size="129.8 MiB"
8月 25 18:28:50 djjx ollama[1096449]: time=2025-08-25T18:28:50.165+08:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="5.6 MiB"
8月 25 18:28:50 djjx ollama[1096449]: time=2025-08-25T18:28:50.165+08:00 level=INFO source=backend.go:342 msg="total memory" size="13.4 GiB"
8月 25 18:28:50 djjx ollama[1096449]: time=2025-08-25T18:28:50.165+08:00 level=INFO source=sched.go:473 msg="loaded runners" count=1
8月 25 18:28:50 djjx ollama[1096449]: time=2025-08-25T18:28:50.165+08:00 level=INFO source=server.go:1234 msg="waiting for llama runner to start responding"
8月 25 18:28:50 djjx ollama[1096449]: time=2025-08-25T18:28:50.166+08:00 level=INFO source=server.go:1268 msg="waiting for server to become available" status="llm server loading model"
8月 25 18:28:58 djjx ollama[1096449]: time=2025-08-25T18:28:58.955+08:00 level=INFO source=server.go:1272 msg="llama runner started in 10.01 seconds"
8月 25 18:29:01 djjx ollama[1096449]: [GIN] 2025/08/25 - 18:29:01 | 200 |      35.829µs |       127.0.0.1 | HEAD     "/"
8月 25 18:29:02 djjx ollama[1096449]: [GIN] 2025/08/25 - 18:29:01 | 200 |      49.099µs |       127.0.0.1 | GET      "/api/ps"
8月 25 18:29:22 djjx ollama[1096449]: [GIN] 2025/08/25 - 18:29:22 | 200 |          5m4s |      172.22.0.6 | POST     "/api/chat"
8月 25 18:29:31 djjx ollama[1096449]: [GIN] 2025/08/25 - 18:29:31 | 200 |      42.673µs |       127.0.0.1 | HEAD     "/"
8月 25 18:29:31 djjx ollama[1096449]: [GIN] 2025/08/25 - 18:29:31 | 200 |      47.912µs |       127.0.0.1 | GET      "/api/ps"
8月 25 18:29:42 djjx ollama[1096449]: [GIN] 2025/08/25 - 18:29:42 | 200 |         5m24s |      172.22.0.6 | POST     "/api/chat"
8月 25 18:29:45 djjx ollama[1096449]: [GIN] 2025/08/25 - 18:29:45 | 200 |         5m28s |      172.22.0.6 | POST     "/api/chat"
8月 25 18:29:59 djjx ollama[1096449]: [GIN] 2025/08/25 - 18:29:59 | 200 |         5m41s |      172.22.0.6 | POST     "/api/chat"
8月 25 18:30:01 djjx ollama[1096449]: [GIN] 2025/08/25 - 18:30:01 | 200 |      54.617µs |       127.0.0.1 | HEAD     "/"
8月 25 18:30:01 djjx ollama[1096449]: [GIN] 2025/08/25 - 18:30:01 | 200 |      58.877µs |       127.0.0.1 | GET      "/api/ps"
8月 25 18:30:29 djjx ollama[1096449]: [GIN] 2025/08/25 - 18:30:29 | 200 |      1h54m41s |      172.22.0.6 | POST     "/api/chat"
```

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

ollama version is 0.11.5

GiteaMirror added the bug and needs more info labels 2026-04-29 06:15:33 -05:00

@rick-github commented on GitHub (Aug 25, 2025):

Does it hang, or is it generating tokens? If your client has `stream:false`, the model may be generating tokens that are being accumulated by the server, but the client doesn't see any activity. It could be that the model has lost coherence and is generating tokens without ever reaching an end-of-sequence token. If so, you can make generation stop early by setting [`num_predict`](https://github.com/ollama/ollama/blob/main/docs/modelfile.md#:~:text=stop%20%22AI%20assistant%3A%22-,num_predict,-Maximum%20number%20of).
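
A minimal sketch of such a request against Ollama's `/api/chat` endpoint (the host, model name, and the 2048-token cap are illustrative; `stream` and `options.num_predict` are documented request fields):

```shell
# Stream tokens so the client sees activity as soon as generation starts,
# and cap the response length so a runaway generation cannot run for hours.
curl http://127.0.0.1:11434/api/chat -d '{
  "model": "gpt-oss:20b",
  "messages": [{"role": "user", "content": "Summarize this document."}],
  "stream": true,
  "options": {"num_predict": 2048}
}'
```

With `stream: true` a stalled runner is distinguishable from a slow one, since tokens arrive as they are produced; `num_predict` bounds the worst case either way.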

Reference: github-starred/ollama#54530