[GH-ISSUE #13461] Ollama crashes with 100% cpu on one core when near context limit or truncating #34642

Open
opened 2026-04-22 18:22:54 -05:00 by GiteaMirror · 7 comments

Originally created by @arlaneenalra on GitHub (Dec 13, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/13461

What is the issue?

Since about 0.13.3 (it might have been a bit earlier, I'm not sure), I've noticed that Ollama will run fine for one request and then seemingly drop into a CPU spin loop, burning 100% of one core and becoming at least partially unresponsive. The spinning thread does not release the memory it has allocated. This seems to happen any time the API triggers a truncation, though I'm not certain whether the truncation itself is the cause; what I do know is that seeing this log message:

Dec 13 19:16:40 framework ollama[11089]: time=2025-12-13T19:16:40.738Z level=WARN source=runner.go:186 msg="truncating input prompt" limit=4096 prompt=11441 keep=4 new=4096

in the logs almost always means that I now have a thread of Ollama burning CPU. So far, I've only seen this on my Strix Halo Linux machines running Vulkan. The model doesn't seem to matter much: I've seen this behavior with gpt-oss:120b, ministral-3:14b, qwen3-next, and a few others.
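
For reference, a minimal API-level sketch that should hit the same truncation path: build a prompt well past the 4096-token default and post it to /api/generate. The model name and filler text are placeholders, not taken from the report:

```shell
# Roughly 15k words of filler; with OLLAMA_CONTEXT_LENGTH=4096 the runner
# should log the "truncating input prompt" warning shown above.
PROMPT=$(yes "lorem ipsum dolor sit amet" | head -n 3000 | tr '\n' ' ')
curl -s http://localhost:11434/api/generate \
  -d "{\"model\": \"gpt-oss:120b\", \"prompt\": \"$PROMPT\", \"stream\": false}"
```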

To reproduce, I usually do something like:

ollama run gpt-oss:120b

Output a list of the extended ascii character table as used on IBM compatible computers. In this list include the Decimal, hexadecimal, octal, and binary representations of the character codes, the purpose of non-printing characters as well as their alternative graphic representation. This table should include the original 128 characters as well as the extra 128 characters that were available on IBM compatible computers.

With that prompt it will sometimes hang outright mid-generation and seemingly drop into the same state, but without the truncation log message, so something appears to go wrong at or near the context limit. If I set a larger context window, the crash does not seem to happen, so I'm almost certain it has something to do with how the context storage is manipulated, but I haven't dug into that code.
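
For comparison, the larger-context workaround mentioned above can be applied either service-wide or per request; 16384 here is only an illustrative value:

```shell
# Service-wide: raise the default context length (the server-config log
# below shows OLLAMA_CONTEXT_LENGTH:4096 for the failing setup).
OLLAMA_CONTEXT_LENGTH=16384 ollama serve

# Per request: override num_ctx in the request options instead.
curl -s http://localhost:11434/api/generate \
  -d '{"model": "gpt-oss:120b", "prompt": "hello", "options": {"num_ctx": 16384}}'
```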

Relevant log output

Hang mid generation:


Dec 13 19:33:03 framework systemd[1]: Started ollama.service - Ollama Service.
Dec 13 19:33:03 framework ollama[11994]: time=2025-12-13T19:33:03.036Z level=INFO source=routes.go:1554 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GGML_VK_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/home/jules/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:true OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false OLLAMA_VULKAN:true ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Dec 13 19:33:03 framework ollama[11994]: time=2025-12-13T19:33:03.049Z level=INFO source=images.go:522 msg="total blobs: 201"
Dec 13 19:33:03 framework ollama[11994]: time=2025-12-13T19:33:03.051Z level=INFO source=images.go:529 msg="total unused blobs removed: 0"
Dec 13 19:33:03 framework ollama[11994]: [GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.
Dec 13 19:33:03 framework ollama[11994]: [GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
Dec 13 19:33:03 framework ollama[11994]:  - using env:        export GIN_MODE=release
Dec 13 19:33:03 framework ollama[11994]:  - using code:        gin.SetMode(gin.ReleaseMode)
Dec 13 19:33:03 framework ollama[11994]: [GIN-debug] HEAD   /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
Dec 13 19:33:03 framework ollama[11994]: [GIN-debug] GET    /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
Dec 13 19:33:03 framework ollama[11994]: [GIN-debug] HEAD   /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func3 (5 handlers)
Dec 13 19:33:03 framework ollama[11994]: [GIN-debug] GET    /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func4 (5 handlers)
Dec 13 19:33:03 framework ollama[11994]: [GIN-debug] POST   /api/pull                 --> github.com/ollama/ollama/server.(*Server).PullHandler-fm (5 handlers)
Dec 13 19:33:03 framework ollama[11994]: [GIN-debug] POST   /api/push                 --> github.com/ollama/ollama/server.(*Server).PushHandler-fm (5 handlers)
Dec 13 19:33:03 framework ollama[11994]: [GIN-debug] HEAD   /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
Dec 13 19:33:03 framework ollama[11994]: [GIN-debug] GET    /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
Dec 13 19:33:03 framework ollama[11994]: [GIN-debug] POST   /api/show                 --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (5 handlers)
Dec 13 19:33:03 framework ollama[11994]: [GIN-debug] DELETE /api/delete               --> github.com/ollama/ollama/server.(*Server).DeleteHandler-fm (5 handlers)
Dec 13 19:33:03 framework ollama[11994]: [GIN-debug] POST   /api/me                   --> github.com/ollama/ollama/server.(*Server).WhoamiHandler-fm (5 handlers)
Dec 13 19:33:03 framework ollama[11994]: [GIN-debug] POST   /api/signout              --> github.com/ollama/ollama/server.(*Server).SignoutHandler-fm (5 handlers)
Dec 13 19:33:03 framework ollama[11994]: [GIN-debug] DELETE /api/user/keys/:encodedKey --> github.com/ollama/ollama/server.(*Server).SignoutHandler-fm (5 handlers)
Dec 13 19:33:03 framework ollama[11994]: [GIN-debug] POST   /api/create               --> github.com/ollama/ollama/server.(*Server).CreateHandler-fm (5 handlers)
Dec 13 19:33:03 framework ollama[11994]: [GIN-debug] POST   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
Dec 13 19:33:03 framework ollama[11994]: [GIN-debug] HEAD   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
Dec 13 19:33:03 framework ollama[11994]: [GIN-debug] POST   /api/copy                 --> github.com/ollama/ollama/server.(*Server).CopyHandler-fm (5 handlers)
Dec 13 19:33:03 framework ollama[11994]: [GIN-debug] GET    /api/ps                   --> github.com/ollama/ollama/server.(*Server).PsHandler-fm (5 handlers)
Dec 13 19:33:03 framework ollama[11994]: [GIN-debug] POST   /api/generate             --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
Dec 13 19:33:03 framework ollama[11994]: [GIN-debug] POST   /api/chat                 --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
Dec 13 19:33:03 framework ollama[11994]: [GIN-debug] POST   /api/embed                --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (5 handlers)
Dec 13 19:33:03 framework ollama[11994]: [GIN-debug] POST   /api/embeddings           --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
Dec 13 19:33:03 framework ollama[11994]: [GIN-debug] POST   /v1/chat/completions      --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
Dec 13 19:33:03 framework ollama[11994]: [GIN-debug] POST   /v1/completions           --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
Dec 13 19:33:03 framework ollama[11994]: [GIN-debug] POST   /v1/embeddings            --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (6 handlers)
Dec 13 19:33:03 framework ollama[11994]: [GIN-debug] GET    /v1/models                --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (6 handlers)
Dec 13 19:33:03 framework ollama[11994]: [GIN-debug] GET    /v1/models/:model         --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (6 handlers)
Dec 13 19:33:03 framework ollama[11994]: [GIN-debug] POST   /v1/responses             --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
Dec 13 19:33:03 framework ollama[11994]: time=2025-12-13T19:33:03.051Z level=INFO source=routes.go:1607 msg="Listening on [::]:11434 (version 0.0.0)"
Dec 13 19:33:03 framework ollama[11994]: time=2025-12-13T19:33:03.052Z level=INFO source=runner.go:67 msg="discovering available GPUs..."
Dec 13 19:33:03 framework ollama[11994]: time=2025-12-13T19:33:03.052Z level=INFO source=server.go:429 msg="starting runner" cmd="/opt/ollama-0.13.4-rc1/bin/ollama runner --ollama-engine --port 43023"
Dec 13 19:33:03 framework ollama[11994]: time=2025-12-13T19:33:03.100Z level=INFO source=types.go:42 msg="inference compute" id=00000000-c300-0000-0000-000000000000 filter_id="" library=Vulkan compute=0.0 name=Vulkan0 description="AMD Radeon 8060S (RADV GFX1151)" libdirs=ollama driver=0.0 pci_id=0000:c3:00.0 type=iGPU total="117.7 GiB" available="117.5 GiB"
Dec 13 19:33:10 framework ollama[11994]: [GIN] 2025/12/13 - 19:33:10 | 200 |      49.123µs |       127.0.0.1 | HEAD     "/"
Dec 13 19:33:10 framework ollama[11994]: [GIN] 2025/12/13 - 19:33:10 | 200 |   71.524431ms |       127.0.0.1 | POST     "/api/show"
Dec 13 19:33:10 framework ollama[11994]: [GIN] 2025/12/13 - 19:33:10 | 200 |    71.88595ms |       127.0.0.1 | POST     "/api/show"
Dec 13 19:33:11 framework ollama[11994]: time=2025-12-13T19:33:11.034Z level=INFO source=server.go:429 msg="starting runner" cmd="/opt/ollama-0.13.4-rc1/bin/ollama runner --ollama-engine --port 33903"
Dec 13 19:33:11 framework ollama[11994]: time=2025-12-13T19:33:11.158Z level=INFO source=server.go:245 msg="enabling flash attention"
Dec 13 19:33:11 framework ollama[11994]: time=2025-12-13T19:33:11.158Z level=INFO source=server.go:429 msg="starting runner" cmd="/opt/ollama-0.13.4-rc1/bin/ollama runner --ollama-engine --model /home/jules/.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --port 33963"
Dec 13 19:33:11 framework ollama[11994]: time=2025-12-13T19:33:11.159Z level=INFO source=sched.go:443 msg="system memory" total="125.1 GiB" free="116.7 GiB" free_swap="8.0 GiB"
Dec 13 19:33:11 framework ollama[11994]: time=2025-12-13T19:33:11.159Z level=INFO source=sched.go:450 msg="gpu memory" id=00000000-c300-0000-0000-000000000000 library=Vulkan available="117.1 GiB" free="117.5 GiB" minimum="457.0 MiB" overhead="0 B"
Dec 13 19:33:11 framework ollama[11994]: time=2025-12-13T19:33:11.159Z level=INFO source=server.go:746 msg="loading model" "model layers"=37 requested=-1
Dec 13 19:33:11 framework ollama[11994]: time=2025-12-13T19:33:11.167Z level=INFO source=runner.go:1405 msg="starting ollama engine"
Dec 13 19:33:11 framework ollama[11994]: time=2025-12-13T19:33:11.168Z level=INFO source=runner.go:1440 msg="Server listening on 127.0.0.1:33963"
Dec 13 19:33:11 framework ollama[11994]: time=2025-12-13T19:33:11.171Z level=INFO source=runner.go:1278 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:8192 KvCacheType: NumThreads:16 GPULayers:37[ID:00000000-c300-0000-0000-000000000000 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Dec 13 19:33:11 framework ollama[11994]: time=2025-12-13T19:33:11.205Z level=INFO source=ggml.go:136 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
Dec 13 19:33:11 framework ollama[11994]: ggml_vulkan: Found 1 Vulkan devices:
Dec 13 19:33:11 framework ollama[11994]: ggml_vulkan: 0 = AMD Radeon 8060S (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
Dec 13 19:33:11 framework ollama[11994]: load_backend: loaded Vulkan backend from /opt/ollama-0.13.4-rc1/lib/ollama/libggml-vulkan.so
Dec 13 19:33:11 framework ollama[11994]: load_backend: loaded CPU backend from /opt/ollama-0.13.4-rc1/lib/ollama/libggml-cpu-icelake.so
Dec 13 19:33:11 framework ollama[11994]: time=2025-12-13T19:33:11.233Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 compiler=cgo(gcc)
Dec 13 19:33:11 framework ollama[11994]: ggml_backend_vk_get_device_memory called: uuid 00000000-c300-0000-0000-000000000000
Dec 13 19:33:11 framework ollama[11994]: ggml_backend_vk_get_device_memory called: luid 0x0000000000000000
Dec 13 19:33:11 framework ollama[11994]: time=2025-12-13T19:33:11.253Z level=INFO source=runner.go:1278 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:8192 KvCacheType: NumThreads:16 GPULayers:37[ID:00000000-c300-0000-0000-000000000000 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Dec 13 19:33:11 framework ollama[11994]: ggml_backend_vk_get_device_memory called: uuid 00000000-c300-0000-0000-000000000000
Dec 13 19:33:11 framework ollama[11994]: ggml_backend_vk_get_device_memory called: luid 0x0000000000000000
Dec 13 19:33:14 framework ollama[11994]: time=2025-12-13T19:33:14.103Z level=INFO source=runner.go:1278 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:8192 KvCacheType: NumThreads:16 GPULayers:37[ID:00000000-c300-0000-0000-000000000000 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Dec 13 19:33:14 framework ollama[11994]: time=2025-12-13T19:33:14.103Z level=INFO source=ggml.go:482 msg="offloading 36 repeating layers to GPU"
Dec 13 19:33:14 framework ollama[11994]: time=2025-12-13T19:33:14.103Z level=INFO source=ggml.go:489 msg="offloading output layer to GPU"
Dec 13 19:33:14 framework ollama[11994]: time=2025-12-13T19:33:14.103Z level=INFO source=ggml.go:494 msg="offloaded 37/37 layers to GPU"
Dec 13 19:33:14 framework ollama[11994]: time=2025-12-13T19:33:14.103Z level=INFO source=device.go:240 msg="model weights" device=Vulkan0 size="59.8 GiB"
Dec 13 19:33:14 framework ollama[11994]: time=2025-12-13T19:33:14.103Z level=INFO source=device.go:245 msg="model weights" device=CPU size="1.1 GiB"
Dec 13 19:33:14 framework ollama[11994]: time=2025-12-13T19:33:14.103Z level=INFO source=device.go:251 msg="kv cache" device=Vulkan0 size="450.0 MiB"
Dec 13 19:33:14 framework ollama[11994]: time=2025-12-13T19:33:14.103Z level=INFO source=device.go:262 msg="compute graph" device=Vulkan0 size="125.1 MiB"
Dec 13 19:33:14 framework ollama[11994]: time=2025-12-13T19:33:14.103Z level=INFO source=device.go:267 msg="compute graph" device=CPU size="5.6 MiB"
Dec 13 19:33:14 framework ollama[11994]: time=2025-12-13T19:33:14.103Z level=INFO source=device.go:272 msg="total memory" size="61.4 GiB"
Dec 13 19:33:14 framework ollama[11994]: time=2025-12-13T19:33:14.103Z level=INFO source=sched.go:517 msg="loaded runners" count=1
Dec 13 19:33:14 framework ollama[11994]: time=2025-12-13T19:33:14.103Z level=INFO source=server.go:1338 msg="waiting for llama runner to start responding"
Dec 13 19:33:14 framework ollama[11994]: time=2025-12-13T19:33:14.103Z level=INFO source=server.go:1372 msg="waiting for server to become available" status="llm server loading model"
Dec 13 19:33:30 framework ollama[11994]: time=2025-12-13T19:33:30.900Z level=INFO source=server.go:1376 msg="llama runner started in 19.74 seconds"
Dec 13 19:33:30 framework ollama[11994]: [GIN] 2025/12/13 - 19:33:30 | 200 | 20.008054023s |       127.0.0.1 | POST     "/api/generate"
Dec 13 19:35:43 framework ollama[11994]: [GIN] 2025/12/13 - 19:35:43 | 200 |      72.356µs |       127.0.0.1 | GET      "/api/version"
Dec 13 19:39:54 framework ollama[11994]: [GIN] 2025/12/13 - 19:39:54 | 200 |          5m0s |       127.0.0.1 | POST     "/api/chat"

OS

Linux

GPU

AMD

CPU

AMD

Ollama version

Source built 0.13.4-rc1 (Vulkan SDK vulkansdk-linux-x86_64-1.4.335.0)

GiteaMirror added the vulkan and bug labels 2026-04-22 18:22:54 -05:00

@arlaneenalra commented on GitHub (Dec 13, 2025):

Follow-on. I ran:

ollama run gpt-120b

With these prompts:

Output a list of the first 128 ascii characters as used on IBM compatible computers. In this list include the Decimal, hexadecimal, octal, and binary representations of the character codes, the purpose of non-printing characters as well as their alternative graphic representation.
Ok now do the same thing for the 128 characters of the extended ascii table as used on the IBM PC.

And got a hard crash:

| 205 | 0xCD | 0355 | 11001101 | ═ | U+2550 | Box drawings double horizontal |
| 206 | 0xCE | 0356 | 11001110 | ╬ | U+256C | Box drawings double vertical & horizontal |
| 207 | 0xCF | 0357 | 11001111 | ¤ | U+00A4 | Currency sign |
| 208 | 0xD0 | 0360 | 11010000 | ð | U+00F0 | Latin small eth |
| 209 | 0xD1 | 0361 | 11010001 | Ð | U+00D0 | Latin capital eth |
|Error: an error was encountered while running the model: unexpected EOF

logs2.txt: https://github.com/user-attachments/files/24145368/logs2.txt


@arlaneenalra commented on GitHub (Dec 13, 2025):

Vulkan Info:

vulkaninfo.txt: https://github.com/user-attachments/files/24145385/vulkaninfo.txt

jules@framework:~/code/1.4.335.0$ uname -a
Linux framework 6.14.0-37-generic #37~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 20 10:25:38 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

@fmu83 commented on GitHub (Jan 17, 2026):

Same issue here.
Model: gpt-oss:20b-ctx16384

CPU: Intel N100
GPU: Intel Arc Pro B50, 16 GB VRAM

Ollama version: 0.14.3-rc1
Vulkan Instance Version: 1.4.304
Mesa 25.3.3
firmware-intel-graphics 20251021-1~bpo13+1

The Ollama model hangs every few hours after a large request:
Jan 17 11:35:34 intel-ai ollama[610]: time=2026-01-17T11:35:34.179Z level=WARN source=runner.go:186 msg="truncating input prompt" limit=16384 prompt=28147 keep=4 new=16384

Ollama in general is responsive; I'm able to make an API call to list the models. But the model itself seems to have crashed: it is unresponsive if I try to do an "ollama run", and it spins endlessly.
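
This split is straightforward to check against the route list in the startup logs above. A quick triage sketch, assuming the default port and the model name from this report:

```shell
# The HTTP server still answers metadata routes; only completion hangs.
curl -s http://localhost:11434/api/tags    # model list: responds
curl -s http://localhost:11434/api/ps      # shows the loaded (wedged) runner
curl -s http://localhost:11434/api/generate \
  -d '{"model": "gpt-oss:20b-ctx16384", "prompt": "hi"}'   # never completes
```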

Stack trace after "kill -QUIT OLLAMAPID":

SIGQUIT: quit
PC=0x59093d02f9c1 m=0 sigcode=0
goroutine 0 gp=0x59093efbf180 m=0 mp=0x59093efc0f40 [idle]:
runtime.futex(0x59093efc1080, 0x80, 0x0, 0x0, 0x0, 0x0)
runtime/sys_linux_amd64.s:557 +0x21 fp=0x7ffeef8bc1a8 sp=0x7ffeef8bc1a0 pc=0x59093d02f9c1
runtime.futexsleep(0x7ffeef8bc220?, 0x3cfc8611?, 0x59093d02f5ad?)
runtime/os_linux.go:75 +0x30 fp=0x7ffeef8bc1f8 sp=0x7ffeef8bc1a8 pc=0x59093cfebd70
runtime.notesleep(0x59093efc1080)
runtime/lock_futex.go:47 +0x87 fp=0x7ffeef8bc230 sp=0x7ffeef8bc1f8 pc=0x59093cfc7d27
runtime.mPark(...)
runtime/proc.go:1887
runtime.stopm()
runtime/proc.go:2907 +0x8c fp=0x7ffeef8bc260 sp=0x7ffeef8bc230 pc=0x59093cff75cc
runtime.findRunnable()
runtime/proc.go:3644 +0xd9c fp=0x7ffeef8bc3d8 sp=0x7ffeef8bc260 pc=0x59093cff909c
runtime.schedule()
runtime/proc.go:4017 +0xb1 fp=0x7ffeef8bc410 sp=0x7ffeef8bc3d8 pc=0x59093cffa191
runtime.park_m(0xc000003340)
runtime/proc.go:4141 +0x285 fp=0x7ffeef8bc470 sp=0x7ffeef8bc410 pc=0x59093cffa645
runtime.mcall()
runtime/asm_amd64.s:459 +0x50 fp=0x7ffeef8bc488 sp=0x7ffeef8bc470 pc=0x59093d02bb70
goroutine 1 gp=0xc000002380 m=nil [IO wait, 28 minutes]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
runtime/proc.go:435 +0xce fp=0xc00134b790 sp=0xc00134b770 pc=0x59093d025d2e
runtime.netpollblock(0xc00134b7e0?, 0x3cfbf466?, 0x9?)
runtime/netpoll.go:575 +0xf7 fp=0xc00134b7c8 sp=0xc00134b790 pc=0x59093cfeb057
internal/poll.runtime_pollWait(0x7e3973ec6eb0, 0x72)
runtime/netpoll.go:351 +0x85 fp=0xc00134b7e8 sp=0xc00134b7c8 pc=0x59093d024f45
internal/poll.(*pollDesc).wait(0xc00011f700?, 0x900fc965e?, 0x0)
internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc00134b810 sp=0xc00134b7e8 pc=0x59093d0ad0c7
internal/poll.(*pollDesc).waitRead(...)
internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc00011f700)
internal/poll/fd_unix.go:620 +0x295 fp=0xc00134b8b8 sp=0xc00134b810 pc=0x59093d0b2495
net.(*netFD).accept(0xc00011f700)
net/fd_unix.go:172 +0x29 fp=0xc00134b970 sp=0xc00134b8b8 pc=0x59093d125549
net.(*TCPListener).accept(0xc000525ec0)
net/tcpsock_posix.go:159 +0x1b fp=0xc00134b9c0 sp=0xc00134b970 pc=0x59093d13b45b
net.(*TCPListener).Accept(0xc000525ec0)
net/tcpsock.go:380 +0x30 fp=0xc00134b9f0 sp=0xc00134b9c0 pc=0x59093d13a310
net/http.(*onceCloseListener).Accept(0xc0000e9dd0?)
<autogenerated>:1 +0x24 fp=0xc00134ba08 sp=0xc00134b9f0 pc=0x59093d3520c4
net/http.(*Server).Serve(0xc0001ef500, {0x59093e699a40, 0xc000525ec0})
net/http/server.go:3424 +0x30c fp=0xc00134bb38 sp=0xc00134ba08 pc=0x59093d32998c
github.com/ollama/ollama/runner/ollamarunner.Execute({0xc0000340a0, 0x4, 0x4})
github.com/ollama/ollama/runner/ollamarunner/runner.go:1441 +0x94e fp=0xc00134bd08 sp=0xc00134bb38 pc=0x59093d591f6e
github.com/ollama/ollama/runner.Execute({0xc000034080?, 0x0?, 0x0?})
github.com/ollama/ollama/runner/runner.go:28 +0x125 fp=0xc00134bd30 sp=0xc00134bd08 pc=0x59093d5bdba5
github.com/ollama/ollama/cmd.NewCLI.func3(0xc0001ef300?, {0x59093e1350e6?, 0x4?, 0x59093e1350ea?})
github.com/ollama/ollama/cmd/cmd.go:1961 +0x45 fp=0xc00134bd58 sp=0xc00134bd30 pc=0x59093dd81125
github.com/spf13/cobra.(*Command).execute(0xc000149808, {0xc000527bd0, 0x5, 0x5})
github.com/spf13/cobra@v1.7.0/command.go:940 +0x85c fp=0xc00134be78 sp=0xc00134bd58 pc=0x59093d19f4bc
github.com/spf13/cobra.(*Command).ExecuteC(0xc00052a908)
github.com/spf13/cobra@v1.7.0/command.go:1068 +0x3a5 fp=0xc00134bf30 sp=0xc00134be78 pc=0x59093d19fd05
github.com/spf13/cobra.(*Command).Execute(...)
github.com/spf13/cobra@v1.7.0/command.go:992
github.com/spf13/cobra.(*Command).ExecuteContext(...)
github.com/spf13/cobra@v1.7.0/command.go:985
main.main()
github.com/ollama/ollama/main.go:12 +0x4d fp=0xc00134bf50 sp=0xc00134bf30 pc=0x59093dd81c0d
runtime.main()
runtime/proc.go:283 +0x29d fp=0xc00134bfe0 sp=0xc00134bf50 pc=0x59093cff26dd
runtime.goexit({})
runtime/asm_amd64.s:1700 +0x1 fp=0xc00134bfe8 sp=0xc00134bfe0 pc=0x59093d02dbc1
goroutine 2 gp=0xc000002e00 m=nil [force gc (idle), 2 minutes]:
runtime.gopark(0x1bcd18a6059f?, 0x0?, 0x0?, 0x0?, 0x0?)
runtime/proc.go:435 +0xce fp=0xc000064fa8 sp=0xc000064f88 pc=0x59093d025d2e
runtime.goparkunlock(...)
runtime/proc.go:441
runtime.forcegchelper()
runtime/proc.go:348 +0xb8 fp=0xc000064fe0 sp=0xc000064fa8 pc=0x59093cff2a18
runtime.goexit({})
runtime/asm_amd64.s:1700 +0x1 fp=0xc000064fe8 sp=0xc000064fe0 pc=0x59093d02dbc1
created by runtime.init.7 in goroutine 1
runtime/proc.go:336 +0x1a
goroutine 3 gp=0xc000003340 m=nil [GC sweep wait]:
runtime.gopark(0x1?, 0x0?, 0x0?, 0x0?, 0x0?)
runtime/proc.go:435 +0xce fp=0xc000065780 sp=0xc000065760 pc=0x59093d025d2e
runtime.goparkunlock(...)
runtime/proc.go:441
runtime.bgsweep(0xc00007e000)
runtime/mgcsweep.go:316 +0xdf fp=0xc0000657c8 sp=0xc000065780 pc=0x59093cfdd1bf
runtime.gcenable.gowrap1()
runtime/mgc.go:204 +0x25 fp=0xc0000657e0 sp=0xc0000657c8 pc=0x59093cfd15a5
runtime.goexit({})
runtime/asm_amd64.s:1700 +0x1 fp=0xc0000657e8 sp=0xc0000657e0 pc=0x59093d02dbc1
created by runtime.gcenable in goroutine 1
runtime/mgc.go:204 +0x66
goroutine 4 gp=0xc000003500 m=nil [GC scavenge wait, 2 minutes]:
runtime.gopark(0x16208a7?, 0x152d9d3?, 0x0?, 0x0?, 0x0?)
runtime/proc.go:435 +0xce fp=0xc000065f78 sp=0xc000065f58 pc=0x59093d025d2e
runtime.goparkunlock(...)
runtime/proc.go:441
runtime.(*scavengerState).park(0x59093efbe120)
runtime/mgcscavenge.go:425 +0x49 fp=0xc000065fa8 sp=0xc000065f78 pc=0x59093cfdac09
runtime.bgscavenge(0xc00007e000)
runtime/mgcscavenge.go:658 +0x59 fp=0xc000065fc8 sp=0xc000065fa8 pc=0x59093cfdb199
runtime.gcenable.gowrap2()
runtime/mgc.go:205 +0x25 fp=0xc000065fe0 sp=0xc000065fc8 pc=0x59093cfd1545
runtime.goexit({})
runtime/asm_amd64.s:1700 +0x1 fp=0xc000065fe8 sp=0xc000065fe0 pc=0x59093d02dbc1
created by runtime.gcenable in goroutine 1
runtime/mgc.go:205 +0xa5
goroutine 5 gp=0xc000003dc0 m=nil [finalizer wait, 131 minutes]:
runtime.gopark(0x1b8?, 0xc000002380?, 0x1?, 0x23?, 0xc000064688?)
runtime/proc.go:435 +0xce fp=0xc000064630 sp=0xc000064610 pc=0x59093d025d2e
runtime.runfinq()
runtime/mfinal.go:196 +0x107 fp=0xc0000647e0 sp=0xc000064630 pc=0x59093cfd0567
runtime.goexit({})
runtime/asm_amd64.s:1700 +0x1 fp=0xc0000647e8 sp=0xc0000647e0 pc=0x59093d02dbc1
created by runtime.createfing in goroutine 1
runtime/mfinal.go:166 +0x3d
goroutine 6 gp=0xc0001cc8c0 m=nil [chan receive, 2 minutes]:
runtime.gopark(0xc00021fae0?, 0xc0005083f0?, 0x60?, 0x67?, 0x59093d10c188?)
runtime/proc.go:435 +0xce fp=0xc000066718 sp=0xc0000666f8 pc=0x59093d025d2e
runtime.chanrecv(0xc00009c310, 0x0, 0x1)
runtime/chan.go:664 +0x445 fp=0xc000066790 sp=0xc000066718 pc=0x59093cfc2045
runtime.chanrecv1(0x0?, 0x0?)
runtime/chan.go:506 +0x12 fp=0xc0000667b8 sp=0xc000066790 pc=0x59093cfc1bd2
runtime.unique_runtime_registerUniqueMapCleanup.func2(...)
runtime/mgc.go:1796
runtime.unique_runtime_registerUniqueMapCleanup.gowrap1()
runtime/mgc.go:1799 +0x2f fp=0xc0000667e0 sp=0xc0000667b8 pc=0x59093cfd474f
runtime.goexit({})
runtime/asm_amd64.s:1700 +0x1 fp=0xc0000667e8 sp=0xc0000667e0 pc=0x59093d02dbc1
created by unique.runtime_registerUniqueMapCleanup in goroutine 1
runtime/mgc.go:1794 +0x85
goroutine 7 gp=0xc0001cd180 m=nil [GC worker (idle)]:
runtime.gopark(0x1bcd1a505d78?, 0x1?, 0x67?, 0x85?, 0x0?)
runtime/proc.go:435 +0xce fp=0xc000066f38 sp=0xc000066f18 pc=0x59093d025d2e
runtime.gcBgMarkWorker(0xc00009d730)
runtime/mgc.go:1423 +0xe9 fp=0xc000066fc8 sp=0xc000066f38 pc=0x59093cfd3a69
runtime.gcBgMarkStartWorkers.gowrap1()
runtime/mgc.go:1339 +0x25 fp=0xc000066fe0 sp=0xc000066fc8 pc=0x59093cfd3945
runtime.goexit({})
runtime/asm_amd64.s:1700 +0x1 fp=0xc000066fe8 sp=0xc000066fe0 pc=0x59093d02dbc1
created by runtime.gcBgMarkStartWorkers in goroutine 1
runtime/mgc.go:1339 +0x105
goroutine 8 gp=0xc0001cd340 m=nil [GC worker (idle), 66 minutes]:
runtime.gopark(0x1832ed8786eb?, 0x3?, 0xa4?, 0x5c?, 0x0?)
runtime/proc.go:435 +0xce fp=0xc000067738 sp=0xc000067718 pc=0x59093d025d2e
runtime.gcBgMarkWorker(0xc00009d730)
runtime/mgc.go:1423 +0xe9 fp=0xc0000677c8 sp=0xc000067738 pc=0x59093cfd3a69
runtime.gcBgMarkStartWorkers.gowrap1()
runtime/mgc.go:1339 +0x25 fp=0xc0000677e0 sp=0xc0000677c8 pc=0x59093cfd3945
runtime.goexit({})
runtime/asm_amd64.s:1700 +0x1 fp=0xc0000677e8 sp=0xc0000677e0 pc=0x59093d02dbc1
created by runtime.gcBgMarkStartWorkers in goroutine 1
runtime/mgc.go:1339 +0x105
goroutine 9 gp=0xc0001cd500 m=nil [GC worker (idle)]:
runtime.gopark(0x59093f08da60?, 0x1?, 0x91?, 0xfe?, 0x0?)
runtime/proc.go:435 +0xce fp=0xc000067f38 sp=0xc000067f18 pc=0x59093d025d2e
runtime.gcBgMarkWorker(0xc00009d730)
runtime/mgc.go:1423 +0xe9 fp=0xc000067fc8 sp=0xc000067f38 pc=0x59093cfd3a69
runtime.gcBgMarkStartWorkers.gowrap1()
runtime/mgc.go:1339 +0x25 fp=0xc000067fe0 sp=0xc000067fc8 pc=0x59093cfd3945
runtime.goexit({})
runtime/asm_amd64.s:1700 +0x1 fp=0xc000067fe8 sp=0xc000067fe0 pc=0x59093d02dbc1
created by runtime.gcBgMarkStartWorkers in goroutine 1
runtime/mgc.go:1339 +0x105
goroutine 18 gp=0xc000102380 m=nil [GC worker (idle), 37 minutes]:
runtime.gopark(0x19c7179906cb?, 0x3?, 0x1b?, 0xfa?, 0x0?)
runtime/proc.go:435 +0xce fp=0xc000060738 sp=0xc000060718 pc=0x59093d025d2e
runtime.gcBgMarkWorker(0xc00009d730)
runtime/mgc.go:1423 +0xe9 fp=0xc0000607c8 sp=0xc000060738 pc=0x59093cfd3a69
runtime.gcBgMarkStartWorkers.gowrap1()
runtime/mgc.go:1339 +0x25 fp=0xc0000607e0 sp=0xc0000607c8 pc=0x59093cfd3945
runtime.goexit({})
runtime/asm_amd64.s:1700 +0x1 fp=0xc0000607e8 sp=0xc0000607e0 pc=0x59093d02dbc1
created by runtime.gcBgMarkStartWorkers in goroutine 1
runtime/mgc.go:1339 +0x105
goroutine 10 gp=0xc000540700 m=10 mp=0xc00009f808 [syscall, 37 minutes]:
runtime.cgocall(0x59093ddeffe5, 0xc007033318)
runtime/cgocall.go:167 +0x4b fp=0xc0070332f0 sp=0xc0070332b8 pc=0x59093d0228ab
github.com/ollama/ollama/ml/backend/ggml._Cfunc_ggml_backend_sched_graph_compute_async(0x7e395c0017b0, 0x7e35a433fd50)
_cgo_gotypes.go:977 +0x4a fp=0xc007033318 sp=0xc0070332f0 pc=0x59093d4a390a
github.com/ollama/ollama/ml/backend/ggml.(*Context).ComputeWithNotify.func2(...)
github.com/ollama/ollama/ml/backend/ggml/ggml.go:825
github.com/ollama/ollama/ml/backend/ggml.(*Context).ComputeWithNotify(0xc0002d0180, 0x0?, {0x0, 0x0, 0xc007033518?})
github.com/ollama/ollama/ml/backend/ggml/ggml.go:825 +0x1b2 fp=0xc0070333f0 sp=0xc007033318 pc=0x59093d4b12d2
github.com/ollama/ollama/ml/backend/ggml.(*Context).Compute(0xc0002d0180?, {0x0?, 0xc0002d0180?, 0x59093e6b26a0?})
github.com/ollama/ollama/ml/backend/ggml/ggml.go:811 +0x25 fp=0xc007033428 sp=0xc0070333f0 pc=0x59093d4b10e5
github.com/ollama/ollama/kvcache.(*Causal).shift(0xc0001ef600, 0x0, 0x4, 0xffffe002)
github.com/ollama/ollama/kvcache/causal.go:608 +0x250 fp=0xc007033588 sp=0xc007033428 pc=0x59093d49f030
github.com/ollama/ollama/kvcache.(*Causal).Remove(0xc0001ef600, 0x0, 0x4, 0x2002)
github.com/ollama/ollama/kvcache/causal.go:659 +0x285 fp=0xc007033620 sp=0xc007033588 pc=0x59093d49f6c5
github.com/ollama/ollama/kvcache.(*WrapperCache).Remove(0xc000114890?, 0x0, 0x4, 0x2002)
github.com/ollama/ollama/kvcache/wrapper.go:103 +0x5e fp=0xc007033658 sp=0xc007033620 pc=0x59093d4a0b3e
github.com/ollama/ollama/runner/ollamarunner.(*InputCache).ShiftCacheSlot(0xc00302c880, 0xc00053a600, 0x4)
github.com/ollama/ollama/runner/ollamarunner/cache.go:290 +0x34c fp=0xc0070337f0 sp=0xc007033658 pc=0x59093d5864ec
github.com/ollama/ollama/runner/ollamarunner.(*Server).forwardBatch(_, {0x110b, {0x59093e6a7670, 0xc002ffa080}, {0x59093e6b26a0, 0xc00125b410}, {0xc000232008, 0x3fc, 0x3ff}, {{0x59093e6b26a0, ...}, ...}, ...})
github.com/ollama/ollama/runner/ollamarunner/runner.go:565 +0xec5 fp=0xc007033b58 sp=0xc0070337f0 pc=0x59093d589c85
github.com/ollama/ollama/runner/ollamarunner.(*Server).run(0xc0002230e0, {0x59093e69c0a0, 0xc000527c70})
github.com/ollama/ollama/runner/ollamarunner/runner.go:452 +0x18c fp=0xc007033fb8 sp=0xc007033b58 pc=0x59093d588b6c
github.com/ollama/ollama/runner/ollamarunner.Execute.gowrap1()
github.com/ollama/ollama/runner/ollamarunner/runner.go:1418 +0x28 fp=0xc007033fe0 sp=0xc007033fb8 pc=0x59093d5921e8
runtime.goexit({})
runtime/asm_amd64.s:1700 +0x1 fp=0xc007033fe8 sp=0xc007033fe0 pc=0x59093d02dbc1
created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1
github.com/ollama/ollama/runner/ollamarunner/runner.go:1418 +0x4c9
goroutine 9306 gp=0xc000808c40 m=nil [sync.Mutex.Lock, 28 minutes]:
runtime.gopark(0x0?, 0xc001347710?, 0xfe?, 0x25?, 0xc00009c5b0?)
runtime/proc.go:435 +0xce fp=0xc0013476e0 sp=0xc0013476c0 pc=0x59093d025d2e
runtime.goparkunlock(...)
runtime/proc.go:441
runtime.semacquire1(0xc0002231dc, 0x0, 0x3, 0x2, 0x15)
runtime/sema.go:188 +0x229 fp=0xc001347748 sp=0xc0013476e0 pc=0x59093d005ca9
internal/sync.runtime_SemacquireMutex(0xc0013477c0?, 0x9f?, 0x59093e526e00?)
runtime/sema.go:95 +0x25 fp=0xc001347780 sp=0xc001347748 pc=0x59093d027545
internal/sync.(*Mutex).lockSlow(0xc0002231d8)
internal/sync/mutex.go:149 +0x15d fp=0xc0013477d0 sp=0xc001347780 pc=0x59093d03769d
internal/sync.(*Mutex).Lock(...)
internal/sync/mutex.go:70
sync.(*Mutex).Lock(...)
sync/mutex.go:46
github.com/ollama/ollama/runner/ollamarunner.(*Server).completion(0xc0002230e0, {0x59093e699c20, 0xc0001622a0}, 0xc0004963c0)
github.com/ollama/ollama/runner/ollamarunner/runner.go:923 +0x66e fp=0xc001347ac0 sp=0xc0013477d0 pc=0x59093d58ccae
github.com/ollama/ollama/runner/ollamarunner.(*Server).completion-fm({0x59093e699c20?, 0xc0001622a0?}, 0xc001347b40?)
<autogenerated>:1 +0x36 fp=0xc001347af0 sp=0xc001347ac0 pc=0x59093d5926d6
net/http.HandlerFunc.ServeHTTP(0xc00053aa80?, {0x59093e699c20?, 0xc0001622a0?}, 0xc001347b60?)
net/http/server.go:2294 +0x29 fp=0xc001347b18 sp=0xc001347af0 pc=0x59093d325fc9
net/http.(*ServeMux).ServeHTTP(0x59093cfcaa85?, {0x59093e699c20, 0xc0001622a0}, 0xc0004963c0)
net/http/server.go:2822 +0x1c4 fp=0xc001347b68 sp=0xc001347b18 pc=0x59093d327ec4
net/http.serverHandler.ServeHTTP({0x59093e696110?}, {0x59093e699c20?, 0xc0001622a0?}, 0x1?)
net/http/server.go:3301 +0x8e fp=0xc001347b98 sp=0xc001347b68 pc=0x59093d34594e
net/http.(*conn).serve(0xc0000e9dd0, {0x59093e69c068, 0xc000218d20})
net/http/server.go:2102 +0x625 fp=0xc001347fb8 sp=0xc001347b98 pc=0x59093d3244c5
net/http.(*Server).Serve.gowrap3()
net/http/server.go:3454 +0x28 fp=0xc001347fe0 sp=0xc001347fb8 pc=0x59093d329d88
runtime.goexit({})
runtime/asm_amd64.s:1700 +0x1 fp=0xc001347fe8 sp=0xc001347fe0 pc=0x59093d02dbc1
created by net/http.(*Server).Serve in goroutine 1
net/http/server.go:3454 +0x485
goroutine 9295 gp=0xc0014441c0 m=nil [sync.Mutex.Lock, 34 minutes]:
runtime.gopark(0x59093efc0f40?, 0xc000e8a0c0?, 0x80?, 0x2a?, 0x59093d023839?)
runtime/proc.go:435 +0xce fp=0xc00007ba88 sp=0xc00007ba68 pc=0x59093d025d2e
runtime.goparkunlock(...)
runtime/proc.go:441
runtime.semacquire1(0xc0002231dc, 0x0, 0x3, 0x2, 0x15)
runtime/sema.go:188 +0x229 fp=0xc00007baf0 sp=0xc00007ba88 pc=0x59093d005ca9
internal/sync.runtime_SemacquireMutex(0x59093d41c4d4?, 0x68?, 0xc000e8a0c0?)
runtime/sema.go:95 +0x25 fp=0xc00007bb28 sp=0xc00007baf0 pc=0x59093d027545
internal/sync.(*Mutex).lockSlow(0xc0002231d8)
internal/sync/mutex.go:149 +0x15d fp=0xc00007bb78 sp=0xc00007bb28 pc=0x59093d03769d
internal/sync.(*Mutex).Lock(...)
internal/sync/mutex.go:70
sync.(*Mutex).Lock(...)
sync/mutex.go:46
github.com/ollama/ollama/runner/ollamarunner.(*Server).computeBatch(0xc0002230e0, {0x110b, {0x59093e6a7670, 0xc002ffa080}, {0x59093e6b26a0, 0xc00125b410}, {0xc000232008, 0x3fc, 0x3ff}, {{0x59093e6b26a0, ...}, ...}, ...})
github.com/ollama/ollama/runner/ollamarunner/runner.go:735 +0x972 fp=0xc00007bef0 sp=0xc00007bb78 pc=0x59093d58b292
github.com/ollama/ollama/runner/ollamarunner.(*Server).run.gowrap1()
github.com/ollama/ollama/runner/ollamarunner/runner.go:458 +0x58 fp=0xc00007bfe0 sp=0xc00007bef0 pc=0x59093d588d98
runtime.goexit({})
runtime/asm_amd64.s:1700 +0x1 fp=0xc00007bfe8 sp=0xc00007bfe0 pc=0x59093d02dbc1
created by github.com/ollama/ollama/runner/ollamarunner.(*Server).run in goroutine 10
github.com/ollama/ollama/runner/ollamarunner/runner.go:458 +0x2cd
rax 0xca
rbx 0x0
rcx 0x59093d02f9c3
rdx 0x0
rdi 0x59093efc1080
rsi 0x80
rbp 0x7ffeef8bc1e8
rsp 0x7ffeef8bc1a0
r8 0x0
r9 0x0
r10 0x0
r11 0x286
r12 0x7ffeef8bc220
r13 0x7e3970219501
r14 0x59093efbf180
r15 0x1
rip 0x59093d02f9c1
rflags 0x286
cs 0x33
fs 0x0
gs 0x0
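
The interesting frame in this dump is goroutine 10: it has been stuck for 37 minutes inside the cgo call ggml_backend_sched_graph_compute_async, reached from kvcache.(*Causal).shift via InputCache.ShiftCacheSlot, i.e. the KV-cache shift that discards old context when the window fills (the 4 in Remove(0, 4, ...) matches keep=4 from the truncation warning). Goroutines 9306 (a completion handler) and 9295 (computeBatch) are parked on the same sync.Mutex, which matches the observed behavior: metadata routes still answer while anything touching the runner hangs. A minimal Go sketch of that hang shape (not Ollama code; the C busy-loop merely stands in for the backend call that never returns):

```go
package main

/*
// Stand-in for the backend compute call that never returns; the busy
// loop pins one core at 100%, matching the reported symptom.
static void stuck_compute(void) { for (;;) { } }
*/
import "C"

import "sync"

func main() {
	var mu sync.Mutex
	locked := make(chan struct{})

	go func() {
		mu.Lock() // the batch loop takes the server lock...
		close(locked)
		C.stuck_compute() // ...then blocks forever inside the cgo call
	}()

	<-locked
	mu.Lock() // a new completion request parks here, like goroutine 9306
}
```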

sp=0xc007033fb8 pc=0x59093d5921e8 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc007033fe8 sp=0xc007033fe0 pc=0x59093d02dbc1 created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1 github.com/ollama/ollama/runner/ollamarunner/runner.go:1418 +0x4c9 goroutine 9306 gp=0xc000808c40 m=nil [sync.Mutex.Lock, 28 minutes]: runtime.gopark(0x0?, 0xc001347710?, 0xfe?, 0x25?, 0xc00009c5b0?) runtime/proc.go:435 +0xce fp=0xc0013476e0 sp=0xc0013476c0 pc=0x59093d025d2e runtime.goparkunlock(...) runtime/proc.go:441 runtime.semacquire1(0xc0002231dc, 0x0, 0x3, 0x2, 0x15) runtime/sema.go:188 +0x229 fp=0xc001347748 sp=0xc0013476e0 pc=0x59093d005ca9 internal/sync.runtime_SemacquireMutex(0xc0013477c0?, 0x9f?, 0x59093e526e00?) runtime/sema.go:95 +0x25 fp=0xc001347780 sp=0xc001347748 pc=0x59093d027545 internal/sync.(*Mutex).lockSlow(0xc0002231d8) internal/sync/mutex.go:149 +0x15d fp=0xc0013477d0 sp=0xc001347780 pc=0x59093d03769d internal/sync.(*Mutex).Lock(...) internal/sync/mutex.go:70 sync.(*Mutex).Lock(...) sync/mutex.go:46 github.com/ollama/ollama/runner/ollamarunner.(*Server).completion(0xc0002230e0, {0x59093e699c20, 0xc0001622a0}, 0xc0004963c0) github.com/ollama/ollama/runner/ollamarunner/runner.go:923 +0x66e fp=0xc001347ac0 sp=0xc0013477d0 pc=0x59093d58ccae github.com/ollama/ollama/runner/ollamarunner.(*Server).completion-fm({0x59093e699c20?, 0xc0001622a0?}, 0xc001347b40?) <autogenerated>:1 +0x36 fp=0xc001347af0 sp=0xc001347ac0 pc=0x59093d5926d6 net/http.HandlerFunc.ServeHTTP(0xc00053aa80?, {0x59093e699c20?, 0xc0001622a0?}, 0xc001347b60?) net/http/server.go:2294 +0x29 fp=0xc001347b18 sp=0xc001347af0 pc=0x59093d325fc9 net/http.(*ServeMux).ServeHTTP(0x59093cfcaa85?, {0x59093e699c20, 0xc0001622a0}, 0xc0004963c0) net/http/server.go:2822 +0x1c4 fp=0xc001347b68 sp=0xc001347b18 pc=0x59093d327ec4 net/http.serverHandler.ServeHTTP({0x59093e696110?}, {0x59093e699c20?, 0xc0001622a0?}, 0x1?) net/http/server.go:3301 +0x8e fp=0xc001347b98 sp=0xc001347b68 pc=0x59093d34594e net/http.(*conn).serve(0xc0000e9dd0, {0x59093e69c068, 0xc000218d20}) net/http/server.go:2102 +0x625 fp=0xc001347fb8 sp=0xc001347b98 pc=0x59093d3244c5 net/http.(*Server).Serve.gowrap3() net/http/server.go:3454 +0x28 fp=0xc001347fe0 sp=0xc001347fb8 pc=0x59093d329d88 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc001347fe8 sp=0xc001347fe0 pc=0x59093d02dbc1 created by net/http.(*Server).Serve in goroutine 1 net/http/server.go:3454 +0x485 goroutine 9295 gp=0xc0014441c0 m=nil [sync.Mutex.Lock, 34 minutes]: runtime.gopark(0x59093efc0f40?, 0xc000e8a0c0?, 0x80?, 0x2a?, 0x59093d023839?) runtime/proc.go:435 +0xce fp=0xc00007ba88 sp=0xc00007ba68 pc=0x59093d025d2e runtime.goparkunlock(...) runtime/proc.go:441 runtime.semacquire1(0xc0002231dc, 0x0, 0x3, 0x2, 0x15) runtime/sema.go:188 +0x229 fp=0xc00007baf0 sp=0xc00007ba88 pc=0x59093d005ca9 internal/sync.runtime_SemacquireMutex(0x59093d41c4d4?, 0x68?, 0xc000e8a0c0?) runtime/sema.go:95 +0x25 fp=0xc00007bb28 sp=0xc00007baf0 pc=0x59093d027545 internal/sync.(*Mutex).lockSlow(0xc0002231d8) internal/sync/mutex.go:149 +0x15d fp=0xc00007bb78 sp=0xc00007bb28 pc=0x59093d03769d internal/sync.(*Mutex).Lock(...) internal/sync/mutex.go:70 sync.(*Mutex).Lock(...) 
sync/mutex.go:46 github.com/ollama/ollama/runner/ollamarunner.(*Server).computeBatch(0xc0002230e0, {0x110b, {0x59093e6a7670, 0xc002ffa080}, {0x59093e6b26a0, 0xc00125b410}, {0xc000232008, 0x3fc, 0x3ff}, {{0x59093e6b26a0, ...}, ...}, ...}) github.com/ollama/ollama/runner/ollamarunner/runner.go:735 +0x972 fp=0xc00007bef0 sp=0xc00007bb78 pc=0x59093d58b292 github.com/ollama/ollama/runner/ollamarunner.(*Server).run.gowrap1() github.com/ollama/ollama/runner/ollamarunner/runner.go:458 +0x58 fp=0xc00007bfe0 sp=0xc00007bef0 pc=0x59093d588d98 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00007bfe8 sp=0xc00007bfe0 pc=0x59093d02dbc1 created by github.com/ollama/ollama/runner/ollamarunner.(*Server).run in goroutine 10 github.com/ollama/ollama/runner/ollamarunner/runner.go:458 +0x2cd rax 0xca rbx 0x0 rcx 0x59093d02f9c3 rdx 0x0 rdi 0x59093efc1080 rsi 0x80 rbp 0x7ffeef8bc1e8 rsp 0x7ffeef8bc1a0 r8 0x0 r9 0x0 r10 0x0 r11 0x286 r12 0x7ffeef8bc220 r13 0x7e3970219501 r14 0x59093efbf180 r15 0x1 rip 0x59093d02f9c1 rflags 0x286 cs 0x33 fs 0x0 gs 0x0

@fmu83 commented on GitHub (Jan 17, 2026):

I tried it with a smaller context (4096) as well, with the same result: every few hours the model hangs and maxes out one CPU core.

@arlaneenalra commented on GitHub (Jan 17, 2026):

Note: from what I'm seeing, you'd have to push the context larger, not smaller :( ... or shrink the prompt.

So far, updating to Ubuntu 25.10 has helped with overall stability, but I'm still seeing it drop into the one-core-at-100% state. From what I've been able to tell, it seems to fail any time server-side truncation happens.

I discovered this kind of by accident because I had misconfigured the context size on a model trained with a 40k context. I had expected the server to allow the extended context, but instead it clamped to 40k and triggered this chain of logs:

Jan 16 23:36:07 framework ollama[456332]: time=2026-01-16T23:36:07.402Z level=INFO source=runner.go:1278 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:40960 KvCacheType: NumThreads:16 GPULayers:6>
Jan 16 23:36:07 framework ollama[456332]: time=2026-01-16T23:36:07.403Z level=INFO source=ggml.go:482 msg="offloading 64 repeating layers to GPU"
Jan 16 23:36:07 framework ollama[456332]: time=2026-01-16T23:36:07.403Z level=INFO source=ggml.go:489 msg="offloading output layer to GPU"
Jan 16 23:36:07 framework ollama[456332]: time=2026-01-16T23:36:07.403Z level=INFO source=ggml.go:494 msg="offloaded 65/65 layers to GPU"
Jan 16 23:36:07 framework ollama[456332]: time=2026-01-16T23:36:07.403Z level=INFO source=device.go:240 msg="model weights" device=Vulkan0 size="18.4 GiB"
Jan 16 23:36:07 framework ollama[456332]: time=2026-01-16T23:36:07.403Z level=INFO source=device.go:245 msg="model weights" device=CPU size="417.3 MiB"
Jan 16 23:36:07 framework ollama[456332]: time=2026-01-16T23:36:07.403Z level=INFO source=device.go:251 msg="kv cache" device=Vulkan0 size="10.0 GiB"
Jan 16 23:36:07 framework ollama[456332]: time=2026-01-16T23:36:07.403Z level=INFO source=device.go:262 msg="compute graph" device=Vulkan0 size="276.0 MiB"
Jan 16 23:36:07 framework ollama[456332]: time=2026-01-16T23:36:07.403Z level=INFO source=device.go:267 msg="compute graph" device=CPU size="10.0 MiB"
Jan 16 23:36:07 framework ollama[456332]: time=2026-01-16T23:36:07.403Z level=INFO source=device.go:272 msg="total memory" size="29.1 GiB"
Jan 16 23:36:07 framework ollama[456332]: time=2026-01-16T23:36:07.403Z level=INFO source=sched.go:526 msg="loaded runners" count=1
Jan 16 23:36:07 framework ollama[456332]: time=2026-01-16T23:36:07.403Z level=INFO source=server.go:1347 msg="waiting for llama runner to start responding"
Jan 16 23:36:07 framework ollama[456332]: time=2026-01-16T23:36:07.403Z level=INFO source=server.go:1381 msg="waiting for server to become available" status="llm server loading model"
Jan 16 23:36:14 framework ollama[456332]: time=2026-01-16T23:36:14.423Z level=INFO source=server.go:1385 msg="llama runner started in 8.78 seconds"
Jan 16 23:36:14 framework ollama[456332]: time=2026-01-16T23:36:14.477Z level=WARN source=runner.go:186 msg="truncating input prompt" limit=40960 prompt=41013 keep=4 new=40960
Jan 17 02:12:25 framework systemd[1]: Stopping ollama.service - Ollama Service...

The restart at the end is me manually restarting the Ollama Service.

I've been slowly upgrading as new versions have been released, and while this issue has changed slightly in character, it seems to keep happening any time something in the server decides to truncate the context window. The only way I've been able to avoid it is picking models that have a large enough context window and making sure my num_ctx settings are large enough to avoid server-side truncation.
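
For anyone wanting the same workaround: the context size can be raised per request via the options field of the generate API (or globally via OLLAMA_CONTEXT_LENGTH). A minimal Go sketch, with the model name and context size as placeholders; a larger num_ctx only sidesteps the truncation path, it does not fix the underlying hang:

```go
// Sketch: request a larger context per call via options.num_ctx so the
// server never has to truncate. Assumes a local Ollama on the default port.
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	body := []byte(`{
		"model": "gpt-oss:20b-ctx16384",
		"prompt": "ping",
		"stream": false,
		"options": {"num_ctx": 16384}
	}`)
	resp, err := http.Post("http://localhost:11434/api/generate",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}
```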

I kind of wish there was a way to just set a global context of 128k/256k and have it active no matter the trained context of the underlying model, but that would really only be a stopgap to work around the problem, not a solution to what's actually going on.

What I'm seeing now could be something else, since I'm not seeing the crash after updating the host OS...


@fmu83 commented on GitHub (Jan 18, 2026):

Thanks for your update. I updated everything I could on my side (Intel GPU firmware, Vulkan, Mesa) to the latest versions available for Debian/Ubuntu (version numbers are in my post above). For kernel updates I'm limited, because Ollama runs inside a container on Proxmox, so I have to use the Proxmox kernel (Linux 6.17.4-2-pve).

The hang happens a few minutes after the truncation log message appears. I built a watchdog that restarts Ollama once it becomes unresponsive; a sketch of the idea follows.
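
Not the exact watchdog used here, just a minimal Go sketch of the idea, assuming a systemd-managed service on the default port and a placeholder model name. It deliberately probes an inference endpoint, since /api/tags and /api/ps stay responsive even while the runner is wedged:

```go
// Hypothetical watchdog sketch: probe generation with a hard timeout and
// restart the systemd unit when inference stops responding.
package main

import (
	"bytes"
	"context"
	"log"
	"net/http"
	"os/exec"
	"time"
)

func probe() error {
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()
	body := []byte(`{"model":"gpt-oss:20b-ctx16384","prompt":"ping","stream":false}`)
	req, err := http.NewRequestWithContext(ctx, http.MethodPost,
		"http://localhost:11434/api/generate", bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err // a timeout here is the "wedged runner" signal
	}
	resp.Body.Close()
	return nil
}

func main() {
	for {
		if err := probe(); err != nil {
			log.Printf("inference probe failed (%v); restarting ollama", err)
			_ = exec.Command("systemctl", "restart", "ollama").Run()
		}
		time.Sleep(5 * time.Minute)
	}
}
```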

In the meantime, I did an analysis of possible root causes:

When using the Vulkan backend on an Intel GPU, the runner becomes permanently stuck after a request that triggers context truncation / KV cache shifting. The Ollama server remains responsive for lightweight endpoints (/api/tags, /api/ps), but all inference requests (ollama run, /api/generate, /v1/chat/completions) hang indefinitely until the service is restarted.

Environment

• Kernel: Linux 6.17.4-2-pve
• GPU: Intel B50 Pro (16 GB VRAM) (Vulkan)
• Ollama: ollama ps shows model loaded on 100% GPU
• Model: gpt-oss:20b-ctx16384 (context 16384)

Reproduction

  1. Run the model on the Vulkan backend (Intel GPU).
  2. Send a very long prompt/history that exceeds the context size, e.g. context limit 16384.
  3. Observe the Ollama log warning about truncation:
     truncating input prompt limit=16384 prompt=27835 keep=4 new=16384
  4. After this, inference endpoints hang:
    o ollama run gpt-oss:20b-ctx16384 → no output / never returns
    o /api/generate or /v1/chat/completions → request hangs / times out
  5. Control endpoints still work:
    o /api/tags, /api/ps return immediately

Expected

• Either the request completes, or it fails cleanly with an error (timeout / “context too large” / etc.).
• Subsequent inference requests should still work (or the runner should restart automatically).

Actual

• Runner gets stuck forever. Only restarting Ollama fixes it.
• CPU shows one core pegged at 100% (busy loop / stuck state), while the model remains listed as loaded on GPU.

Stack trace (SIGQUIT)

Key parts: one goroutine stuck for a long time inside a cgo call to ggml Vulkan compute, while other goroutines wait on a mutex in completion/computeBatch.

goroutine 10 ... [syscall, 37 minutes]:
runtime.cgocall(...)
github.com/ollama/ollama/ml/backend/ggml._Cfunc_ggml_backend_sched_graph_compute_async(...)
github.com/ollama/ollama/ml/backend/ggml.(*Context).ComputeWithNotify(...)
github.com/ollama/ollama/kvcache.(*Causal).shift(...)
github.com/ollama/ollama/kvcache.(*Causal).Remove(...)
github.com/ollama/ollama/runner/ollamarunner.(*InputCache).ShiftCacheSlot(...)
github.com/ollama/ollama/runner/ollamarunner.(*Server).forwardBatch(...)
github.com/ollama/ollama/runner/ollamarunner.(*Server).run(...)

goroutine ... [sync.Mutex.Lock, ...]:
github.com/ollama/ollama/runner/ollamarunner.(*Server).completion(...)
github.com/ollama/ollama/runner/ollamarunner.(*Server).computeBatch(...)
(Full trace attached above.)
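
To make the wedge concrete, here is a toy Go illustration of that pattern, with a sleep standing in for the Vulkan compute call that never returns. It is not Ollama code, just the lock-held-across-a-blocking-call shape visible in the trace:

```go
// One goroutine holds a shared mutex across a call that never returns, so
// every later request can only wait on Mutex.Lock, exactly like the
// completion/computeBatch goroutines in the SIGQUIT trace.
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.Mutex

	// "Runner" goroutine: takes the lock, then blocks inside "compute".
	go func() {
		mu.Lock()
		time.Sleep(time.Hour) // stand-in for the wedged cgo/Vulkan call
	}()

	time.Sleep(100 * time.Millisecond)

	// "Completion" goroutine: needs the same lock, so it can only wait.
	if !mu.TryLock() {
		fmt.Println("inference request blocked: lock held by stuck compute goroutine")
	}
}
```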

Hypothesis / Root cause

• The trigger appears to be context truncation / KV cache shifting (ShiftCacheSlot → Causal.shift/Remove).
• During this path, the Vulkan backend enters ggml_backend_sched_graph_compute_async(...) and never returns (likely GPU/driver/backend hang).
• Because the runner holds or requires shared locks, other requests block on mutexes, effectively hanging inference globally.

Proposed fixes / improvements

1) Add a runner-side watchdog/timeout for GPU compute
If a single compute call (or forward batch) exceeds a configured deadline (a sketch follows this list):
• abort the request and return an error
• reset the backend/runner state (or terminate and restart the runner process)
Even if Vulkan/driver hangs, Ollama should recover automatically instead of staying permanently wedged.
2) Improve failure handling around async compute
• Ensure return codes/errors from ggml_backend_sched_graph_compute_async (and related functions) are always checked and propagated.
• If Vulkan returns device lost / error, force a backend reset.
3) Reduce global lock contention so one stuck compute does not block all requests
• Move compute to a worker/queue model and avoid holding global mutexes across long-running operations.
• Make completion/computeBatch resilient to a stuck compute path (e.g., request-scoped cancellation, lock-free state transitions).
4) Mitigation in the meantime (client-side)
• Avoid triggering KV-cache shift by keeping request history below num_ctx (limit chat history / summarise history / chunking).
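
For fix 1, the shape might look like the following Go sketch; computeWithDeadline and computeFn are hypothetical names, not Ollama's real API. Since a blocked cgo call cannot be cancelled from Go, the timeout can only report failure so that a supervisor tears the runner process down:

```go
package main

import (
	"fmt"
	"time"
)

// computeWithDeadline wraps a blocking compute call (here a stand-in for the
// cgo graph-compute call) with a deadline. Hypothetical sketch only.
func computeWithDeadline(d time.Duration, computeFn func() error) error {
	done := make(chan error, 1)
	go func() { done <- computeFn() }()
	select {
	case err := <-done:
		return err
	case <-time.After(d):
		// A wedged cgo call cannot be killed from Go; the goroutine stays
		// pinned. The caller must treat this as fatal and restart the runner.
		return fmt.Errorf("compute exceeded %s deadline; restart the runner", d)
	}
}

func main() {
	err := computeWithDeadline(time.Second, func() error {
		time.Sleep(time.Hour) // simulate a hung Vulkan compute
		return nil
	})
	fmt.Println(err)
}
```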

Notes

• After the hang, /api/tags and /api/ps remain responsive, but any inference hangs indefinitely.
• Restarting ollama restores functionality.


@svenstaro commented on GitHub (Feb 2, 2026):

I think the title should be amended with something like "on Vulkan", because it doesn't appear on ROCm, at least for me.


Reference: github-starred/ollama#34642