[GH-ISSUE #14094] Title: Ollama fails to use GPU for any model on Ubuntu 22.04.5 LTS despite CUDA being installed #55715

Closed
opened 2026-04-29 09:37:42 -05:00 by GiteaMirror · 4 comments

Originally created by @gymm-betrayer on GitHub (Feb 5, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14094

What is the issue?

We are running Ollama on Ubuntu 22.04.5 LTS with NVIDIA RTX 4090 GPUs (24 GB VRAM). However, no models, not even small ones like qwen3:8b, are offloaded to the GPU; all inference runs entirely on the CPU.
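
A quick sanity check, assuming the standard NVIDIA driver tooling is installed, is to confirm the driver itself can see the card before looking at Ollama:

nvidia-smi
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv

If nvidia-smi fails or lists no device, the problem is at the driver level rather than in Ollama.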

Relevant log output

time=2026-02-05T22:08:17.055+08:00 level=INFO source=ggml.go:136 msg="" architecture=qwen3vlmoe file_type=Q4_K_M name="" description="" num_tensors=1038 num_key_values=43
time=2026-02-05T22:08:17.055+08:00 level=INFO source=ggml.go:104 msg=system CPU.0.LLAMAFILE=1 compiler=cgo(gcc)
time=2026-02-05T22:08:17.694+08:00 level=INFO source=runner.go:1278 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:72 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-05T22:08:18.625+08:00 level=INFO source=runner.go:1278 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:72 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-05T22:08:18.625+08:00 level=INFO source=ggml.go:482 msg="offloading 0 repeating layers to GPU"
time=2026-02-05T22:08:18.626+08:00 level=INFO source=ggml.go:486 msg="offloading output layer to CPU"
time=2026-02-05T22:08:18.626+08:00 level=INFO source=ggml.go:494 msg="offloaded 0/49 layers to GPU"
time=2026-02-05T22:08:18.626+08:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="18.2 GiB"
time=2026-02-05T22:08:18.626+08:00 level=INFO source=device.go:256 msg="kv cache" device=CPU size="384.0 MiB"
time=2026-02-05T22:08:18.626+08:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="364.0 MiB"
time=2026-02-05T22:08:18.626+08:00 level=INFO source=device.go:272 msg="total memory" size="19.0 GiB"
time=2026-02-05T22:08:18.626+08:00 level=INFO source=sched.go:517 msg="loaded runners" count=1
time=2026-02-05T22:08:18.626+08:00 level=INFO source=server.go:1338 msg="waiting for llama runner to start responding"
time=2026-02-05T22:08:18.626+08:00 level=INFO source=server.go:1372 msg="waiting for server to become available" status="llm server loading model"
time=2026-02-05T22:08:19.637+08:00 level=INFO source=server.go:1376 msg="llama runner started in 2.71 seconds"

OS

Linux

GPU

Nvidia

CPU

No response

Ollama version

0.13.5

GiteaMirror added the bug, needs more info labels 2026-04-29 09:37:42 -05:00

@rick-github commented on GitHub (Feb 5, 2026):

time=2026-02-05T22:08:17.055+08:00 level=INFO source=ggml.go:104 msg=system CPU.0.LLAMAFILE=1 compiler=cgo(gcc)

No accelerated backends found. Post the full log from the start.
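
If the driver checks out, a next step (assuming the default Linux install layout; the log shows the binary at /usr/local/bin/ollama) is to verify that the bundled GPU backend libraries are actually on disk next to the binary:

ls /usr/local/lib/ollama

A listing with only CPU libraries there would be consistent with the CPU-only system line above.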


@gymm-betrayer commented on GitHub (Feb 5, 2026):

time=2026-02-05T22:08:02.541+08:00 level=INFO source=routes.go:1554 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GGML_VK_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/home/host/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false OLLAMA_VULKAN:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2026-02-05T22:08:02.587+08:00 level=INFO source=images.go:493 msg="total blobs: 6"
time=2026-02-05T22:08:02.587+08:00 level=INFO source=images.go:500 msg="total unused blobs removed: 0"
time=2026-02-05T22:08:02.587+08:00 level=INFO source=routes.go:1607 msg="Listening on 127.0.0.1:11434 (version 0.13.5)"
time=2026-02-05T22:08:02.588+08:00 level=INFO source=runner.go:67 msg="discovering available GPUs..."
time=2026-02-05T22:08:02.589+08:00 level=INFO source=server.go:429 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 44581"
time=2026-02-05T22:08:02.642+08:00 level=INFO source=types.go:60 msg="inference compute" id=cpu library=cpu compute="" name=cpu description=cpu libdirs=ollama driver="" pci_id="" type="" total="251.5 GiB" available="234.5 GiB"
time=2026-02-05T22:08:02.642+08:00 level=INFO source=routes.go:1648 msg="entering low vram mode" "total vram"="0 B" threshold="20.0 GiB"
time=2026-02-05T22:08:16.923+08:00 level=INFO source=server.go:245 msg="enabling flash attention"
time=2026-02-05T22:08:16.923+08:00 level=INFO source=server.go:429 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /home/host/.ollama/models/blobs/sha256-b1da6f96a2e40e5db05b6066d799c69411225b336bfa20ef1b002c223ed4b190 --port 39601"
time=2026-02-05T22:08:16.924+08:00 level=INFO source=sched.go:443 msg="system memory" total="251.5 GiB" free="234.2 GiB" free_swap="0 B"
time=2026-02-05T22:08:16.924+08:00 level=INFO source=server.go:746 msg="loading model" "model layers"=49 requested=-1
time=2026-02-05T22:08:16.955+08:00 level=INFO source=runner.go:1405 msg="starting ollama engine"
time=2026-02-05T22:08:16.955+08:00 level=INFO source=runner.go:1440 msg="Server listening on 127.0.0.1:39601"
time=2026-02-05T22:08:16.960+08:00 level=INFO source=runner.go:1278 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:72 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-05T22:08:17.055+08:00 level=INFO source=ggml.go:136 msg="" architecture=qwen3vlmoe file_type=Q4_K_M name="" description="" num_tensors=1038 num_key_values=43
time=2026-02-05T22:08:17.055+08:00 level=INFO source=ggml.go:104 msg=system CPU.0.LLAMAFILE=1 compiler=cgo(gcc)
time=2026-02-05T22:08:17.694+08:00 level=INFO source=runner.go:1278 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:72 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-05T22:08:18.625+08:00 level=INFO source=runner.go:1278 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:72 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-05T22:08:18.625+08:00 level=INFO source=ggml.go:482 msg="offloading 0 repeating layers to GPU"
time=2026-02-05T22:08:18.626+08:00 level=INFO source=ggml.go:486 msg="offloading output layer to CPU"
time=2026-02-05T22:08:18.626+08:00 level=INFO source=ggml.go:494 msg="offloaded 0/49 layers to GPU"
time=2026-02-05T22:08:18.626+08:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="18.2 GiB"
time=2026-02-05T22:08:18.626+08:00 level=INFO source=device.go:256 msg="kv cache" device=CPU size="384.0 MiB"
time=2026-02-05T22:08:18.626+08:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="364.0 MiB"
time=2026-02-05T22:08:18.626+08:00 level=INFO source=device.go:272 msg="total memory" size="19.0 GiB"
time=2026-02-05T22:08:18.626+08:00 level=INFO source=sched.go:517 msg="loaded runners" count=1
time=2026-02-05T22:08:18.626+08:00 level=INFO source=server.go:1338 msg="waiting for llama runner to start responding"
time=2026-02-05T22:08:18.626+08:00 level=INFO source=server.go:1372 msg="waiting for server to become available" status="llm server loading model"
time=2026-02-05T22:08:19.637+08:00 level=INFO source=server.go:1376 msg="llama runner started in 2.71 seconds"
[GIN] 2026/02/05 - 22:18:16 | 500 | 10m0s | 127.0.0.1 | POST "/api/chat"
[GIN] 2026/02/05 - 22:23:28 | 500 | 5m0s | 127.0.0.1 | POST "/api/chat"
[GIN] 2026/02/05 - 23:01:01 | 404 | 812.961µs | 127.0.0.1 | POST "/api/generate"
time=2026-02-05T23:01:12.940+08:00 level=INFO source=server.go:245 msg="enabling flash attention"
time=2026-02-05T23:01:12.941+08:00 level=INFO source=server.go:429 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /home/host/.ollama/models/blobs/sha256-b1da6f96a2e40e5db05b6066d799c69411225b336bfa20ef1b002c223ed4b190 --port 40739"
time=2026-02-05T23:01:12.941+08:00 level=INFO source=sched.go:443 msg="system memory" total="251.5 GiB" free="234.3 GiB" free_swap="0 B"
time=2026-02-05T23:01:12.941+08:00 level=INFO source=server.go:746 msg="loading model" "model layers"=49 requested=-1
time=2026-02-05T23:01:12.972+08:00 level=INFO source=runner.go:1405 msg="starting ollama engine"
time=2026-02-05T23:01:12.973+08:00 level=INFO source=runner.go:1440 msg="Server listening on 127.0.0.1:40739"
time=2026-02-05T23:01:12.975+08:00 level=INFO source=runner.go:1278 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:72 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-05T23:01:13.061+08:00 level=INFO source=ggml.go:136 msg="" architecture=qwen3vlmoe file_type=Q4_K_M name="" description="" num_tensors=1038 num_key_values=43
time=2026-02-05T23:01:13.062+08:00 level=INFO source=ggml.go:104 msg=system CPU.0.LLAMAFILE=1 compiler=cgo(gcc)
time=2026-02-05T23:01:13.702+08:00 level=INFO source=runner.go:1278 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:72 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-05T23:01:14.654+08:00 level=INFO source=runner.go:1278 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:72 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-05T23:01:14.655+08:00 level=INFO source=ggml.go:482 msg="offloading 0 repeating layers to GPU"
time=2026-02-05T23:01:14.655+08:00 level=INFO source=ggml.go:486 msg="offloading output layer to CPU"
time=2026-02-05T23:01:14.655+08:00 level=INFO source=ggml.go:494 msg="offloaded 0/49 layers to GPU"
time=2026-02-05T23:01:14.655+08:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="18.2 GiB"
time=2026-02-05T23:01:14.655+08:00 level=INFO source=device.go:256 msg="kv cache" device=CPU size="384.0 MiB"
time=2026-02-05T23:01:14.655+08:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="364.0 MiB"
time=2026-02-05T23:01:14.655+08:00 level=INFO source=device.go:272 msg="total memory" size="19.0 GiB"


@gymm-betrayer commented on GitHub (Feb 5, 2026):

@rick-github This is our full log. Please check.


@rick-github commented on GitHub (Feb 5, 2026):

Set OLLAMA_DEBUG=2 in the server environment (see https://github.com/ollama/ollama/blob/main/docs/faq.mdx#setting-environment-variables-on-linux), restart the ollama server, and post the output of:

journalctl -u ollama --no-pager --since "$(systemctl show ollama --property=ActiveEnterTimestamp --value)" | sed -ne '/server config/,/inference compute/p'
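
On a systemd-managed install, a minimal sketch of setting that variable via a drop-in override (the approach the linked FAQ describes):

sudo systemctl edit ollama
# in the editor that opens, add:
#   [Service]
#   Environment="OLLAMA_DEBUG=2"
# saving the override reloads systemd; then restart:
sudo systemctl restart ollama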

Reference: github-starred/ollama#55715