[GH-ISSUE #15641] gemma4: flash attention disabled -- GPU: Tesla V100 -- Ollama version 0.20.7 #35737

Open
opened 2026-04-22 20:25:38 -05:00 by GiteaMirror · 1 comment
Owner

Originally created by @swk117 on GitHub (Apr 17, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15641

What is the issue?

I tested two models, gemma4:31b-it-q8_0 and gemma4:26b-a4b-it-q8_0, and flash attention stays disabled for both. The environment is a server with two Tesla V100-32G GPUs running Ollama 0.20.7, which as far as I can tell claims to support flash attention for Gemma 4.
On the same server I also tested qwen3.5:27b-q8_0 and qwen3.6:35b-a3b-q8_0, and flash attention is enabled for those.
Ubuntu 22.04.5 LTS; CUDA version 12.2; Ollama 0.20.7.
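
For reference, a minimal sketch of how flash attention and a quantized KV cache are requested on a systemd install, and how to check what the runner actually applied; the override contents shown are illustrative, not an exact copy of this server's unit file:

```shell
# Sketch only: request flash attention and a q8_0 KV cache via environment
# overrides on a systemd install (values here are illustrative).
sudo systemctl edit ollama.service
#   [Service]
#   Environment="OLLAMA_FLASH_ATTENTION=1"
#   Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
sudo systemctl restart ollama

# After loading a model, inspect what the runner picked up:
journalctl -u ollama --no-pager | grep -iE "flash attention|FlashAttention|kv cache"
# For gemma4 the load request logs FlashAttention:Disabled and warns that a
# q8_0 KV cache was requested while flash attention is disabled; the qwen
# models log flash attention as enabled on the same host.
```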

Relevant log output

Apr 17 11:17:33 LLM-T01-Server ollama[256526]: [GIN] 2026/04/17 - 11:17:33 | 200 |       89.75µs |       127.0.0.1 | HEAD     "/"
Apr 17 11:17:33 LLM-T01-Server ollama[256526]: [GIN] 2026/04/17 - 11:17:33 | 200 |     268.722µs |       127.0.0.1 | GET      "/api/ps"
Apr 17 11:17:35 LLM-T01-Server ollama[256526]: [GIN] 2026/04/17 - 11:17:35 | 200 |      67.843µs |       127.0.0.1 | HEAD     "/"
Apr 17 11:17:35 LLM-T01-Server ollama[256526]: [GIN] 2026/04/17 - 11:17:35 | 200 |      26.107µs |       127.0.0.1 | GET      "/api/ps"
Apr 17 11:17:39 LLM-T01-Server ollama[256526]: [GIN] 2026/04/17 - 11:17:39 | 200 |      48.112µs |       127.0.0.1 | HEAD     "/"
Apr 17 11:17:40 LLM-T01-Server ollama[256526]: [GIN] 2026/04/17 - 11:17:40 | 200 |  666.375868ms |       127.0.0.1 | POST     "/api/show"
Apr 17 11:17:40 LLM-T01-Server ollama[256526]: [GIN] 2026/04/17 - 11:17:40 | 200 |  671.534583ms |       127.0.0.1 | POST     "/api/show"
Apr 17 11:17:41 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:41.703+08:00 level=INFO source=server.go:444 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 46699"
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.615+08:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.615+08:00 level=WARN source=server.go:270 msg="quantized kv cache requested but flash attention disabled" type=q8_0
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.616+08:00 level=INFO source=server.go:444 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /mnt/data/ollama/models/blobs/sha256-a0feadb736f521df6de4b1bd3cbf06c00f9fd04570ddc1e47b8ec9ecbbd6b51d --port 35557"
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.616+08:00 level=INFO source=sched.go:484 msg="system memory" total="503.7 GiB" free="498.6 GiB" free_swap="8.0 GiB"
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.616+08:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-6d4e40b5-a4e2-5758-8195-4f0cae3b229f library=CUDA available="31.3 GiB" free="31.7 GiB" minimum="457.0 MiB" overhead="0 B"
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.616+08:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-8903761c-f5a9-23c1-398c-0536a7886912 library=CUDA available="31.0 GiB" free="31.4 GiB" minimum="457.0 MiB" overhead="0 B"
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.616+08:00 level=INFO source=server.go:771 msg="loading model" "model layers"=61 requested=-1
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.641+08:00 level=INFO source=runner.go:1417 msg="starting ollama engine"
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.641+08:00 level=INFO source=runner.go:1452 msg="Server listening on 127.0.0.1:35557"
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.651+08:00 level=INFO source=runner.go:1290 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:87040 KvCacheType: NumThreads:36 GPULayers:61[ID:GPU-6d4e40b5-a4e2-5758-8195-4f0cae3b229f Layers:61(0..60)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.783+08:00 level=INFO source=ggml.go:136 msg="" architecture=gemma4 file_type=Q8_0 name="" description="" num_tensors=1189 num_key_values=49
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: ggml_cuda_init: found 2 CUDA devices:
Apr 17 11:17:42 LLM-T01-Server ollama[256526]:   Device 0: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, ID: GPU-6d4e40b5-a4e2-5758-8195-4f0cae3b229f
Apr 17 11:17:42 LLM-T01-Server ollama[256526]:   Device 1: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, ID: GPU-8903761c-f5a9-23c1-398c-0536a7886912
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.975+08:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.996+08:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
Apr 17 11:17:43 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:43.039+08:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=3.971884ms bounds=(0,0)-(2048,2048)
Apr 17 11:17:43 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:43.353+08:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=313.311386ms size="[768 768]"
Apr 17 11:17:43 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:43.353+08:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
Apr 17 11:17:43 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:43.353+08:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
Apr 17 11:17:43 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:43.355+08:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=319.29888ms shape="[5376 256]"
Apr 17 11:17:44 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:44.212+08:00 level=INFO source=runner.go:1290 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:87040 KvCacheType: NumThreads:36 GPULayers:61[ID:GPU-6d4e40b5-a4e2-5758-8195-4f0cae3b229f Layers:28(0..27) ID:GPU-8903761c-f5a9-23c1-398c-0536a7886912 Layers:33(28..60)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Apr 17 11:17:44 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:44.378+08:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
Apr 17 11:17:44 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:44.411+08:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=2.413845ms bounds=(0,0)-(2048,2048)
Apr 17 11:17:44 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:44.696+08:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=285.244238ms size="[768 768]"
Apr 17 11:17:44 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:44.696+08:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
Apr 17 11:17:44 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:44.696+08:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
Apr 17 11:17:44 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:44.698+08:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=289.605084ms shape="[5376 256]"
Apr 17 11:17:45 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:45.503+08:00 level=INFO source=runner.go:1290 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:87040 KvCacheType: NumThreads:36 GPULayers:61[ID:GPU-6d4e40b5-a4e2-5758-8195-4f0cae3b229f Layers:32(0..31) ID:GPU-8903761c-f5a9-23c1-398c-0536a7886912 Layers:29(32..60)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Apr 17 11:17:45 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:45.694+08:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
Apr 17 11:17:45 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:45.742+08:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=6.097521ms bounds=(0,0)-(2048,2048)
Apr 17 11:17:46 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:46.045+08:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=303.313213ms size="[768 768]"
Apr 17 11:17:46 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:46.046+08:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
Apr 17 11:17:46 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:46.046+08:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
Apr 17 11:17:46 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:46.047+08:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=311.410631ms shape="[5376 256]"
Apr 17 11:17:46 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:46.665+08:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:87040 KvCacheType: NumThreads:36 GPULayers:61[ID:GPU-6d4e40b5-a4e2-5758-8195-4f0cae3b229f Layers:32(0..31) ID:GPU-8903761c-f5a9-23c1-398c-0536a7886912 Layers:29(32..60)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Apr 17 11:17:46 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:46.869+08:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
Apr 17 11:17:46 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:46.933+08:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=4.648597ms bounds=(0,0)-(2048,2048)
Apr 17 11:17:47 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:47.172+08:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=238.741814ms size="[768 768]"
Apr 17 11:17:47 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:47.176+08:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
Apr 17 11:17:47 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:47.176+08:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
Apr 17 11:17:47 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:47.177+08:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=248.797295ms shape="[5376 256]"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.094+08:00 level=INFO source=runner.go:1290 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:87040 KvCacheType: NumThreads:36 GPULayers:61[ID:GPU-6d4e40b5-a4e2-5758-8195-4f0cae3b229f Layers:32(0..31) ID:GPU-8903761c-f5a9-23c1-398c-0536a7886912 Layers:29(32..60)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=ggml.go:482 msg="offloading 60 repeating layers to GPU"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=ggml.go:489 msg="offloading output layer to GPU"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=ggml.go:494 msg="offloaded 61/61 layers to GPU"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="15.4 GiB"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=device.go:240 msg="model weights" device=CUDA1 size="16.0 GiB"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="1.5 GiB"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="5.2 GiB"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA1 size="4.9 GiB"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="8.3 GiB"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA1 size="8.2 GiB"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="10.5 MiB"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=device.go:272 msg="total memory" size="59.6 GiB"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=sched.go:561 msg="loaded runners" count=1
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=server.go:1364 msg="waiting for llama runner to start responding"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.096+08:00 level=INFO source=server.go:1398 msg="waiting for server to become available" status="llm server loading model"
Apr 17 11:18:02 LLM-T01-Server ollama[256526]: time=2026-04-17T11:18:02.173+08:00 level=INFO source=server.go:1402 msg="llama runner started in 19.56 seconds"
Apr 17 11:18:02 LLM-T01-Server ollama[256526]: [GIN] 2026/04/17 - 11:18:02 | 200 | 21.225892456s |       127.0.0.1 | POST     "/api/generate"
Apr 17 11:19:06 LLM-T01-Server ollama[256526]: [GIN] 2026/04/17 - 11:19:06 | 200 |       64.15µs |       127.0.0.1 | HEAD     "/"
Apr 17 11:19:06 LLM-T01-Server ollama[256526]: [GIN] 2026/04/17 - 11:19:06 | 200 |      51.718µs |       127.0.0.1 | GET      "/api/ps"

OS

No response

GPU

No response

CPU

No response

Ollama version

No response

GiteaMirror added the bug label 2026-04-22 20:25:38 -05:00
Author
Owner

@PureBlissAK commented on GitHub (Apr 18, 2026):

🤖 Automated Triage & Analysis Report

Issue: #15641
Analyzed: 2026-04-18T18:13:43.316965

Analysis

  • Type: unknown
  • Severity: medium
  • Components: unknown

Implementation Plan

  • Effort: medium
  • Steps:

This issue has been triaged and marked for implementation.

Reference: github-starred/ollama#35737