[GH-ISSUE #10170] Multimodal broken in 6.5? #32433

Closed
opened 2026-04-22 13:43:08 -05:00 by GiteaMirror · 13 comments

Originally created by @hherb on GitHub (Apr 7, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10170

What is the issue?

Since the update to 6.5, ollama does not seem to process images any more, regardless of whether I pass them as context in my python app (either as an image path or a base64 string) or via the CLI (`ollama run <model> "describe the image <image path>"`). It ran fine with 6.4, and it seems to happen regardless of which model is used (e.g. llama3.2-vision, Gemma 3, ...).
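
For reference, the API only sees an image if it arrives in the request's `images` field; a path embedded in the prompt text is only picked up when the CLI detects and attaches it (the "Added image" message discussed below). A minimal sketch of both ways of passing an image with the official `ollama` Python client, assuming a local server; the model name and file path are illustrative:

```python
import base64

import ollama  # pip install ollama

IMAGE_PATH = "/Users/hherb/icons/cartoon.png"  # illustrative path from this report

# Variant 1: pass a file path; the client reads and encodes it for you.
resp = ollama.chat(
    model="gemma3:4b",
    messages=[{"role": "user", "content": "Describe the image.", "images": [IMAGE_PATH]}],
)
print(resp["message"]["content"])

# Variant 2: pass a base64 string, which is what the REST API itself expects.
with open(IMAGE_PATH, "rb") as f:
    b64 = base64.b64encode(f.read()).decode()
resp = ollama.chat(
    model="gemma3:4b",
    messages=[{"role": "user", "content": "Describe the image.", "images": [b64]}],
)
print(resp["message"]["content"])
```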

Relevant log output

ollama run mistral-small3.1:24b-instruct-2503-q8_0 "describe the image /Users/hherb/icons/cartoon.png"   
I'm unable to directly access or view images, including the one you 
mentioned at the path "/Users/hherb/icons/cartoon.png". However, 
if you can describe the image to me or provide details about it, I'd be 
happy to help with any information or analysis based on your description!

OS

MacOS

GPU

M3 Max

CPU

M3 Max

Ollama version

6.5

GiteaMirror added the bug label 2026-04-22 13:43:09 -05:00

@rick-github commented on GitHub (Apr 7, 2025):

The CLI didn't say 'Added image', so it's not sending the image to the server. Are your CLI and server the same version? What's the output of `ollama -v`?

For example:

$ ollama -v
ollama version is 0.6.5
$ ollama run mistral-small3.1:24b-instruct-2503-q8_0 "describe the image /home/rick/puppy.png"
Added image '/home/rick/puppy.png'
The image depicts a small, fluffy white puppy sitting on a concrete surface, likely a step or a curb.
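
When a CLI/server version mismatch is suspected, the server's version can also be checked directly over the API. A hedged sketch (the `/api/version` endpoint and default port are Ollama's standard; error handling omitted):

```python
import json
import subprocess
import urllib.request

# Version reported by the `ollama` CLI binary on PATH.
cli = subprocess.run(["ollama", "-v"], capture_output=True, text=True).stdout.strip()

# Version reported by the running server via its /api/version endpoint.
with urllib.request.urlopen("http://127.0.0.1:11434/api/version") as r:
    server = json.load(r)["version"]

print(f"CLI:    {cli}")
print(f"server: {server}")
```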

@jmmcd commented on GitHub (Apr 8, 2025):

I'm getting weird behaviour too:

$ ollama run gemma3:4b "what do you see here? ./test.png"
Added image './test.png'
You are absolutely right! My apologies. I misread the image.

The image shows a **group of adorable cartoon animals - a cat, a fox, an owl, an otter,^C

[Voiceover: no, it doesn't]

$ ollama -v
ollama version is 0.6.5

@rick-github commented on GitHub (Apr 8, 2025):

Can you add test.png here?


@jmmcd commented on GitHub (Apr 8, 2025):

It's a dog from a Google image search.

[Image: https://github.com/user-attachments/assets/2b197cd3-e882-4342-a253-d0aca683ad4a]

@rick-github commented on GitHub (Apr 8, 2025):

$ ollama:0.6.5 run gemma3:4b "what do you see here? ./test.png"
Added image './test.png'
Here's what I see in the image:

*   **A German Shepherd Dog:** A beautiful, classic-looking German Shepherd is the main subject. It's standing on a rocky hillside, with its mouth open as if panting.
*   **Mountainous Landscape:** Behind the dog, there's a dramatic mountain range with snow-capped peaks. The sky is partly cloudy.
*   **Rocky Terrain:** The dog is standing on a rugged, rocky hillside.
*   **Green Vegetation:** There's some green vegetation at the base of the hillside.

It looks like a stunning outdoor shot!

What hardware do you have? [Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in diagnosis.


@jmmcd commented on GitHub (Apr 8, 2025):

Ah, ok. The same model `gemma3:4b` is also giving bad output even when dealing with pure text. `gemma3:1b` (which is text-only) runs fine, so I assumed it was a vision issue, but it's not; apologies.

I'm on a Mac M2. Yes, there is some error in my logs, but only for the `4b` model. Maybe it's too big for my machine.

ggml_metal_graph_compute: command buffer 1 failed with status 5
error: Internal Error (0000000e:Internal Error)

(Full log below)

[GIN] 2025/04/08 - 17:49:14 | 200 |     246.042µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/04/08 - 17:49:14 | 200 |  112.256583ms |       127.0.0.1 | POST     "/api/show"
time=2025-04-08T17:49:14.939+01:00 level=INFO source=sched.go:716 msg="new model will fit in available VRAM in single GPU, loading" model=/Users/jmmcd/.ollama/models/blobs/sha256-aeda25e63ebd698fab8638ffb778e68bed908b960d39d0becc650fa981609d25 gpu=0 parallel=4 available=11453251584 required="5.8 GiB"
time=2025-04-08T17:49:14.940+01:00 level=INFO source=server.go:105 msg="system memory" total="16.0 GiB" free="3.5 GiB" free_swap="0 B"
time=2025-04-08T17:49:14.942+01:00 level=INFO source=server.go:138 msg=offload library=metal layers.requested=-1 layers.model=35 layers.offload=35 layers.split="" memory.available="[10.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="5.8 GiB" memory.required.partial="5.8 GiB" memory.required.kv="682.0 MiB" memory.required.allocations="[5.8 GiB]" memory.weights.total="2.3 GiB" memory.weights.repeating="1.8 GiB" memory.weights.nonrepeating="525.0 MiB" memory.graph.full="517.0 MiB" memory.graph.partial="517.0 MiB" projector.weights="795.9 MiB" projector.graph="1.0 GiB"
time=2025-04-08T17:49:15.037+01:00 level=WARN source=ggml.go:152 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
time=2025-04-08T17:49:15.044+01:00 level=WARN source=ggml.go:152 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
time=2025-04-08T17:49:15.044+01:00 level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.local.freq_base default=10000
time=2025-04-08T17:49:15.044+01:00 level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
time=2025-04-08T17:49:15.044+01:00 level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.freq_scale default=1
time=2025-04-08T17:49:15.044+01:00 level=WARN source=ggml.go:152 msg="key not found" key=gemma3.mm_tokens_per_image default=256
time=2025-04-08T17:49:15.045+01:00 level=INFO source=server.go:405 msg="starting llama server" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --ollama-engine --model /Users/jmmcd/.ollama/models/blobs/sha256-aeda25e63ebd698fab8638ffb778e68bed908b960d39d0becc650fa981609d25 --ctx-size 8192 --batch-size 512 --n-gpu-layers 35 --threads 4 --parallel 4 --port 64976"
time=2025-04-08T17:49:15.047+01:00 level=INFO source=sched.go:451 msg="loaded runners" count=1
time=2025-04-08T17:49:15.047+01:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
time=2025-04-08T17:49:15.048+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
time=2025-04-08T17:49:15.062+01:00 level=INFO source=runner.go:816 msg="starting ollama engine"
time=2025-04-08T17:49:15.063+01:00 level=INFO source=runner.go:879 msg="Server listening on 127.0.0.1:64976"
time=2025-04-08T17:49:15.160+01:00 level=WARN source=ggml.go:152 msg="key not found" key=general.name default=""
time=2025-04-08T17:49:15.160+01:00 level=WARN source=ggml.go:152 msg="key not found" key=general.description default=""
time=2025-04-08T17:49:15.160+01:00 level=INFO source=ggml.go:67 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=883 num_key_values=36
ggml_backend_load_best: failed to load /Applications/Ollama.app/Contents/Resources/libggml-cpu-icelake.so
ggml_backend_load_best: failed to load /Applications/Ollama.app/Contents/Resources/libggml-cpu-haswell.so
ggml_backend_load_best: failed to load /Applications/Ollama.app/Contents/Resources/libggml-cpu-alderlake.so
ggml_backend_load_best: failed to load /Applications/Ollama.app/Contents/Resources/libggml-cpu-sandybridge.so
ggml_backend_load_best: failed to load /Applications/Ollama.app/Contents/Resources/libggml-cpu-skylakex.so
time=2025-04-08T17:49:15.165+01:00 level=INFO source=ggml.go:109 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 compiler=cgo(clang)
time=2025-04-08T17:49:15.299+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
time=2025-04-08T17:49:15.306+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=Metal size="3.1 GiB"
time=2025-04-08T17:49:15.307+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="525.0 MiB"
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M2
ggml_metal_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction   = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets    = false
ggml_metal_init: has bfloat            = true
ggml_metal_init: use bfloat            = false
ggml_metal_init: hasUnifiedMemory      = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 11453.25 MB
ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256      (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16                     (not supported)
time=2025-04-08T17:49:18.840+01:00 level=INFO source=ggml.go:388 msg="compute graph" backend=Metal buffer_type=Metal
time=2025-04-08T17:49:18.840+01:00 level=INFO source=ggml.go:388 msg="compute graph" backend=CPU buffer_type=CPU
time=2025-04-08T17:49:18.853+01:00 level=WARN source=ggml.go:152 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
time=2025-04-08T17:49:18.861+01:00 level=WARN source=ggml.go:152 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
time=2025-04-08T17:49:18.861+01:00 level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.local.freq_base default=10000
time=2025-04-08T17:49:18.861+01:00 level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
time=2025-04-08T17:49:18.861+01:00 level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.freq_scale default=1
time=2025-04-08T17:49:18.861+01:00 level=WARN source=ggml.go:152 msg="key not found" key=gemma3.mm_tokens_per_image default=256
time=2025-04-08T17:49:19.069+01:00 level=INFO source=server.go:619 msg="llama runner started in 4.02 seconds"
ggml_metal_graph_compute: command buffer 1 failed with status 5
error: Internal Error (0000000e:Internal Error)
[GIN] 2025/04/08 - 17:49:51 | 200 | 36.334630667s |       127.0.0.1 | POST     "/api/generate"
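
If memory pressure is indeed the culprit (the load above reports 5.8 GiB required with only 3.5 GiB of free system memory), one cheap experiment is to request a smaller context so the KV cache shrinks. A hedged sketch; `num_ctx` is a standard Ollama request option, and 2048 is just a guess for a 16 GB machine:

```python
import ollama

resp = ollama.generate(
    model="gemma3:4b",
    prompt="what do you see here?",
    images=["./test.png"],       # path from the comment above
    options={"num_ctx": 2048},   # the failing load above used --ctx-size 8192
)
print(resp["response"])
```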

@aaricantto commented on GitHub (Apr 13, 2025):

Any updates on this? I've tried a couple of things, and it's only the multimodal models that aren't playing nice.

# ollama -v
ollama version is 0.6.5

Mine won't even load into the GPU

[GIN] 2025/04/13 - 11:17:42 | 200 |   30.431149ms |       127.0.0.1 | POST     "/api/show"
time=2025-04-13T11:17:43.227Z level=INFO source=sched.go:716 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc gpu=GPU-f96d9f70-dd21-fdd1-d520-324d04f9966c parallel=10 available=50646548480 required="33.9 GiB"
time=2025-04-13T11:17:43.352Z level=INFO source=server.go:105 msg="system memory" total="124.8 GiB" free="121.9 GiB" free_swap="0 B"
time=2025-04-13T11:17:43.482Z level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=41 layers.offload=41 layers.split="" memory.available="[47.2 GiB]" memory.gpu_overhead="0 B" memory.required.full="33.9 GiB" memory.required.partial="33.9 GiB" memory.required.kv="6.2 GiB" memory.required.allocations="[33.9 GiB]" memory.weights.total="13.1 GiB" memory.weights.repeating="12.7 GiB" memory.weights.nonrepeating="360.0 MiB" memory.graph.full="4.2 GiB" memory.graph.partial="4.2 GiB" projector.weights="769.3 MiB" projector.graph="8.8 GiB"
time=2025-04-13T11:17:43.482Z level=INFO source=server.go:185 msg="enabling flash attention"
time=2025-04-13T11:17:43.482Z level=WARN source=server.go:193 msg="kv cache type not supported by model" type=""
time=2025-04-13T11:17:43.514Z level=WARN source=ggml.go:152 msg="key not found" key=tokenizer.ggml.pretokenizer default="[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-04-13T11:17:43.520Z level=WARN source=ggml.go:152 msg="key not found" key=mistral3.rope.freq_scale default=1
time=2025-04-13T11:17:43.520Z level=WARN source=ggml.go:152 msg="key not found" key=mistral3.vision.attention.layer_norm_epsilon default=9.999999747378752e-06
time=2025-04-13T11:17:43.520Z level=WARN source=ggml.go:152 msg="key not found" key=mistral3.vision.longest_edge default=1540
time=2025-04-13T11:17:43.520Z level=WARN source=ggml.go:152 msg="key not found" key=mistral3.text_config.rms_norm_eps default=9.999999747378752e-06
time=2025-04-13T11:17:43.521Z level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc --ctx-size 40960 --batch-size 512 --n-gpu-layers 41 --threads 12 --flash-attn --parallel 10 --port 33905"
time=2025-04-13T11:17:43.521Z level=INFO source=sched.go:451 msg="loaded runners" count=1
time=2025-04-13T11:17:43.521Z level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
time=2025-04-13T11:17:43.521Z level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
time=2025-04-13T11:17:43.529Z level=INFO source=runner.go:816 msg="starting ollama engine"
time=2025-04-13T11:17:43.529Z level=INFO source=runner.go:879 msg="Server listening on 127.0.0.1:33905"
time=2025-04-13T11:17:43.564Z level=WARN source=ggml.go:152 msg="key not found" key=general.name default=""
time=2025-04-13T11:17:43.564Z level=WARN source=ggml.go:152 msg="key not found" key=general.description default=""
time=2025-04-13T11:17:43.564Z level=INFO source=ggml.go:67 msg="" architecture=mistral3 file_type=Q4_K_M name="" description="" num_tensors=585 num_key_values=43
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v12/libggml-cuda.so
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-icelake.so
time=2025-04-13T11:17:43.612Z level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-04-13T11:17:43.772Z level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
time=2025-04-13T11:17:43.845Z level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA0 size="13.9 GiB"
time=2025-04-13T11:17:43.845Z level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="525.0 MiB"

From here, while the model is still "waiting for the server to become available", I run `ollama ps` and see 100% GPU utilization?

# ollama ps
NAME                    ID              SIZE     PROCESSOR    UNTIL
mistral-small3.1:24b    b9aaf0c2586a    36 GB    100% GPU     4 minutes from now

If I kill the terminal I then see this!

time=2025-04-13T11:20:45.204Z level=WARN source=server.go:587 msg="client connection closed before server finished loading, aborting load"
time=2025-04-13T11:20:45.204Z level=ERROR source=sched.go:457 msg="error loading llama server" error="timed out waiting for llama runner to start: context canceled"
[GIN] 2025/04/13 - 11:20:45 | 499 |  7.280213019s |       127.0.0.1 | POST     "/api/generate"
time=2025-04-13T11:20:50.213Z level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.008396224 model=/root/.ollama/models/blobs/sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc
time=2025-04-13T11:20:50.434Z level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.229436417 model=/root/.ollama/models/blobs/sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc
time=2025-04-13T11:20:50.684Z level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.479803228 model=/root/.ollama/models/blobs/sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc

@rick-github commented on GitHub (Apr 13, 2025):

time=2025-04-13T11:20:45.204Z level=WARN source=server.go:587 msg="client connection closed before server finished loading, aborting load"

The client closed the connection before the model had finished loading. Since the bit of the log between 11:17:43 and 11:20:45 wasn't posted, it's a bit hard to see what's going on, but I suspect that either the load hadn't completed, or the client sent a request that changed the model parameters (e.g. a different context size), causing a model eviction and reload. As the client only waited 7.28 seconds before closing, the model hadn't finished loading, so the load was aborted.
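
For scripted clients, one workaround for slow cold loads is simply a longer client-side timeout, so the load isn't aborted early. A hedged sketch with the `ollama` Python client, which forwards keyword arguments such as `timeout` to the underlying httpx client; the 900-second figure is arbitrary:

```python
import ollama

# Generous timeout so a slow cold load isn't aborted client-side.
client = ollama.Client(host="http://127.0.0.1:11434", timeout=900)

resp = client.generate(model="mistral-small3.1:24b", prompt="hello")
print(resp["response"])
```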


@aaricantto commented on GitHub (Apr 13, 2025):

Sorry for cutting it off early; I did so to reproduce the error. Even at 14 minutes, you can see the model does not load! mistral-small:24b takes seconds to load into the GPU. Could there be an issue with my docker-compose configuration?

GPU

RTX A6000 - 48 GB VRAM

My Errors

Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v12/libggml-cuda.so
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-icelake.so
time=2025-04-13T13:41:07.726Z level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-04-13T13:41:07.840Z level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA0 size="13.9 GiB"
time=2025-04-13T13:41:07.840Z level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="525.0 MiB"
time=2025-04-13T13:41:07.848Z level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"

Waiting forever...

time=2025-04-13T13:55:33.232Z level=ERROR source=sched.go:457 msg="error loading llama server" error="timed out waiting for llama runner to start: context canceled"
[GIN] 2025/04/13 - 13:55:33 | 499 |        14m27s |       127.0.0.1 | POST     "/api/generate"
time=2025-04-13T13:55:38.297Z level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.064602677 model=/root/.ollama/models/blobs/sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc
time=2025-04-13T13:55:38.516Z level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.283946761 model=/root/.ollama/models/blobs/sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc
time=2025-04-13T13:55:38.767Z level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.534428788 model=/root/.ollama/models/blobs/sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc

docker-compose.yml

services:
  ollama:
    image: ollama/ollama
    runtime: nvidia  # Ensure NVIDIA runtime is set
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - OLLAMA_FLASH_ATTENTION=true
      - NVIDIA_VISIBLE_DEVICES=all
      - OLLAMA_MAX_QUEUE=10   # Max queued requests before the server rejects new ones
      - OLLAMA_MAX_LOADED_MODELS=5    # Max number of models loaded concurrently
      - OLLAMA_NUM_PARALLEL=10    # Parallel requests per model; multiplies KV cache size
    ports:
      - "11434:11434"
    volumes:
      # Mount the NFS volume to a specific subdirectory
      - ollama_data:/root/.ollama
    networks:
      - icubed-network
    pull_policy: always  # Always pull the latest image


networks:
  icubed-network:
    external: true

volumes:
  ollama_data:
    driver: local
    driver_opts:
      type: "nfs"
      o: "addr=nfs.icubed.com.au,rw,nolock,soft"
      device: ":/Dev/ollama"  # This must be created first on the NFS server

EDIT

Client experiences similar issues with gemma3 vision models when running ollama in docker.


@rick-github commented on GitHub (Apr 13, 2025):

You're loading off a network device. What's the bandwidth of your connection to the NFS server? What does the following show:

docker exec -it ollama dd if=/root/.ollama/models/blobs/sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc of=/dev/null
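
Roughly the same measurement in Python, for anyone without `dd` handy in the container (a sketch; run it inside the container so the path resolves, and note that a warm page cache will inflate the number):

```python
import time

# Model blob path from the logs above.
PATH = "/root/.ollama/models/blobs/sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc"

CHUNK = 8 * 1024 * 1024  # read in 8 MiB chunks
total = 0
start = time.monotonic()
with open(PATH, "rb") as f:
    while chunk := f.read(CHUNK):
        total += len(chunk)
elapsed = time.monotonic() - start
print(f"read {total / 2**30:.2f} GiB in {elapsed:.1f}s ({total / 2**20 / elapsed:.0f} MiB/s)")
```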

@aaricantto commented on GitHub (Apr 13, 2025):

NFS was not the problem; for some reason, running the `pull` command managed to fix this!

I'm not sure why, though, so I'll leave the thread here in case others have the same problem. This fixed both the mistral and gemma vision models.

root@d234b833a1d2:/# ollama pull mistral-small3.1:24b
pulling manifest
pulling 1fa8532d986d... 100% ▕██████████████████████▏  15 GB
pulling 6db27cd4e277... 100% ▕██████████████████████▏  695 B
pulling 70a4dab5e1d1... 100% ▕██████████████████████▏ 1.5 KB
pulling a00920c28dfd... 100% ▕██████████████████████▏   17 B
pulling 9b6ac0d4e97e... 100% ▕██████████████████████▏  494 B
verifying sha256 digest
writing manifest
success
root@d234b833a1d2:/# ollama run mistral-small3.1:24b
>>> hello how are you?
Hello! I'm functioning well, thank you. How can I assist you today?

PS: NFS bandwidth is 500 Mbps

PPS: Feel like an idiot for not trying this 2 hours ago


@rick-github commented on GitHub (Apr 13, 2025):

It's likely that the model is now in page cache and so is not reliant on pulling from the NFS server. If the model is evicted from the page cache, you may experience these loading problems again.
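
On that theory, streaming the blob once before the next load pulls it back into the page cache. A hedged sketch; it only helps while the host has enough free RAM to keep the file cached:

```python
# Pre-warm the page cache so the next `ollama run` doesn't stall on NFS reads.
PATH = "/root/.ollama/models/blobs/sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc"

with open(PATH, "rb") as f:
    while f.read(8 * 1024 * 1024):
        pass
```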


@aaricantto commented on GitHub (Apr 13, 2025):

Ahhhh I understand what you're saying now - it's a problem with my docker-swarm setup... I need to dedicate machines / docker volumes for LLM workloads! Thanks!


Reference: github-starred/ollama#32433