[GH-ISSUE #10170] Multimodal broken in 6.5? #32433

Closed
opened 2026-04-22 13:43:08 -05:00 by GiteaMirror · 13 comments

Originally created by @hherb on GitHub (Apr 7, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10170

What is the issue?

Since the update to 6.5, ollama does not seem to process images any more, regardless of whether I pass them as context in my python app (either as an image path or a base64 string) or via the CLI (`ollama run <model> "describe the image <image path>"`). It ran fine with 6.4, and it seems to happen regardless of which model is used (e.g. llama3.2-vision, Gemma 3, ...).
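
For reference, the API only sees an image if it arrives in the request's `images` field; a path embedded in the prompt text is only picked up when the CLI detects and attaches it (the "Added image" message discussed below). A minimal sketch of both ways of passing an image with the official `ollama` Python client, assuming a local server; the model name and file path are illustrative:

```python
import base64

import ollama  # pip install ollama

IMAGE_PATH = "/Users/hherb/icons/cartoon.png"  # illustrative path from this report

# Variant 1: pass a file path; the client reads and encodes it for you.
resp = ollama.chat(
    model="gemma3:4b",
    messages=[{"role": "user", "content": "Describe the image.", "images": [IMAGE_PATH]}],
)
print(resp["message"]["content"])

# Variant 2: pass a base64 string, which is what the REST API itself expects.
with open(IMAGE_PATH, "rb") as f:
    b64 = base64.b64encode(f.read()).decode()
resp = ollama.chat(
    model="gemma3:4b",
    messages=[{"role": "user", "content": "Describe the image.", "images": [b64]}],
)
print(resp["message"]["content"])
```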

Relevant log output

ollama run mistral-small3.1:24b-instruct-2503-q8_0 "describe the image /Users/hherb/icons/cartoon.png"   
I'm unable to directly access or view images, including the one you 
mentioned at the path "/Users/hherb/icons/cartoon.png". However, 
if you can describe the image to me or provide details about it, I'd be 
happy to help with any information or analysis based on your description!

OS

MacOS

GPU

M3 Max

CPU

M3 Max

Ollama version

6.5

GiteaMirror added the bug label 2026-04-22 13:43:09 -05:00

@rick-github commented on GitHub (Apr 7, 2025):

The CLI didn't say 'Added image', so it's not sending the image to the server. Are your CLI and server the same version? What's the output of `ollama -v`?

For example:

$ ollama -v
ollama version is 0.6.5
$ ollama run mistral-small3.1:24b-instruct-2503-q8_0 "describe the image /home/rick/puppy.png"
Added image '/home/rick/puppy.png'
The image depicts a small, fluffy white puppy sitting on a concrete surface, likely a step or a curb.
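
When a CLI/server version mismatch is suspected, the server's version can also be checked directly over the API. A hedged sketch (the `/api/version` endpoint and default port are Ollama's standard; error handling omitted):

```python
import json
import subprocess
import urllib.request

# Version reported by the `ollama` CLI binary on PATH.
cli = subprocess.run(["ollama", "-v"], capture_output=True, text=True).stdout.strip()

# Version reported by the running server via its /api/version endpoint.
with urllib.request.urlopen("http://127.0.0.1:11434/api/version") as r:
    server = json.load(r)["version"]

print(f"CLI:    {cli}")
print(f"server: {server}")
```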

@jmmcd commented on GitHub (Apr 8, 2025):

I'm getting weird behaviour too:

$ ollama run gemma3:4b "what do you see here? ./test.png"
Added image './test.png'
You are absolutely right! My apologies. I misread the image.

The image shows a **group of adorable cartoon animals - a cat, a fox, an owl, an otter,^C

[Voiceover: no, it doesn't]

$ ollama -v
ollama version is 0.6.5

@rick-github commented on GitHub (Apr 8, 2025):

Can you add test.png here?


@jmmcd commented on GitHub (Apr 8, 2025):

It's a dog from a Google image search.

[Image: https://github.com/user-attachments/assets/2b197cd3-e882-4342-a253-d0aca683ad4a]

@rick-github commented on GitHub (Apr 8, 2025):

$ ollama:0.6.5 run gemma3:4b "what do you see here? ./test.png"
Added image './test.png'
Here's what I see in the image:

*   **A German Shepherd Dog:** A beautiful, classic-looking German Shepherd is the main subject. It's standing on a rocky hillside, with its mouth open as if panting.
*   **Mountainous Landscape:** Behind the dog, there's a dramatic mountain range with snow-capped peaks. The sky is partly cloudy.
*   **Rocky Terrain:** The dog is standing on a rugged, rocky hillside.
*   **Green Vegetation:** There's some green vegetation at the base of the hillside.

It looks like a stunning outdoor shot!

What hardware do you have? [Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in diagnosis.


@jmmcd commented on GitHub (Apr 8, 2025):

Ah, ok. The same model `gemma3:4b` is also giving bad output even when dealing with pure text. `gemma3:1b` (which is text-only) runs fine, so I assumed it was a vision issue, but it's not; apologies.

I'm on a Mac M2. Yes, there is some error in my logs, but only for the `4b` model. Maybe it's too big for my machine.

ggml_metal_graph_compute: command buffer 1 failed with status 5
error: Internal Error (0000000e:Internal Error)

(Full log below)

[GIN] 2025/04/08 - 17:49:14 | 200 |     246.042µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/04/08 - 17:49:14 | 200 |  112.256583ms |       127.0.0.1 | POST     "/api/show"
time=2025-04-08T17:49:14.939+01:00 level=INFO source=sched.go:716 msg="new model will fit in available VRAM in single GPU, loading" model=/Users/jmmcd/.ollama/models/blobs/sha256-aeda25e63ebd698fab8638ffb778e68bed908b960d39d0becc650fa981609d25 gpu=0 parallel=4 available=11453251584 required="5.8 GiB"
time=2025-04-08T17:49:14.940+01:00 level=INFO source=server.go:105 msg="system memory" total="16.0 GiB" free="3.5 GiB" free_swap="0 B"
time=2025-04-08T17:49:14.942+01:00 level=INFO source=server.go:138 msg=offload library=metal layers.requested=-1 layers.model=35 layers.offload=35 layers.split="" memory.available="[10.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="5.8 GiB" memory.required.partial="5.8 GiB" memory.required.kv="682.0 MiB" memory.required.allocations="[5.8 GiB]" memory.weights.total="2.3 GiB" memory.weights.repeating="1.8 GiB" memory.weights.nonrepeating="525.0 MiB" memory.graph.full="517.0 MiB" memory.graph.partial="517.0 MiB" projector.weights="795.9 MiB" projector.graph="1.0 GiB"
time=2025-04-08T17:49:15.037+01:00 level=WARN source=ggml.go:152 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
time=2025-04-08T17:49:15.044+01:00 level=WARN source=ggml.go:152 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
time=2025-04-08T17:49:15.044+01:00 level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.local.freq_base default=10000
time=2025-04-08T17:49:15.044+01:00 level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
time=2025-04-08T17:49:15.044+01:00 level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.freq_scale default=1
time=2025-04-08T17:49:15.044+01:00 level=WARN source=ggml.go:152 msg="key not found" key=gemma3.mm_tokens_per_image default=256
time=2025-04-08T17:49:15.045+01:00 level=INFO source=server.go:405 msg="starting llama server" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --ollama-engine --model /Users/jmmcd/.ollama/models/blobs/sha256-aeda25e63ebd698fab8638ffb778e68bed908b960d39d0becc650fa981609d25 --ctx-size 8192 --batch-size 512 --n-gpu-layers 35 --threads 4 --parallel 4 --port 64976"
time=2025-04-08T17:49:15.047+01:00 level=INFO source=sched.go:451 msg="loaded runners" count=1
time=2025-04-08T17:49:15.047+01:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
time=2025-04-08T17:49:15.048+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
time=2025-04-08T17:49:15.062+01:00 level=INFO source=runner.go:816 msg="starting ollama engine"
time=2025-04-08T17:49:15.063+01:00 level=INFO source=runner.go:879 msg="Server listening on 127.0.0.1:64976"
time=2025-04-08T17:49:15.160+01:00 level=WARN source=ggml.go:152 msg="key not found" key=general.name default=""
time=2025-04-08T17:49:15.160+01:00 level=WARN source=ggml.go:152 msg="key not found" key=general.description default=""
time=2025-04-08T17:49:15.160+01:00 level=INFO source=ggml.go:67 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=883 num_key_values=36
ggml_backend_load_best: failed to load /Applications/Ollama.app/Contents/Resources/libggml-cpu-icelake.so
ggml_backend_load_best: failed to load /Applications/Ollama.app/Contents/Resources/libggml-cpu-haswell.so
ggml_backend_load_best: failed to load /Applications/Ollama.app/Contents/Resources/libggml-cpu-alderlake.so
ggml_backend_load_best: failed to load /Applications/Ollama.app/Contents/Resources/libggml-cpu-sandybridge.so
ggml_backend_load_best: failed to load /Applications/Ollama.app/Contents/Resources/libggml-cpu-skylakex.so
time=2025-04-08T17:49:15.165+01:00 level=INFO source=ggml.go:109 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 compiler=cgo(clang)
time=2025-04-08T17:49:15.299+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
time=2025-04-08T17:49:15.306+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=Metal size="3.1 GiB"
time=2025-04-08T17:49:15.307+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="525.0 MiB"
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M2
ggml_metal_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction   = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets    = false
ggml_metal_init: has bfloat            = true
ggml_metal_init: use bfloat            = false
ggml_metal_init: hasUnifiedMemory      = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 11453.25 MB
ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256      (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16                     (not supported)
time=2025-04-08T17:49:18.840+01:00 level=INFO source=ggml.go:388 msg="compute graph" backend=Metal buffer_type=Metal
time=2025-04-08T17:49:18.840+01:00 level=INFO source=ggml.go:388 msg="compute graph" backend=CPU buffer_type=CPU
time=2025-04-08T17:49:18.853+01:00 level=WARN source=ggml.go:152 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
time=2025-04-08T17:49:18.861+01:00 level=WARN source=ggml.go:152 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
time=2025-04-08T17:49:18.861+01:00 level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.local.freq_base default=10000
time=2025-04-08T17:49:18.861+01:00 level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
time=2025-04-08T17:49:18.861+01:00 level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.freq_scale default=1
time=2025-04-08T17:49:18.861+01:00 level=WARN source=ggml.go:152 msg="key not found" key=gemma3.mm_tokens_per_image default=256
time=2025-04-08T17:49:19.069+01:00 level=INFO source=server.go:619 msg="llama runner started in 4.02 seconds"
ggml_metal_graph_compute: command buffer 1 failed with status 5
error: Internal Error (0000000e:Internal Error)
[GIN] 2025/04/08 - 17:49:51 | 200 | 36.334630667s |       127.0.0.1 | POST     "/api/generate"
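
If memory pressure is indeed the culprit (the load above reports 5.8 GiB required with only 3.5 GiB of free system memory), one cheap experiment is to request a smaller context so the KV cache shrinks. A hedged sketch; `num_ctx` is a standard Ollama request option, and 2048 is just a guess for a 16 GB machine:

```python
import ollama

resp = ollama.generate(
    model="gemma3:4b",
    prompt="what do you see here?",
    images=["./test.png"],       # path from the comment above
    options={"num_ctx": 2048},   # the failing load above used --ctx-size 8192
)
print(resp["response"])
```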

@aaricantto commented on GitHub (Apr 13, 2025):

Any updates on this? I've tried a couple of things, and it's only the multimodal models that aren't playing nice.

# ollama -v
ollama version is 0.6.5

Mine won't even load into the GPU

[GIN] 2025/04/13 - 11:17:42 | 200 |   30.431149ms |       127.0.0.1 | POST     "/api/show"
time=2025-04-13T11:17:43.227Z level=INFO source=sched.go:716 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc gpu=GPU-f96d9f70-dd21-fdd1-d520-324d04f9966c parallel=10 available=50646548480 required="33.9 GiB"
time=2025-04-13T11:17:43.352Z level=INFO source=server.go:105 msg="system memory" total="124.8 GiB" free="121.9 GiB" free_swap="0 B"
time=2025-04-13T11:17:43.482Z level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=41 layers.offload=41 layers.split="" memory.available="[47.2 GiB]" memory.gpu_overhead="0 B" memory.required.full="33.9 GiB" memory.required.partial="33.9 GiB" memory.required.kv="6.2 GiB" memory.required.allocations="[33.9 GiB]" memory.weights.total="13.1 GiB" memory.weights.repeating="12.7 GiB" memory.weights.nonrepeating="360.0 MiB" memory.graph.full="4.2 GiB" memory.graph.partial="4.2 GiB" projector.weights="769.3 MiB" projector.graph="8.8 GiB"
time=2025-04-13T11:17:43.482Z level=INFO source=server.go:185 msg="enabling flash attention"
time=2025-04-13T11:17:43.482Z level=WARN source=server.go:193 msg="kv cache type not supported by model" type=""
time=2025-04-13T11:17:43.514Z level=WARN source=ggml.go:152 msg="key not found" key=tokenizer.ggml.pretokenizer default="[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-04-13T11:17:43.520Z level=WARN source=ggml.go:152 msg="key not found" key=mistral3.rope.freq_scale default=1
time=2025-04-13T11:17:43.520Z level=WARN source=ggml.go:152 msg="key not found" key=mistral3.vision.attention.layer_norm_epsilon default=9.999999747378752e-06
time=2025-04-13T11:17:43.520Z level=WARN source=ggml.go:152 msg="key not found" key=mistral3.vision.longest_edge default=1540
time=2025-04-13T11:17:43.520Z level=WARN source=ggml.go:152 msg="key not found" key=mistral3.text_config.rms_norm_eps default=9.999999747378752e-06
time=2025-04-13T11:17:43.521Z level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc --ctx-size 40960 --batch-size 512 --n-gpu-layers 41 --threads 12 --flash-attn --parallel 10 --port 33905"
time=2025-04-13T11:17:43.521Z level=INFO source=sched.go:451 msg="loaded runners" count=1
time=2025-04-13T11:17:43.521Z level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
time=2025-04-13T11:17:43.521Z level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
time=2025-04-13T11:17:43.529Z level=INFO source=runner.go:816 msg="starting ollama engine"
time=2025-04-13T11:17:43.529Z level=INFO source=runner.go:879 msg="Server listening on 127.0.0.1:33905"
time=2025-04-13T11:17:43.564Z level=WARN source=ggml.go:152 msg="key not found" key=general.name default=""
time=2025-04-13T11:17:43.564Z level=WARN source=ggml.go:152 msg="key not found" key=general.description default=""
time=2025-04-13T11:17:43.564Z level=INFO source=ggml.go:67 msg="" architecture=mistral3 file_type=Q4_K_M name="" description="" num_tensors=585 num_key_values=43
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v12/libggml-cuda.so
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-icelake.so
time=2025-04-13T11:17:43.612Z level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-04-13T11:17:43.772Z level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
time=2025-04-13T11:17:43.845Z level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA0 size="13.9 GiB"
time=2025-04-13T11:17:43.845Z level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="525.0 MiB"

From here, while the model is still "waiting for the server to become available", I run `ollama ps` and see 100% GPU utilization?

# ollama ps
NAME                    ID              SIZE     PROCESSOR    UNTIL
mistral-small3.1:24b    b9aaf0c2586a    36 GB    100% GPU     4 minutes from now

If I kill the terminal I then see this!

time=2025-04-13T11:20:45.204Z level=WARN source=server.go:587 msg="client connection closed before server finished loading, aborting load"
time=2025-04-13T11:20:45.204Z level=ERROR source=sched.go:457 msg="error loading llama server" error="timed out waiting for llama runner to start: context canceled"
[GIN] 2025/04/13 - 11:20:45 | 499 |  7.280213019s |       127.0.0.1 | POST     "/api/generate"
time=2025-04-13T11:20:50.213Z level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.008396224 model=/root/.ollama/models/blobs/sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc
time=2025-04-13T11:20:50.434Z level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.229436417 model=/root/.ollama/models/blobs/sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc
time=2025-04-13T11:20:50.684Z level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.479803228 model=/root/.ollama/models/blobs/sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc

@rick-github commented on GitHub (Apr 13, 2025):

time=2025-04-13T11:20:45.204Z level=WARN source=server.go:587 msg="client connection closed before server finished loading, aborting load"

The client closed the connection before the model had finished loading. Since the bit of the log between 11:17:43 and 11:20:45 wasn't posted, it's a bit hard to see what's going on, but I suspect that either the load hadn't completed, or the client sent a request that changed the model parameters (e.g. a different context size), causing a model eviction and reload. As the client only waited 7.28 seconds before closing, the model hadn't finished loading, so the load was aborted.
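
For scripted clients, one workaround for slow cold loads is simply a longer client-side timeout, so the load isn't aborted early. A hedged sketch with the `ollama` Python client, which forwards keyword arguments such as `timeout` to the underlying httpx client; the 900-second figure is arbitrary:

```python
import ollama

# Generous timeout so a slow cold load isn't aborted client-side.
client = ollama.Client(host="http://127.0.0.1:11434", timeout=900)

resp = client.generate(model="mistral-small3.1:24b", prompt="hello")
print(resp["response"])
```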


@aaricantto commented on GitHub (Apr 13, 2025):

Sorry for cutting it off early; I did so to reproduce the error. Even at 14 minutes, you can see the model does not load! mistral-small:24b takes seconds to load into the GPU. Could there be an issue with my docker-compose configuration?

GPU

RTX A6000 - 48 GB VRAM

My Errors

Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v12/libggml-cuda.so
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-icelake.so
time=2025-04-13T13:41:07.726Z level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-04-13T13:41:07.840Z level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA0 size="13.9 GiB"
time=2025-04-13T13:41:07.840Z level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="525.0 MiB"
time=2025-04-13T13:41:07.848Z level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"

Waiting forever...

time=2025-04-13T13:55:33.232Z level=ERROR source=sched.go:457 msg="error loading llama server" error="timed out waiting for llama runner to start: context canceled"
[GIN] 2025/04/13 - 13:55:33 | 499 |        14m27s |       127.0.0.1 | POST     "/api/generate"
time=2025-04-13T13:55:38.297Z level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.064602677 model=/root/.ollama/models/blobs/sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc
time=2025-04-13T13:55:38.516Z level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.283946761 model=/root/.ollama/models/blobs/sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc
time=2025-04-13T13:55:38.767Z level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.534428788 model=/root/.ollama/models/blobs/sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc

docker-compose.yml

services:
  ollama:
    image: ollama/ollama
    runtime: nvidia  # Ensure NVIDIA runtime is set
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - OLLAMA_FLASH_ATTENTION=true
      - NVIDIA_VISIBLE_DEVICES=all
      - OLLAMA_MAX_QUEUE=10   # Max queued requests before the server rejects new ones
      - OLLAMA_MAX_LOADED_MODELS=5    # Max number of models loaded concurrently
      - OLLAMA_NUM_PARALLEL=10    # Parallel requests per model; multiplies KV cache size
    ports:
      - "11434:11434"
    volumes:
      # Mount the NFS volume to a specific subdirectory
      - ollama_data:/root/.ollama
    networks:
      - icubed-network
    pull_policy: always  # Always pull the latest image


networks:
  icubed-network:
    external: true

volumes:
  ollama_data:
    driver: local
    driver_opts:
      type: "nfs"
      o: "addr=nfs.icubed.com.au,rw,nolock,soft"
      device: ":/Dev/ollama"  # This must be created first on the NFS server

EDIT

Client experiences similar issues with gemma3 vision models when running ollama in docker.


@rick-github commented on GitHub (Apr 13, 2025):

You're loading off a network device. What's the bandwidth of your connection to the NFS server? What does the following show:

docker exec -it ollama dd if=/root/.ollama/models/blobs/sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc of=/dev/null
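
Roughly the same measurement in Python, for anyone without `dd` handy in the container (a sketch; run it inside the container so the path resolves, and note that a warm page cache will inflate the number):

```python
import time

# Model blob path from the logs above.
PATH = "/root/.ollama/models/blobs/sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc"

CHUNK = 8 * 1024 * 1024  # read in 8 MiB chunks
total = 0
start = time.monotonic()
with open(PATH, "rb") as f:
    while chunk := f.read(CHUNK):
        total += len(chunk)
elapsed = time.monotonic() - start
print(f"read {total / 2**30:.2f} GiB in {elapsed:.1f}s ({total / 2**20 / elapsed:.0f} MiB/s)")
```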

@aaricantto commented on GitHub (Apr 13, 2025):

NFS was not the problem; for some reason, running the `pull` command managed to fix this!

I'm not sure why, though, so I'll leave the thread here in case others have the same problem. This fixed both the mistral and gemma vision models.

root@d234b833a1d2:/# ollama pull mistral-small3.1:24b
pulling manifest
pulling 1fa8532d986d... 100% ▕██████████████████████▏  15 GB
pulling 6db27cd4e277... 100% ▕██████████████████████▏  695 B
pulling 70a4dab5e1d1... 100% ▕██████████████████████▏ 1.5 KB
pulling a00920c28dfd... 100% ▕██████████████████████▏   17 B
pulling 9b6ac0d4e97e... 100% ▕██████████████████████▏  494 B
verifying sha256 digest
writing manifest
success
root@d234b833a1d2:/# ollama run mistral-small3.1:24b
>>> hello how are you?
Hello! I'm functioning well, thank you. How can I assist you today?

PS: NFS bandwidth is 500 Mbps

PPS: Feel like an idiot for not trying this 2 hours ago


@rick-github commented on GitHub (Apr 13, 2025):

It's likely that the model is now in page cache and so is not reliant on pulling from the NFS server. If the model is evicted from the page cache, you may experience these loading problems again.
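
On that theory, streaming the blob once before the next load pulls it back into the page cache. A hedged sketch; it only helps while the host has enough free RAM to keep the file cached:

```python
# Pre-warm the page cache so the next `ollama run` doesn't stall on NFS reads.
PATH = "/root/.ollama/models/blobs/sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc"

with open(PATH, "rb") as f:
    while f.read(8 * 1024 * 1024):
        pass
```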


@aaricantto commented on GitHub (Apr 13, 2025):

Ahhhh I understand what you're saying now - it's a problem with my docker-swarm setup... I need to dedicate machines / docker volumes for LLM workloads! Thanks!


Reference: github-starred/ollama#32433