[GH-ISSUE #10161] granite3.2-vision:latest or any other variants not running when trying to use image understanding capabilities #53180

Closed
opened 2026-04-29 02:15:45 -05:00 by GiteaMirror · 4 comments

Originally created by @Amaan1234567 on GitHub (Apr 7, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10161

[ollama_logs.txt](https://github.com/user-attachments/files/19630563/ollama_logs.txt)

What is the issue?

I am on a team building a chatbot app for KDE with features like terminal command execution and screen understanding, and we use ollama as the backend to run all the models. granite3.2 was released recently with both vision and tool support, so we decided to use it by default, but lately I have been having trouble running it when giving it images. It fails not just in our app but in the CLI as well; it just doesn't want to run. The following is the output I get when I try to use it:

```
>>> describe me this picture /home/amaan/Pictures/Screenshots/Screenshot_20250406_212126.png
Added image '/home/amaan/Pictures/Screenshots/Screenshot_20250406_212126.png'

unanswerable

>>> Send a message (/? for help)
```
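For completeness, the same request can be reproduced against the HTTP API directly; Ollama's `/api/generate` endpoint accepts base64-encoded images in an `images` array. A minimal sketch, assuming a default local install on port 11434 (the model tag and image path here are just this report's values, not a recommendation):

```bash
# Minimal sketch: send an image to a vision model via the Ollama HTTP API.
# Assumes ollama is serving on localhost:11434 and the model tag is pulled.
IMG_B64=$(base64 -w0 /home/amaan/Pictures/Screenshots/Screenshot_20250406_212126.png)

curl -s http://localhost:11434/api/generate -d "{
  \"model\": \"granite3.2-vision\",
  \"prompt\": \"Describe this picture.\",
  \"images\": [\"$IMG_B64\"],
  \"stream\": false
}"
```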

I thought the model was corrupted or something, so I nuked the ollama folders, installed again, and pulled the model again, but I still hit the same issue. The logs are attached below.

I did go through the logs and they do show a CUDA memory error, but I don't understand how it can face memory issues: it's a 2B model at int4 quantisation. I have been able to run minicpm multimodal models on my system without any issues at all, and those are roughly four times bigger than this model. I am hoping someone can help me out, because this one issue has been driving me crazy; somehow it only happens on my device. My friends who are also developing this have bigger GPUs, and I am guessing that's why they are not facing any issues.

Relevant log output

Attached as a txt file because it was too long to put here.

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.6.4

GiteaMirror added the bug label 2026-04-29 02:15:45 -05:00

@rick-github commented on GitHub (Apr 7, 2025):

```
Apr 07 10:18:52 192.168.1.12 ollama[15109]: time=2025-04-07T10:18:52.396+05:30 level=INFO source=server.go:138 msg=offload
 library=cuda layers.requested=-1 layers.model=35 layers.offload=4 layers.split="" memory.available="[3.6 GiB]"
 memory.gpu_overhead="0 B" memory.required.full="5.9 GiB" memory.required.partial="3.6 GiB" memory.required.kv="214.0 MiB"
 memory.required.allocations="[3.6 GiB]" memory.weights.total="2.3 GiB" memory.weights.repeating="1.8 GiB"
 memory.weights.nonrepeating="525.0 MiB" memory.graph.full="517.0 MiB" memory.graph.partial="1.0 GiB"
 projector.weights="795.9 MiB" projector.graph="1.0 GiB"
```

ollama is offloading 4 of 35 layers, using 3.6 GiB of the available 3.6 GiB, i.e. everything. A temporary allocation during inference is causing an OOM. When running minicpm, the different size of the layers means there's a little more free space in the VRAM.
See [here](https://github.com/ollama/ollama/issues/8597#issuecomment-2614533288) for ways to mitigate this.
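
To make one of those mitigations concrete: a minimal sketch, assuming a standard Linux install where ollama runs as the `ollama.service` systemd unit. Enabling flash attention shrinks the temporary compute buffers; the exact recipe in the linked comment may differ:

```bash
# Minimal sketch (assumes the systemd unit is named ollama.service):
# enable flash attention server-wide to shrink temporary compute buffers.
sudo systemctl edit ollama.service
# In the override file, add:
#   [Service]
#   Environment="OLLAMA_FLASH_ATTENTION=1"
sudo systemctl restart ollama.service
```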


@Amaan1234567 commented on GitHub (Apr 7, 2025):

Damn, that was a fast reply. I have a question: are these environment variables, or are they options I can set in the same category as num_ctx etc., which we can set when doing a curl to the ollama server?


@rick-github commented on GitHub (Apr 7, 2025):

`OLLAMA_GPU_OVERHEAD`, `OLLAMA_FLASH_ATTENTION`, `GGML_CUDA_ENABLE_UNIFIED_MEMORY`, `OLLAMA_NUM_PARALLEL` are environment variables. `num_gpu` is an API parameter like `num_ctx`. `num_ctx` is both, in that you can also set `OLLAMA_CONTEXT_LENGTH`.
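
To make the split concrete, here is a minimal sketch of both mechanisms; the model tag and numeric values are illustrative assumptions, not recommendations:

```bash
# Server-side: environment variables, set before the server starts.
export OLLAMA_FLASH_ATTENTION=1   # smaller temporary compute buffers
export OLLAMA_NUM_PARALLEL=1      # one request at a time, less KV cache
ollama serve &

# Per-request: API parameters go in the "options" object, like num_ctx.
# num_gpu=2 is an illustrative value; it caps how many layers go to the GPU.
curl -s http://localhost:11434/api/generate -d '{
  "model": "granite3.2-vision",
  "prompt": "hello",
  "options": { "num_ctx": 4096, "num_gpu": 2 },
  "stream": false
}'
```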


@Amaan1234567 commented on GitHub (Apr 7, 2025):

OK, thanks man, I appreciate the quick response. I was honestly preparing myself to wait days for an answer; I usually don't get this fast a response on other repos.
