[GH-ISSUE #13687] qwen2.5vl:3b no longer runs on 8GB GPUs since Ollama 0.13.4 due to compute graph memory estimation #8986

Closed
opened 2026-04-12 21:49:02 -05:00 by GiteaMirror · 2 comments
Owner

Originally created by @dav-ctrl on GitHub (Jan 12, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/13687

What is the issue?

qwen2.5vl:3b no longer runs on 8GB GPUs since Ollama 0.13.4 due to compute graph memory estimation

Summary

Starting with Ollama 0.13.4, the model qwen2.5vl:3b no longer runs on the GPU of an 8 GB VRAM machine, despite working correctly in Ollama 0.13.3 with an identical configuration.

The issue appears to be caused by a significant change in compute graph construction and memory estimation, which inflates the estimated compute graph memory from ~1.8 GiB to ~6.7 GiB before any layers are offloaded, causing Ollama to fall back entirely to CPU.

This occurs even when Flash Attention is explicitly disabled.


Environment

  • OS: Windows 11

  • GPU: NVIDIA RTX 4060 Laptop (8 GB VRAM)

  • CUDA compute capability: 8.9

  • Model: qwen2.5vl:3b

  • Quantization: Q4_K_M

  • Context length: 4096

  • Flash Attention: Disabled (OLLAMA_FLASH_ATTENTION=false)

  • Batch size: 512

  • KV cache: default
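
For reproduction, the configuration above maps onto Ollama's standard knobs roughly as follows. This is a sketch assuming a POSIX shell (on Windows, set the variables via System Properties or `$env:` in PowerShell); `OLLAMA_FLASH_ATTENTION=false` is quoted from the report, while using `OLLAMA_CONTEXT_LENGTH` for the 4096-token context is an assumption about how it was configured.

```shell
# Disable Flash Attention, as in the report
export OLLAMA_FLASH_ATTENTION=false

# Assumed: the 4096 context was set server-wide; it could equally have been
# passed per request via options.num_ctx
export OLLAMA_CONTEXT_LENGTH=4096

# Restart the server so the variables take effect, then run the model
ollama serve &
ollama run qwen2.5vl:3b "Describe a red square."
```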


Ollama Versions Tested

Version | Result
-- | --
0.13.3 | ✅ Fully runs on GPU
0.13.4+ (0.13.5 tested) | ❌ Falls back to CPU

Observed Behavior (Ollama 0.13.5)

When loading the model, Ollama estimates a large compute graph size and decides not to offload any layers to the GPU:

```
compute graph device=CPU size="6.7 GiB" model weights device=CPU size="3.2 GiB" total memory size="10.1 GiB" offloaded 0/37 layers to GPU
```

As a result, the model runs entirely on CPU.
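
As a quick check (not part of the original report), the placement can also be confirmed from the client side: `ollama ps` reports whether a loaded model is resident on CPU or GPU, and on Windows the quoted scheduler lines appear in the server log under `%LOCALAPPDATA%\Ollama`.

```shell
# Send any request so the model loads, then inspect placement;
# on 0.13.4+ the PROCESSOR column reportedly shows 100% CPU
ollama run qwen2.5vl:3b "hi"
ollama ps
```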


Expected Behavior (Ollama 0.13.3)

With the same model and configuration, Ollama successfully offloads all layers to the GPU:

```
offloaded 37/37 layers to GPU model weights device=CUDA0 size="3.0 GiB" kv cache device=CUDA0 size="144.0 MiB" compute graph device=CUDA0 size="1.8 GiB" total memory size="5.2 GiB"
```

The model runs normally on GPU.


Notes

  • The issue reproduces with Flash Attention disabled

  • Reducing context length does not change the behavior

  • Reducing batch size does not change the behavior

  • Limiting GPU layers does not change the behavior (a sketch of these per-request overrides follows this list)

  • The decision to avoid GPU offloading appears to be made before any layers are loaded

  • The regression is reproducible by comparing Ollama 0.13.3 vs 0.13.4+ on the same hardware
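
The mitigations listed above can be expressed as per-request options against the local API. `num_ctx`, `num_batch`, and `num_gpu` are standard Ollama request options, though the exact values below are illustrative rather than taken from the report.

```shell
# Hypothetical request combining the overrides the notes say were tried:
# a smaller context, a smaller batch, and a capped GPU layer count
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5vl:3b",
  "prompt": "hello",
  "options": {
    "num_ctx": 2048,
    "num_batch": 128,
    "num_gpu": 20
  }
}'
```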


Relevant log output

No response
OS

No response

GPU

No response

CPU

No response

Ollama version

No response

GiteaMirror added the bug label 2026-04-12 21:49:02 -05:00
Author
Owner

@rick-github commented on GitHub (Jan 12, 2026):

https://github.com/ollama/ollama/pull/13486

Author
Owner

@popkc3 commented on GitHub (Feb 18, 2026):

It’s still not working and the GPU isn’t being used (qwen2.5-vl:7b). Why was this issue closed?

Reference: github-starred/ollama#8986