[GH-ISSUE #11465] Conflict between: Consumer/Laptop VRAM Quantity, qwen2.5vl:7b-q4_K_M, OLLAMA_KV_CACHE_TYPE. #33328

Closed
opened 2026-04-22 15:54:11 -05:00 by GiteaMirror · 1 comment
Owner

Originally created by @mirage335 on GitHub (Jul 18, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11465

What is the issue?

Attempting to run qwen2.5vl:7b-q4_K_M as a vision encoder apparently works with OLLAMA_KV_CACHE_TYPE=f16 or OLLAMA_KV_CACHE_TYPE=q8_0, but not with OLLAMA_KV_CACHE_TYPE=q4_0.

Practically, when Ollama is started by systemd, Windows Startup, etc. (i.e. not a custom shell script) and used with OpenWebUI, this forces all Ollama models to run at q8_0 or greater KV cache quantization just to get the vision model working, reducing the usable size or context window of every other model. (Running a second Ollama server is not a good workaround either: the instances do not cooperate, so when both try to load a model they can contend for VRAM and push inference onto the CPU.)
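
For reference, the only knob available today is the global one. On a Linux install that uses the packaged systemd unit, a drop-in like the following (a minimal sketch, assuming the standard ollama.service name) applies the same KV cache type to every model the server loads:

```ini
# /etc/systemd/system/ollama.service.d/override.conf
# (created via: sudo systemctl edit ollama)
[Service]
# Applies to every model loaded by this server; there is no per-model setting.
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
# KV cache quantization only takes effect when flash attention is enabled.
Environment="OLLAMA_FLASH_ATTENTION=1"
```

The equivalent on Windows is a user or system environment variable, and it is just as global, which is exactly the problem described above.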

One obvious solution would be to provide some configuration option or environment variable, not necessarily in the Modelfile. Maybe a comma- and semicolon-delimited list of models and quantizations in an environment variable, such as:

OLLAMA_KV_CACHE_TYPE_SPECIAL=qwen2.5vl:7b-q4_K_M,q8_0:q4_0;

This would be similar to https://github.com/ollama/ollama/issues/10794, but here these would be optional per-model overrides applied in addition to the existing OLLAMA_KV_CACHE_TYPE variable.
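
To make the proposed syntax concrete, here is a minimal, hypothetical parsing sketch (not existing Ollama code), assuming one plausible reading of the example value: semicolon-separated entries, each pairing a model tag with the KV cache type to use, optionally followed by the global type it overrides:

```go
// Hypothetical parser for the proposed OLLAMA_KV_CACHE_TYPE_SPECIAL variable.
// Illustrative only; the syntax is read here as:
//   model,use[:insteadOf];model,use[:insteadOf];...
// e.g. "qwen2.5vl:7b-q4_K_M,q8_0:q4_0;" would mean "for qwen2.5vl:7b-q4_K_M,
// use a q8_0 KV cache where the global setting would otherwise be q4_0".
package main

import (
	"fmt"
	"os"
	"strings"
)

// kvOverride records the KV cache type to use for one model.
type kvOverride struct {
	Use       string // KV cache type to apply to this model
	InsteadOf string // optional: global type this entry overrides ("" = always)
}

func parseKVOverrides(raw string) map[string]kvOverride {
	overrides := map[string]kvOverride{}
	for _, entry := range strings.Split(raw, ";") {
		entry = strings.TrimSpace(entry)
		if entry == "" {
			continue
		}
		// The model tag itself contains colons (e.g. qwen2.5vl:7b-q4_K_M),
		// so the model is split from the override spec on the comma.
		model, spec, ok := strings.Cut(entry, ",")
		if !ok {
			continue // malformed entry, skip it
		}
		use, insteadOf, _ := strings.Cut(spec, ":")
		overrides[strings.TrimSpace(model)] = kvOverride{
			Use:       strings.TrimSpace(use),
			InsteadOf: strings.TrimSpace(insteadOf),
		}
	}
	return overrides
}

func main() {
	raw := os.Getenv("OLLAMA_KV_CACHE_TYPE_SPECIAL")
	if raw == "" {
		raw = "qwen2.5vl:7b-q4_K_M,q8_0:q4_0;" // the example from this issue
	}
	for model, o := range parseKVOverrides(raw) {
		fmt.Printf("%s -> use %s (instead of %q)\n", model, o.Use, o.InsteadOf)
	}
}
```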

The current limitation hinders what seems to be the goal of Ollama: getting a variety of local LLM models conveniently working on a single computer.

Another useful improvement would be for Ollama to default to a q4_0 KV cache, except when the model itself is definitely fp16, VRAM is absurdly plentiful compared to the model size, and so on. I think it's fair to say that for the vast majority of practical uses, an f16 or even q8_0 KV cache is excessive.
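
As a rough illustration of that default-selection idea (hypothetical logic with made-up thresholds, not Ollama's actual behavior):

```go
// Hypothetical default KV cache selection, sketching the suggestion above.
// Not Ollama's actual logic; the 4x headroom threshold is invented for illustration.
package main

import "fmt"

func defaultKVCacheType(modelIsFP16 bool, freeVRAMBytes, modelSizeBytes uint64) string {
	switch {
	case modelIsFP16:
		return "f16" // an unquantized model gets an unquantized cache
	case freeVRAMBytes > 4*modelSizeBytes:
		return "q8_0" // lots of headroom relative to the model, keep cache quality high
	default:
		return "q4_0" // the typical consumer/laptop VRAM case this issue is about
	}
}

func main() {
	// e.g. a ~6 GB q4_K_M model on an 8 GB laptop GPU falls through to q4_0
	fmt.Println(defaultKVCacheType(false, 8<<30, 6<<30))
}
```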

Relevant log output


OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

ollama version is 0.9.7-rc0

GiteaMirror added the bug label 2026-04-22 15:54:11 -05:00
Author
Owner

@mirage335 commented on GitHub (Jan 10, 2026):

Well, the workaround is now simply to use Qwen3 VL.

https://ollama.com/mirage335/Qwen-3-VL-8B-Instruct-virtuoso


Reference: github-starred/ollama#33328