[GH-ISSUE #14137] Regression: v0.15.5 caps Qwen 3 context at 40k (was 128k in v0.15.4) due to failed Q4 KV Cache negotiation #9222

Closed
opened 2026-04-12 22:05:19 -05:00 by GiteaMirror · 2 comments

Originally created by @wintsworks on GitHub (Feb 7, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14137

What is the issue?

I am running qwen3:8b on an NVIDIA RTX 3060 (12GB). In version 0.15.4, I could run the full 128k context entirely on the GPU (100% offload) by forcing OLLAMA_KV_CACHE_TYPE=q4_0.

Regression in Ability

After updating to 0.15.5, the model is hard-capped at a 40,960-token context. The logs suggest the new version either fails to apply the Q4 KV cache (falling back to F16) or incorrectly flags Flash Attention as unsupported for this architecture, forcing the memory allocator to cap the context aggressively to fit in VRAM.

In v0.15.5, the logs show the following warning which does not appear (or is ignored) in v0.15.4:
level=WARN source=server.go:257 msg="quantized kv cache requested but flash attention disabled" type=q4_0

(Note: I have explicitly set OLLAMA_FLASH_ATTENTION=1, so this indicates the backend is overriding my setting or failing a compatibility check for Qwen 3).

Comparison:

  • v0.15.4: Allocates ~2.5GB for 128k context (Q4 cache active). Status: 128,000 tokens, 100% GPU.

  • v0.15.5: Disables Q4 cache -> calculates that 128k context requires ~10GB (F16) -> detects insufficient VRAM -> caps the context to fit the remaining ~6GB. Status: 40,960 tokens.

  • Users on consumer hardware (12GB/16GB cards) can no longer run large context windows on Qwen 3 because the memory efficiency features (Flash Attn / Q4 Cache) are silently disabled.

  • This RTX 3060 worked fine with Flash Attention before the update.
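The F16 vs. q4_0 gap described in the comparison above follows directly from bytes-per-cached-value. A rough sketch of the arithmetic (the layer/head counts below are illustrative placeholders, not confirmed Qwen 3 values; only the byte ratio is exact):

```python
# Rough KV-cache size estimate, showing why losing q4_0 forces a context cap.
# ggml's q4_0 stores 32 values in 18 bytes (16 bytes of 4-bit quants plus one
# fp16 scale), i.e. 4.5 bits/value, vs. 16 bits/value for F16.

BYTES_PER_VALUE = {
    "f16": 2.0,           # 16-bit float per cached value
    "q4_0": 18.0 / 32.0,  # 0.5625 bytes per value
}

def kv_cache_bytes(n_ctx: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, cache_type: str) -> float:
    # Both K and V are cached per layer, hence the leading factor of 2.
    per_token = 2 * n_layers * n_kv_heads * head_dim * BYTES_PER_VALUE[cache_type]
    return n_ctx * per_token

# Placeholder model shape (NOT verified Qwen 3 numbers); the ratio is
# independent of the shape anyway:
ratio = (kv_cache_bytes(131072, 36, 8, 128, "q4_0")
         / kv_cache_bytes(131072, 36, 8, 128, "f16"))
print(f"q4_0 footprint relative to f16: {ratio}")
```

At roughly 28% of the F16 footprint, a q4_0 cache turning off is enough to push a 128k context from "fits in 12GB" to "must be capped", consistent with the ~2.5GB vs. ~10GB figures above.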

Systemd Environment Variables

My custom override config for the systemd unit file:
darren@code:~$ cat /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="PATH=/usr/local/cuda/bin:/usr/bin:/bin"
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q4_0"
Environment="OLLAMA_KEEP_ALIVE=-1"
Environment="OLLAMA_ORIGINS=*"
Environment="OLLAMA_PLUGINS=true"
Environment="OLLAMA_GPU_OVERHEAD=0"
Environment="OLLAMA_NUM_GPU=999"
User=darren
Group=ollama
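A drop-in like the one above is easy to typo, so it can help to dump exactly which variables systemd will export. A minimal sketch of a parser for `Environment=` lines (demonstrated on an inline copy of two lines from the listing; point it at the real override.conf path on a host):

```python
import re

def parse_override(text: str) -> dict:
    """Collect Environment="KEY=VALUE" lines from a systemd drop-in."""
    env = {}
    for m in re.finditer(r'^Environment="([^=]+)=(.*)"\s*$', text, re.MULTILINE):
        env[m.group(1)] = m.group(2)
    return env

# Inline sample mirroring the drop-in above; on a real host, read
# /etc/systemd/system/ollama.service.d/override.conf instead.
sample = '''[Service]
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q4_0"
'''
print(parse_override(sample))
```

Remember that edits to a drop-in only take effect after `systemctl daemon-reload` and a restart of the unit.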

Relevant log output

Feb 07 08:36:30 code.ai ollama[9442]: time=2026-02-07T08:36:30.223-06:00 level=INFO source=images.go:480 msg="total unused blobs removed: 0"

Feb 07 08:36:30 code.ai ollama[9442]: time=2026-02-07T08:36:30.227-06:00 level=INFO source=routes.go:1689 msg="Listening on 0.0.0.0:11434 (version 0.15.5)"

Feb 07 08:36:30 code.ai ollama[9442]: time=2026-02-07T08:36:30.229-06:00 level=INFO source=runner.go:67 msg="discovering available GPUs..."

Feb 07 08:36:30 code.ai ollama[9442]: time=2026-02-07T08:36:30.233-06:00 level=INFO source=server.go:430 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 41003"

Feb 07 08:36:30 code.ai ollama[9442]: time=2026-02-07T08:36:30.770-06:00 level=INFO source=server.go:430 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 41371"

Feb 07 08:36:31 code.ai ollama[9442]: time=2026-02-07T08:36:31.053-06:00 level=INFO source=runner.go:106 msg="experimental Vulkan support disabled.  To enable, set OLLAMA_VULKAN=1"

Feb 07 08:36:31 code.ai ollama[9442]: time=2026-02-07T08:36:31.058-06:00 level=INFO source=server.go:430 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 34995"

Feb 07 08:36:31 code.ai ollama[9442]: time=2026-02-07T08:36:31.058-06:00 level=INFO source=server.go:430 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 43585"

Feb 07 08:36:31 code.ai ollama[9442]: time=2026-02-07T08:36:31.254-06:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-0208be3a-fb12-c2fd-6a09-e317a426dbe5 filter_id="" library=CUDA compute=8.6 name=CUDA0 description="NVIDIA GeForce RTX 3060" libdirs=ollama,cuda_v13 driver=13.1 pci_id=0000:01:00.0 type=discrete total="12.0 GiB" available="11.6 GiB"

Feb 07 08:36:31 code.ai ollama[9442]: time=2026-02-07T08:36:31.255-06:00 level=INFO source=routes.go:1739 msg="vram-based default context" total_vram="12.0 GiB" default_num_ctx=4096

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.15.5

GiteaMirror added the bug label 2026-04-12 22:05:19 -05:00

@rick-github commented on GitHub (Feb 7, 2026):

> I am running qwen3:8b on an NVIDIA RTX 3060 (12GB). In version 0.15.4, I could run the full 128k context entirely on the GPU (100% offload) by forcing OLLAMA_KV_CACHE_TYPE=q4_0.

qwen3:8b does not support a context length of 128k. What's changed is that `ollama ps` in 0.15.5 now accurately displays (https://github.com/ollama/ollama/commit/d11fbd2c603aad64535c5cecd6ee68a02d01aa0c) the context length that the model is running with.

$ ollama show qwen3:8b
  Model
    architecture        qwen3     
    parameters          8.2B      
    context length      40960     # <-- maximum context that the model can use
    embedding length    4096      
    quantization        Q4_K_M    

  Capabilities
    completion    
    tools         
    thinking      

ollama 0.15.4:

$ OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q4_0 OLLAMA_CONTEXT_LENGTH=131072 ollama serve
$ ollama run qwen3:8b ''
$ ollama ps
NAME        ID              SIZE      PROCESSOR    CONTEXT    UNTIL   
qwen3:8b    500a1f067a9f    7.3 GB    100% GPU     131072     Forever    
$ nvidia-smi | grep ollama
|    0   N/A  N/A         3342861      C   /usr/bin/ollama                        6898MiB |

ollama 0.15.5:

$ OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q4_0 OLLAMA_CONTEXT_LENGTH=131072 ollama serve
$ ollama run qwen3:8b ''
$ ollama ps
NAME        ID              SIZE      PROCESSOR    CONTEXT    UNTIL   
qwen3:8b    500a1f067a9f    7.3 GB    100% GPU     40960      Forever    
$ nvidia-smi | grep ollama
|    0   N/A  N/A         3344558      C   /usr/bin/ollama                        6898MiB |

OLLAMA_PLUGINS is not an ollama configuration variable.


@wintsworks commented on GitHub (Feb 7, 2026):

> > I am running qwen3:8b on an NVIDIA RTX 3060 (12GB). In version 0.15.4, I could run the full 128k context entirely on the GPU (100% offload) by forcing OLLAMA_KV_CACHE_TYPE=q4_0.
>
> qwen3:8b does not support a context length of 128k. What's changed is that `ollama ps` in 0.15.5 now accurately displays the context length that the model is running with.
>
> $ ollama show qwen3:8b
>   Model
>     architecture        qwen3
>     parameters          8.2B
>     context length      40960     # <-- maximum context that the model can use
>     embedding length    4096
>     quantization        Q4_K_M
>
>   Capabilities
>     completion
>     tools
>     thinking
>
> ollama 0.15.4:
>
> $ OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q4_0 OLLAMA_CONTEXT_LENGTH=131072 ollama serve
> $ ollama run qwen3:8b ''
> $ ollama ps
> NAME        ID              SIZE      PROCESSOR    CONTEXT    UNTIL
> qwen3:8b    500a1f067a9f    7.3 GB    100% GPU     131072     Forever
> $ nvidia-smi | grep ollama
> |    0   N/A  N/A         3342861      C   /usr/bin/ollama                        6898MiB |
>
> ollama 0.15.5:
>
> $ OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q4_0 OLLAMA_CONTEXT_LENGTH=131072 ollama serve
> $ ollama run qwen3:8b ''
> $ ollama ps
> NAME        ID              SIZE      PROCESSOR    CONTEXT    UNTIL
> qwen3:8b    500a1f067a9f    7.3 GB    100% GPU     40960      Forever
> $ nvidia-smi | grep ollama
> |    0   N/A  N/A         3344558      C   /usr/bin/ollama                        6898MiB |
>
> OLLAMA_PLUGINS is not an ollama configuration variable.

Thank you, I would've never figured it out on my own, so thanks for saving me hours of diagnosing. Have a good one.

Reference: github-starred/ollama#9222