[GH-ISSUE #11997] Unused model is not unloaded from VRAM with bigger num_ctx #7966

Closed
opened 2026-04-12 20:08:49 -05:00 by GiteaMirror · 19 comments

Originally created by @somera on GitHub (Aug 20, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11997

What is the issue?

I have one RTX A4000 with 48GB VRAM. These models:

NAME               ID              SIZE     PROCESSOR    CONTEXT    UNTIL
qwen3-coder:30b    ad67f85ca250    20 GB    100% GPU     4096       4 minutes from now
gpt-oss:20b        aa4295ac10c3    14 GB    100% GPU     8192       4 minutes from now

can both be loaded into VRAM and used in parallel with a small num_ctx.

I'm working with Open WebUI v0.6.22.

I have a small Python project with several files, ~1800 lines in total. When I paste this into my Open WebUI chat with num_ctx=42000 using the gpt-oss:20b model, everything works fine. The review is generated, and no other models are loaded into VRAM.

However, after I have used qwen3-coder:30b, both models remain loaded in VRAM with the status "4 minutes from now".

When I repeat the test with the ~1800 lines of Python code and num_ctx=42000, the request is sent to Ollama and starts processing. The memory usage grows with the larger num_ctx.
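
For reference, the kind of request Open WebUI ends up sending on my behalf looks roughly like this (a simplified sketch; the prompt is a placeholder for the pasted project, and num_ctx is passed per request via the standard options field):

```shell
# Sketch of a chat request with a per-request num_ctx override.
# The prompt is a placeholder for the ~1800 lines of pasted Python code.
curl http://localhost:11434/api/chat -d '{
  "model": "gpt-oss:20b",
  "messages": [
    {"role": "user", "content": "Please review this Python project: ..."}
  ],
  "options": {"num_ctx": 42000},
  "stream": false
}'
```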

My expectation is that the unused model (qwen3-coder:30b in this case) would be unloaded automatically so Ollama could finish the job. But instead, Ollama stops working on the prompt because there isn’t enough free VRAM for both qwen3-coder:30b and gpt-oss:20b with num_ctx=42000. I have to wait until the unused model (qwen3-coder:30b) is eventually unloaded from VRAM before the request can be processed again.

Is my expectation wrong or is there a bug?

Because unloading unused models works fine when num_ctx is small (4096). I can see this when I run my preload-model script:

$ ./ollama_preload_models_v7.sh
✅ Ollama is running.
⏭️ Skipping mxbai-embed-large:latest (explicitly excluded)
⏭️ Skipping nomic-embed-text:latest (explicitly excluded)
⏭️ Skipping deepseek-coder-v2:16b (explicitly excluded)
⏭️ Skipping bge-m3:latest (explicitly excluded)
Preloading llama3.2:3b (2.0 GB, 2.0 GB)...
I'm ready when you are. What's on your mind?

✅ Done in 2 seconds.
Preloading phi4-mini:3.8b (2.5 GB, 2.5 GB)...
Always ready to assist you! What can I help with today?

✅ Done in 2 seconds.
Preloading llama2-uncensored:latest (3.8 GB, 3.7 GB)...
Ready.

✅ Done in 1 seconds.
Preloading mistral:7b-instruct (4.4 GB, 4.3 GB)...
 Absolutely! I'm here to help. What do you need assistance with today?

✅ Done in 2 seconds.
Preloading llama3.2-vision:11b (7.8 GB, 7.7 GB)...
Go ahead!

✅ Done in 3 seconds.
Preloading deepseek-coder-v2-fixed:latest (8.9 GB, 8.8 GB)...
 Yes, I am ready. How can I assist you today?

✅ Done in 3 seconds.
Preloading phi4:latest (9.1 GB, 9.0 GB)...
Absolutely! I'm ready to help. What do you need assistance with today? Whether it's answering questions, providing information, or helping brainstorm ideas, just let me
know how I can assist you. 😊

...

✅ Done in 12 seconds.
Preloading qwen3-coder:30b (18 GB, 18.0 GB)...
Ready! I'm here and ready to help with whatever you'd like to discuss or work on. What can I assist you with today?

✅ Done in 5 seconds.
Preloading deepseek-r1:32b (19 GB, 19.0 GB)...
Yes, I'm ready! How can I assist you today?

✅ Done in 6 seconds.
All models processed.
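
(For context, a preload loop along these lines produces output like the above. This is a hypothetical, simplified sketch rather than the actual v7 script; the model list, prompt text, and the jq dependency are only illustrative.)

```shell
# Simplified sketch of a preload loop: send a trivial prompt to each model so
# Ollama loads it into VRAM with its default (small) context size.
# Requires curl and jq; the model list here is only an example.
MODELS="llama3.2:3b phi4-mini:3.8b mistral:7b-instruct qwen3-coder:30b"

for MODEL in $MODELS; do
  echo "Preloading $MODEL..."
  curl -s http://localhost:11434/api/generate -d "{
    \"model\": \"$MODEL\",
    \"prompt\": \"Are you ready?\",
    \"stream\": false
  }" | jq -r '.response'
done
echo "All models processed."
```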

Relevant log output

No logs

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.11.5

GiteaMirror added the bug label 2026-04-12 20:08:49 -05:00

@jessegross commented on GitHub (Aug 20, 2025):

Can you please post the [server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) with OLLAMA_DEBUG=1 set? Does it happen every time with the sequence of events that triggers it?
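
(If it helps, on a Linux systemd install something along these lines captures them; a sketch assuming the default ollama.service unit name:)

```shell
# Sketch: capture debug-level server logs on a Linux systemd install
# (assumes the default ollama.service unit name).
sudo systemctl edit ollama    # add:  [Service]
                              #       Environment="OLLAMA_DEBUG=1"
sudo systemctl restart ollama

# Reproduce the issue, then dump the logs to a file:
journalctl -u ollama --no-pager > ollama.log
```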


@somera commented on GitHub (Aug 20, 2025):

@jessegross I'm off for now. I'll send the logs tomorrow.


@rick-github commented on GitHub (Aug 20, 2025):

gpt-oss:20b with a context of 42000 only takes 20G, so there's plenty of room for gpt-oss:20b and qwen3-coder:30b to co-reside in 48G.

NAME               ID              SIZE     PROCESSOR    CONTEXT    UNTIL   
qwen3-coder:30b    ad67f85ca250    20 GB    100% GPU     4096       Forever    
gpt-oss:20b        aa4295ac10c3    20 GB    100% GPU     42000      Forever    

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging.


@somera commented on GitHub (Aug 20, 2025):

First run:

NAME             ID              SIZE      PROCESSOR    CONTEXT    UNTIL
gpt-oss:20b      aa4295ac10c3    20 GB     100% GPU     42000      4 minutes from now
bge-m3:latest    790764642607    1.7 GB    100% GPU     4096       4 minutes from now

But nvi-top shows this during processing:

![Image](https://github.com/user-attachments/assets/c6222ba3-04fe-40b5-b1ff-bb25f3640475)

And 74.9% of 48 GB = 35.952 GB; add the 19 GB for qwen3-coder:30b and that is more than the available 48 GB.

@rick-github how do you explain this?


@somera commented on GitHub (Aug 20, 2025):

And then...

![Image](https://github.com/user-attachments/assets/6121368c-b8b7-4c7b-95c2-a0d8ab8c62ec)
NAME               ID              SIZE      PROCESSOR    CONTEXT    UNTIL
gpt-oss:20b        aa4295ac10c3    14 GB     100% GPU     8192       3 minutes from now
bge-m3:latest      790764642607    1.7 GB    100% GPU     4096       2 minutes from now
qwen3-coder:30b    ad67f85ca250    20 GB     100% GPU     4096       About a minute from now

The used VRAM grows to ~91-95% and Ollama stops working on the request.

In Open WebUI I see this:

![Image](https://github.com/user-attachments/assets/a0664bf5-68c7-4cef-9f26-9c3baeb610f6)

And here are the ollama logs:

[ollama.log.gz](https://github.com/user-attachments/files/21907784/ollama.log.gz)


@somera commented on GitHub (Aug 21, 2025):

> Does it happen every time with the sequence of events that triggers it?

What do you mean by this question?

I uploaded the logs in the comment above.


@jessegross commented on GitHub (Aug 21, 2025):

The main thing that I see in the logs is the runner crashing with OOM:

Aug 21 01:04:26 AI-DEV-VM-Neptun ollama[3864382]: CUDA error: out of memory
Aug 21 01:04:26 AI-DEV-VM-Neptun ollama[3864382]:   current device: 0, in function alloc at //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:424
Aug 21 01:04:26 AI-DEV-VM-Neptun ollama[3864382]:   ggml_cuda_device_malloc(&ptr, look_ahead_size, device)
Aug 21 01:04:26 AI-DEV-VM-Neptun ollama[3864382]: //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:84: CUDA error

It notices that pretty much immediately when the next request comes in and tries to reload the model:

Aug 21 01:04:27 AI-DEV-VM-Neptun ollama[3864382]: time=2025-08-21T01:04:27.515+02:00 level=DEBUG source=sched.go:154 msg=reloading runner.name=registry.ollama.ai/library/gpt-oss:20b runner.inference=cuda runner.devices=1 runner.size="18.8 GiB" runner.vram="18.8 GiB" runner.parallel=1 runner.pid=3868900 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 runner.num_ctx=42000

However, it also wants to check that any previously used VRAM has been released before it calculates offloading. Unfortunately, since the runner crashed, the VRAM has already been released and it doesn't see any change until it times out:

Aug 21 01:04:32 AI-DEV-VM-Neptun ollama[3864382]: time=2025-08-21T01:04:32.862+02:00 level=WARN source=sched.go:652 msg="gpu VRAM usage didn't recover within timeout" seconds=5.346907911 runner.size="18.8 GiB"

After the timeout, it takes another 4 seconds to reload the model:

Aug 21 01:04:37 AI-DEV-VM-Neptun ollama[3864382]: time=2025-08-21T01:04:37.126+02:00 level=INFO source=server.go:1272 msg="llama runner started in 4.07 seconds"

Now it can begin actually processing that request, which takes 5 seconds:

Aug 21 01:04:42 AI-DEV-VM-Neptun ollama[3864382]: [GIN] 2025/08/21 - 01:04:42 | 200 | 15.011479442s | 10.36.201.14 | POST "/api/chat"

In total, it took 15 seconds to respond to the new request (plus it looks like the original one took 14 seconds before the crash happened). Does that line up with what you are seeing? Assuming yes, I think that is the cause of the delay rather than anything to do with waiting for qwen3 to unload. The timeline for a model to be unloaded purely based on its timeout is generally minutes. In this case, the main influence of the additional models and longer context length is to increase memory pressure, triggering the crash in the first place.


@somera commented on GitHub (Aug 21, 2025):

> In total, it took 15 seconds to respond to the new request (plus it looks like the original one took 14 seconds before the crash happened). Does that line up with what you are seeing? Assuming yes, I think that is the cause of the delay rather than anything to do with waiting for qwen3 to unload. The timeline for a model to be unloaded purely based on its timeout is generally minutes.

You can see the growing VRAM usage here, and the crash too.

![Image](https://github.com/user-attachments/assets/d100acbd-cc9e-4ade-bfbd-d311c82d0ddd)

But this kills the API call from Open WebUI.

> In this case, the main influence of the additional models and longer context length is to increase memory pressure, triggering the crash in the first place.

Is this then an Ollama bug, or is there a property to configure it?


@somera commented on GitHub (Aug 21, 2025):

> gpt-oss:20b with a context of 42000 only takes 20G, so there's plenty of room for gpt-oss:20b and qwen3-coder:30b to co-reside in 48G.

In my real case it consumes 74.9% of 48 GB = 35.952 GB for gpt-oss:20b with num_ctx=42000. Adding the 19 GB for qwen3-coder:30b gives roughly 55 GB, which is more than the available 48 GB of VRAM.


@jessegross commented on GitHub (Aug 21, 2025):

> Is this then an Ollama bug, or is there a property to configure it?

It's an Ollama bug, but you can probably work around it by setting OLLAMA_FLASH_ATTENTION=1. (I recommend upgrading to 0.11.6 if you do this.)
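
(On a systemd install, the same kind of unit override used for OLLAMA_DEBUG=1 above works here as well; a sketch assuming the default ollama.service unit name:)

```shell
# Sketch: persistently enable flash attention for a systemd-managed Ollama
# (assumes the default ollama.service unit name).
sudo systemctl edit ollama    # add:  [Service]
                              #       Environment="OLLAMA_FLASH_ATTENTION=1"
sudo systemctl restart ollama

# The server log should mention flash attention once a model is loaded:
journalctl -u ollama | grep -i "flash attention"
```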


@somera commented on GitHub (Aug 21, 2025):

@jessegross thanks for the analysis. I updated to 0.11.6 today, so I can set that now.


@somera commented on GitHub (Aug 21, 2025):

@jessegross it's set and it looks good.

![Image](https://github.com/user-attachments/assets/32fd90a9-c761-401b-b0d0-06cffbd75e4c)

Is the OLLAMA_FLASH_ATTENTION=1 setting only meant as a temporary workaround?


@jessegross commented on GitHub (Aug 21, 2025):

Glad to hear that it is working.

We're probably going to turn on OLLAMA_FLASH_ATTENTION by default for gpt-oss in the next release or so. This is for reasons unrelated to this bug (it's faster and it saves memory). gpt-oss is particularly impacted by the issue you saw here, so it will help mitigate that as well.

We would still like to fix the underlying issue, but that will take longer. There's already an existing bug for this, so I am going to close this one now that we know it is the same issue. Thanks for your help in tracking it down.


@somera commented on GitHub (Aug 25, 2025):

@jessegross can OLLAMA_FLASH_ATTENTION=1 cause performance issues?


@jessegross commented on GitHub (Aug 25, 2025):

I'm not aware of performance issues with gpt-oss. For other models, it should generally improve performance but it's hard to make a blanket statement. What are you seeing?


@somera commented on GitHub (Aug 26, 2025):

@jessegross a team member is seeing performance issues with deepseek-coder-v2:16b using VS Code + the Continue plugin for code autocompletion. I want to look into this.


@jessegross commented on GitHub (Aug 26, 2025):

deepseek-coder-v2:16b doesn't support flash attention and it is automatically disabled, so you shouldn't see any performance difference regardless of that setting:
time=2025-08-26T09:46:04.715-07:00 level=WARN source=server.go:204 msg="flash attention enabled but not supported by model"


@somera commented on GitHub (Aug 26, 2025):

@jessegross thanks for now. If I find anything, I'll open an issue.


@somera commented on GitHub (Sep 9, 2025):

@jessegross just FYI, the Continue plugin for VS Code is the problem. We can reproduce it: https://github.com/continuedev/continue/issues/5055

Ollama works fine!
