[GH-ISSUE #11997] Unused model is not unloaded from VRAM with bigger num_ctx #7966

Closed
opened 2026-04-12 20:08:49 -05:00 by GiteaMirror · 19 comments

Originally created by @somera on GitHub (Aug 20, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11997

What is the issue?

I have one RTX A4000 with 48GB VRAM. These models:

NAME               ID              SIZE     PROCESSOR    CONTEXT    UNTIL
qwen3-coder:30b    ad67f85ca250    20 GB    100% GPU     4096       4 minutes from now
gpt-oss:20b        aa4295ac10c3    14 GB    100% GPU     8192       4 minutes from now

can both be loaded into VRAM and used in parallel with a small num_ctx.

I'm working with Open WebUI v0.6.22.

I have a small Python project with several files, ~1800 lines in total. When I paste this into my Open WebUI chat with num_ctx=42000 using the gpt-oss:20b model, everything works fine. The review is generated, and no other models are loaded into VRAM.

However, after I have used qwen3-coder:30b, both models remain loaded in VRAM with the status "4 minutes from now".

When I repeat the test with the ~1800 lines of Python code and num_ctx=42000, the request is sent to Ollama and starts processing. The memory usage grows with the larger num_ctx.
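
For reference, the kind of request Open WebUI ends up sending on my behalf looks roughly like this (a simplified sketch; the prompt is a placeholder for the pasted project, and num_ctx is passed per request via the standard options field):

```shell
# Sketch of a chat request with a per-request num_ctx override.
# The prompt is a placeholder for the ~1800 lines of pasted Python code.
curl http://localhost:11434/api/chat -d '{
  "model": "gpt-oss:20b",
  "messages": [
    {"role": "user", "content": "Please review this Python project: ..."}
  ],
  "options": {"num_ctx": 42000},
  "stream": false
}'
```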

My expectation is that the unused model (qwen3-coder:30b in this case) would be unloaded automatically so Ollama could finish the job. But instead, Ollama stops working on the prompt because there isn’t enough free VRAM for both qwen3-coder:30b and gpt-oss:20b with num_ctx=42000. I have to wait until the unused model (qwen3-coder:30b) is eventually unloaded from VRAM before the request can be processed again.

Is my expectation wrong or is there a bug?

Because unloading unused models works fine when num_ctx is small (4096). I can see this when I run my preload-model script:

$ ./ollama_preload_models_v7.sh
✅ Ollama is running.
⏭️ Skipping mxbai-embed-large:latest (explicitly excluded)
⏭️ Skipping nomic-embed-text:latest (explicitly excluded)
⏭️ Skipping deepseek-coder-v2:16b (explicitly excluded)
⏭️ Skipping bge-m3:latest (explicitly excluded)
Preloading llama3.2:3b (2.0 GB, 2.0 GB)...
I'm ready when you are. What's on your mind?

✅ Done in 2 seconds.
Preloading phi4-mini:3.8b (2.5 GB, 2.5 GB)...
Always ready to assist you! What can I help with today?

✅ Done in 2 seconds.
Preloading llama2-uncensored:latest (3.8 GB, 3.7 GB)...
Ready.

✅ Done in 1 seconds.
Preloading mistral:7b-instruct (4.4 GB, 4.3 GB)...
 Absolutely! I'm here to help. What do you need assistance with today?

✅ Done in 2 seconds.
Preloading llama3.2-vision:11b (7.8 GB, 7.7 GB)...
Go ahead!

✅ Done in 3 seconds.
Preloading deepseek-coder-v2-fixed:latest (8.9 GB, 8.8 GB)...
 Yes, I am ready. How can I assist you today?

✅ Done in 3 seconds.
Preloading phi4:latest (9.1 GB, 9.0 GB)...
Absolutely! I'm ready to help. What do you need assistance with today? Whether it's answering questions, providing information, or helping brainstorm ideas, just let me
know how I can assist you. 😊

...

✅ Done in 12 seconds.
Preloading qwen3-coder:30b (18 GB, 18.0 GB)...
Ready! I'm here and ready to help with whatever you'd like to discuss or work on. What can I assist you with today?

✅ Done in 5 seconds.
Preloading deepseek-r1:32b (19 GB, 19.0 GB)...
Yes, I'm ready! How can I assist you today?

✅ Done in 6 seconds.
All models processed.
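
(For context, a preload loop along these lines produces output like the above. This is a hypothetical, simplified sketch rather than the actual v7 script; the model list, prompt text, and the jq dependency are only illustrative.)

```shell
# Simplified sketch of a preload loop: send a trivial prompt to each model so
# Ollama loads it into VRAM with its default (small) context size.
# Requires curl and jq; the model list here is only an example.
MODELS="llama3.2:3b phi4-mini:3.8b mistral:7b-instruct qwen3-coder:30b"

for MODEL in $MODELS; do
  echo "Preloading $MODEL..."
  curl -s http://localhost:11434/api/generate -d "{
    \"model\": \"$MODEL\",
    \"prompt\": \"Are you ready?\",
    \"stream\": false
  }" | jq -r '.response'
done
echo "All models processed."
```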

Relevant log output

No logs

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.11.5

GiteaMirror added the bug label 2026-04-12 20:08:49 -05:00

@jessegross commented on GitHub (Aug 20, 2025):

Can you please post the [server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) with OLLAMA_DEBUG=1 set? Does it happen every time with the sequence of events that triggers it?
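
(If it helps, on a Linux systemd install something along these lines captures them; a sketch assuming the default ollama.service unit name:)

```shell
# Sketch: capture debug-level server logs on a Linux systemd install
# (assumes the default ollama.service unit name).
sudo systemctl edit ollama    # add:  [Service]
                              #       Environment="OLLAMA_DEBUG=1"
sudo systemctl restart ollama

# Reproduce the issue, then dump the logs to a file:
journalctl -u ollama --no-pager > ollama.log
```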


@somera commented on GitHub (Aug 20, 2025):

@jessegross I'm off for now. I'll send the logs tomorrow.


@rick-github commented on GitHub (Aug 20, 2025):

gpt-oss:20b with a context of 42000 only takes 20G, so there's plenty of room for gpt-oss:20b and qwen3-coder:30b to co-reside in 48G.

NAME               ID              SIZE     PROCESSOR    CONTEXT    UNTIL   
qwen3-coder:30b    ad67f85ca250    20 GB    100% GPU     4096       Forever    
gpt-oss:20b        aa4295ac10c3    20 GB    100% GPU     42000      Forever    

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging.


@somera commented on GitHub (Aug 20, 2025):

First run:

NAME             ID              SIZE      PROCESSOR    CONTEXT    UNTIL
gpt-oss:20b      aa4295ac10c3    20 GB     100% GPU     42000      4 minutes from now
bge-m3:latest    790764642607    1.7 GB    100% GPU     4096       4 minutes from now

But nvi-top shows this during processing:

![Image](https://github.com/user-attachments/assets/c6222ba3-04fe-40b5-b1ff-bb25f3640475)

And 74.9% of 48 GB = 35.952 GB; add the 19 GB for qwen3-coder:30b and that is more than the available 48 GB.

@rick-github how do you explain this?


@somera commented on GitHub (Aug 20, 2025):

And then...

![Image](https://github.com/user-attachments/assets/6121368c-b8b7-4c7b-95c2-a0d8ab8c62ec)
NAME               ID              SIZE      PROCESSOR    CONTEXT    UNTIL
gpt-oss:20b        aa4295ac10c3    14 GB     100% GPU     8192       3 minutes from now
bge-m3:latest      790764642607    1.7 GB    100% GPU     4096       2 minutes from now
qwen3-coder:30b    ad67f85ca250    20 GB     100% GPU     4096       About a minute from now

The used VRAM grows to ~91-95% and Ollama stops working on the request.

In Open WebUI I see this:

![Image](https://github.com/user-attachments/assets/a0664bf5-68c7-4cef-9f26-9c3baeb610f6)

And here are the ollama logs:

[ollama.log.gz](https://github.com/user-attachments/files/21907784/ollama.log.gz)


@somera commented on GitHub (Aug 21, 2025):

> Does it happen every time with the sequence of events that triggers it?

What do you mean by this question?

I uploaded the logs in the comment above.


@jessegross commented on GitHub (Aug 21, 2025):

The main thing that I see in the logs is the runner crashing with OOM:

Aug 21 01:04:26 AI-DEV-VM-Neptun ollama[3864382]: CUDA error: out of memory
Aug 21 01:04:26 AI-DEV-VM-Neptun ollama[3864382]:   current device: 0, in function alloc at //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:424
Aug 21 01:04:26 AI-DEV-VM-Neptun ollama[3864382]:   ggml_cuda_device_malloc(&ptr, look_ahead_size, device)
Aug 21 01:04:26 AI-DEV-VM-Neptun ollama[3864382]: //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:84: CUDA error

It notices that pretty much immediately when the next request comes in and tries to reload the model:

Aug 21 01:04:27 AI-DEV-VM-Neptun ollama[3864382]: time=2025-08-21T01:04:27.515+02:00 level=DEBUG source=sched.go:154 msg=reloading runner.name=registry.ollama.ai/library/gpt-oss:20b runner.inference=cuda runner.devices=1 runner.size="18.8 GiB" runner.vram="18.8 GiB" runner.parallel=1 runner.pid=3868900 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 runner.num_ctx=42000

However, it also wants to check that any previously used VRAM has been released before it calculates offloading. Unfortunately, since the runner crashed, the VRAM has already been released and it doesn't see any change until it times out:

Aug 21 01:04:32 AI-DEV-VM-Neptun ollama[3864382]: time=2025-08-21T01:04:32.862+02:00 level=WARN source=sched.go:652 msg="gpu VRAM usage didn't recover within timeout" seconds=5.346907911 runner.size="18.8 GiB"

After the timeout, it takes another 4 seconds to reload the model:

Aug 21 01:04:37 AI-DEV-VM-Neptun ollama[3864382]: time=2025-08-21T01:04:37.126+02:00 level=INFO source=server.go:1272 msg="llama runner started in 4.07 seconds"

Now it can begin actually processing that request, which takes 5 seconds:

Aug 21 01:04:42 AI-DEV-VM-Neptun ollama[3864382]: [GIN] 2025/08/21 - 01:04:42 | 200 | 15.011479442s | 10.36.201.14 | POST "/api/chat"

In total, it took 15 seconds to respond to the new request (plus it looks like the original one took 14 seconds before the crash happened). Does that line up with what you are seeing? Assuming yes, I think that is the cause of the delay rather than anything to do with waiting for qwen3 to unload. The timeline for a model to be unloaded purely based on its timeout is generally minutes. In this case, the main influence of the additional models and longer context length is to increase memory pressure, triggering the crash in the first place.


@somera commented on GitHub (Aug 21, 2025):

> In total, it took 15 seconds to respond to the new request (plus it looks like the original one took 14 seconds before the crash happened). Does that line up with what you are seeing? Assuming yes, I think that is the cause of the delay rather than anything to do with waiting for qwen3 to unload. The timeline for a model to be unloaded purely based on its timeout is generally minutes.

You can see the growing VRAM usage here, and the crash too.

![Image](https://github.com/user-attachments/assets/d100acbd-cc9e-4ade-bfbd-d311c82d0ddd)

But this kills the API call from Open WebUI.

> In this case, the main influence of the additional models and longer context length is to increase memory pressure, triggering the crash in the first place.

Is this then an Ollama bug, or is there a property to configure it?


@somera commented on GitHub (Aug 21, 2025):

> gpt-oss:20b with a context of 42000 only takes 20G, so there's plenty of room for gpt-oss:20b and qwen3-coder:30b to co-reside in 48G.

In my real case it consumes 74.9% of 48 GB = 35.952 GB for gpt-oss:20b with num_ctx=42000. Adding the 19 GB for qwen3-coder:30b gives roughly 55 GB, which is more than the available 48 GB of VRAM.


@jessegross commented on GitHub (Aug 21, 2025):

> Is this then an Ollama bug, or is there a property to configure it?

It's an Ollama bug, but you can probably work around it by setting OLLAMA_FLASH_ATTENTION=1. (I recommend upgrading to 0.11.6 if you do this.)
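
(On a systemd install, the same kind of unit override used for OLLAMA_DEBUG=1 above works here as well; a sketch assuming the default ollama.service unit name:)

```shell
# Sketch: persistently enable flash attention for a systemd-managed Ollama
# (assumes the default ollama.service unit name).
sudo systemctl edit ollama    # add:  [Service]
                              #       Environment="OLLAMA_FLASH_ATTENTION=1"
sudo systemctl restart ollama

# The server log should mention flash attention once a model is loaded:
journalctl -u ollama | grep -i "flash attention"
```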


@somera commented on GitHub (Aug 21, 2025):

@jessegross thanks for the analysis. I updated to 0.11.6 today, so I can set that now.


@somera commented on GitHub (Aug 21, 2025):

@jessegross it's set and it looks good.

![Image](https://github.com/user-attachments/assets/32fd90a9-c761-401b-b0d0-06cffbd75e4c)

Is the OLLAMA_FLASH_ATTENTION=1 setting only meant as a temporary workaround?


@jessegross commented on GitHub (Aug 21, 2025):

Glad to hear that it is working.

We're probably going to turn on OLLAMA_FLASH_ATTENTION by default for gpt-oss in the next release or so. This is for reasons unrelated to this bug (it's faster and it saves memory). gpt-oss is particularly impacted by the issue you saw here, so it will help mitigate that as well.

We would still like to fix the underlying issue, but that will take longer. There's already an existing bug for this, so I am going to close this one now that we know it is the same issue. Thanks for your help in tracking it down.


@somera commented on GitHub (Aug 25, 2025):

@jessegross can OLLAMA_FLASH_ATTENTION=1 cause performance issues?


@jessegross commented on GitHub (Aug 25, 2025):

I'm not aware of performance issues with gpt-oss. For other models, it should generally improve performance but it's hard to make a blanket statement. What are you seeing?


@somera commented on GitHub (Aug 26, 2025):

@jessegross a team member is seeing performance issues with deepseek-coder-v2:16b using VS Code + the Continue plugin for code autocompletion. I want to look into this.


@jessegross commented on GitHub (Aug 26, 2025):

deepseek-coder-v2:16b doesn't support flash attention and it is automatically disabled, so you shouldn't see any performance difference regardless of that setting:
time=2025-08-26T09:46:04.715-07:00 level=WARN source=server.go:204 msg="flash attention enabled but not supported by model"


@somera commented on GitHub (Aug 26, 2025):

@jessegross thanks for now. If I find anything, I'll open an issue.


@somera commented on GitHub (Sep 9, 2025):

@jessegross just FYI, the Continue plugin for VS Code is the problem. We can reproduce it: https://github.com/continuedev/continue/issues/5055

Ollama works fine!
