[GH-ISSUE #5842] Model Reloading and Excessive VRAM Usage Issues with Ollama Backend #3643

Closed
opened 2026-04-12 14:25:18 -05:00 by GiteaMirror · 2 comments

Originally created by @ALEX000V on GitHub (Jul 22, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5842

What is the issue?

Relevant environment info

- OS: Windows 11 23H2
- Continue: v0.8.43 / v0.0.55
- IDE: VSCode 1.91.1 / IntelliJ IDEA 2024.1.4 (Community Edition)
- Model: deepseek-coder-v2:16b-lite-instruct-q4_0 
- ollama: v0.2.7
- CUDA: V12.3.103
- config.json:
  
{
  "models": [
    {
      "title": "Ollama",
      "provider": "ollama",
      "model": "AUTODETECT"
    },
    {
      "title": "deepseek-coder-v2:16b",
      "provider": "ollama",
      "model": "deepseek-coder-v2:16b-lite-instruct-q4_0",
      "apiBase": "http://localhost:11434"
    }
  ],
  "tabAutocompleteModel": {
    "title": "deepseek-coder-v2:16b",
    "provider": "ollama",
    "model": "deepseek-coder-v2:16b-lite-instruct-q4_0"
  },
  "allowAnonymousTelemetry": false
}

Description

Description:
I am encountering two distinct but related issues when using the Continue plugin in VSCode and IDEA with Ollama as the backend for model processing. The first issue involves models being repeatedly loaded and unloaded when accessed through different interfaces, and the second issue pertains to abnormal VRAM usage when models are accessed via the Continue plugin compared to other methods.

VRAM Changes Example:

The screenshot below shows VRAM usage with OLLAMA_NUM_PARALLEL set to 4 across three steps: loading the deepseek-coder-v2:16b-lite-instruct-q4_0 model through the Continue plugin, running ollama run deepseek-coder-v2:16b-lite-instruct-q4_0 from the command line, and then sending a request to the model through the Continue plugin again.

[Image: vram_load] https://github.com/user-attachments/assets/0c11355e-3107-464e-bd5e-c55594c66bb7

To reproduce

Issue 1: Model Reloading

  • Steps to Reproduce:
    1. Open VSCode/IDEA and activate the Continue plugin.
    2. Use the Continue plugin to send a request to Ollama to load a specific model.
    3. Attempt to access the same model via another non-Continue plugin method (e.g., command line, another plugin, or web UI).
    4. Observe that the model is unloaded and then reloaded, rather than being reused from memory.
    5. Alternatively, first load the model using a non-Continue plugin method, then attempt to access it via the Continue plugin.
    6. Observe that the model is again unloaded and then reloaded.
  • Expected Behavior:
    Once a model is loaded by any interface, it should be accessible by all other interfaces without needing to be reloaded.
  • Actual Behavior:
    Each time the model is accessed through a different (Continue vs. non-Continue) interface, it is unloaded and then reloaded.

Issue 2: Excessive VRAM Usage

  • Steps to Reproduce:
    1. Use the Continue plugin to load a model (e.g., deepseek-coder-v2:16b-lite-instruct-q4_0).
    2. Compare the VRAM usage with the same model loaded via other methods (e.g., command line, other plugins, or web UI).
    3. Observe that the VRAM usage is significantly higher when using the Continue plugin.
  • Expected Behavior:
    The VRAM usage should be consistent regardless of the method used to load the model.
  • Actual Behavior:
    The VRAM usage is higher when using the Continue plugin compared to other methods. For example, with each increment of OLLAMA_NUM_PARALLEL by 1, the command line method adds 0.6GiB, while the Continue plugin adds 1.2GiB.
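
The per-increment figures above can be compared with a little arithmetic. The sketch below only restates the two reported numbers (0.6 GiB and 1.2 GiB per increment of OLLAMA_NUM_PARALLEL); the helper name `added_vram_gib` and the choice of three increments (going from 1 to 4) are illustrative assumptions, not measurements from this report.

```python
# VRAM added per +1 of OLLAMA_NUM_PARALLEL, as reported in this issue.
CLI_GIB_PER_INCREMENT = 0.6
CONTINUE_GIB_PER_INCREMENT = 1.2


def added_vram_gib(increments: int, gib_per_increment: float) -> float:
    """Extra VRAM (GiB) after raising OLLAMA_NUM_PARALLEL by `increments`."""
    return increments * gib_per_increment


# How much more each parallel slot costs through Continue than through the CLI.
ratio = CONTINUE_GIB_PER_INCREMENT / CLI_GIB_PER_INCREMENT

# Hypothetical example: raising OLLAMA_NUM_PARALLEL from 1 to 4 is three increments.
cli_extra = added_vram_gib(3, CLI_GIB_PER_INCREMENT)
continue_extra = added_vram_gib(3, CONTINUE_GIB_PER_INCREMENT)

print(f"per-slot ratio (Continue / CLI): {ratio:.1f}")
print(f"extra VRAM for 3 increments: CLI {cli_extra:.1f} GiB, Continue {continue_extra:.1f} GiB")
```

A constant factor of about two per slot is what one might expect if the two clients request different context sizes and per-slot memory grows roughly linearly with context size. That linear scaling is an assumption here, not something this report measures.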

Log output

No response

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.2.7

GiteaMirror added the bug label 2026-04-12 14:25:18 -05:00

@rick-github commented on GitHub (Jul 22, 2024):

If you can provide server logs from ollama it will be easier to diagnose.

However, based on your report, I'm going to guess that your two access methods are using different context sizes. When ollama loads a model, it does so with a particular context size, 2048 by default. If the context size changes, it's effectively a different model, so ollama will unload and reload to be able to re-allocate VRAM. This will show up in the logs as a change in the n_ctx value.
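
To illustrate the point about context size, here is a minimal sketch of a request that pins the context size explicitly, so that two clients sending the same value should ask for the same `n_ctx`. It uses Ollama's `/api/generate` endpoint and the `num_ctx` option; the helper names `build_generate_request` and `send_request` are hypothetical, and the URL assumes the `apiBase` from the config in this issue.

```python
import json
import urllib.request

# Assumes the local apiBase from the reporter's config plus the generate endpoint.
OLLAMA_GENERATE_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str, num_ctx: int = 2048) -> dict:
    """Build a generate payload that states the context size explicitly."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for a single JSON response instead of a stream
        "options": {"num_ctx": num_ctx},
    }


def send_request(payload: dict, url: str = OLLAMA_GENERATE_URL) -> dict:
    """POST the payload to a running Ollama server and return the decoded reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


payload = build_generate_request(
    "deepseek-coder-v2:16b-lite-instruct-q4_0", "Say hello.", num_ctx=2048
)
```

Calling `send_request(payload)` requires a running Ollama server. Sending a second request with a different `num_ctx` and watching the server log for a changed `n_ctx` is one way to check whether a reload is triggered.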


@ALEX000V commented on GitHub (Jul 22, 2024):

> If you can provide server logs from ollama it will be easier to diagnose.
>
> However, based on your report, I'm going to guess that your two access methods are using different context sizes. When ollama loads a model, it does so with a particular context size, 2048 by default. If the context size changes, it's effectively a different model, so ollama will unload and reload to be able to re-allocate VRAM. This will show up in the logs as a change in the n_ctx value.

Thank you very much for your reply. Your guess was correct. After setting the model to a context size of 2048 in the continue plugin, the model became shareable with other methods. Thank you for your help.
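
The comment above does not show the exact setting that was changed. As a rough sketch, the change might look like the following, assuming Continue's config.json accepts a per-model `contextLength` key (an assumption about the key name; check the Continue documentation for your version). The helper `with_context_length` is hypothetical.

```python
import copy


def with_context_length(config: dict, context_length: int = 2048) -> dict:
    """Return a copy of a Continue-style config in which every Ollama model
    entry carries an explicit contextLength (assumed key name)."""
    patched = copy.deepcopy(config)
    entries = list(patched.get("models", []))
    tab_model = patched.get("tabAutocompleteModel")
    if isinstance(tab_model, dict):
        entries.append(tab_model)
    for entry in entries:
        if entry.get("provider") == "ollama":
            entry["contextLength"] = context_length
    return patched


# Trimmed version of the config from this issue.
config = {
    "models": [
        {
            "title": "deepseek-coder-v2:16b",
            "provider": "ollama",
            "model": "deepseek-coder-v2:16b-lite-instruct-q4_0",
            "apiBase": "http://localhost:11434",
        }
    ],
    "tabAutocompleteModel": {
        "title": "deepseek-coder-v2:16b",
        "provider": "ollama",
        "model": "deepseek-coder-v2:16b-lite-instruct-q4_0",
    },
}

patched = with_context_length(config, 2048)
```

Matching Ollama's default of 2048 is what made the model shareable here. A larger value should work equally well as long as every client sends the same one.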


Reference: github-starred/ollama#3643