[GH-ISSUE #4093] Model reloads every time when using a local knowledge base #2542

Closed
opened 2026-04-12 12:51:59 -05:00 by GiteaMirror · 2 comments

Originally created by @androidsr on GitHub (May 2, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4093

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

When running a conversation, I query the local vector store and send the retrieved results together with the question to the model. The retrieved text is small, or sometimes empty, yet the model is very slow to respond. Checking GPU memory usage shows that the chat model is being reloaded each time.

OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

0.1.32

GiteaMirror added the bug label 2026-04-12 12:51:59 -05:00

@dhiltgen commented on GitHub (May 2, 2024):

You can use the `keep_alive` parameter to change how long models stay loaded. https://github.com/ollama/ollama/blob/main/docs/faq.md#how-do-i-keep-a-model-loaded-in-memory-or-make-it-unload-immediately
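For example, a request like the following keeps the model loaded indefinitely after the call (a sketch based on the linked FAQ; `llama3` is a placeholder model name, and `keep_alive` also accepts a duration string such as `"10m"`, or `0` to unload immediately):

```
curl http://localhost:11434/api/generate -d '{"model": "llama3", "keep_alive": -1}'
```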

If you're trying to load multiple models in your use-case, 0.1.33 adds concurrency support as an experimental feature which you can turn on with `OLLAMA_MAX_LOADED_MODELS`. https://github.com/ollama/ollama/releases
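Since the report is from Windows, a sketch of turning this on there, assuming a PowerShell session (the value 2 lets two models stay resident at once):

```
$env:OLLAMA_MAX_LOADED_MODELS = "2"
ollama serve
```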


@jmorganca commented on GitHub (May 9, 2024):

Yes, as mentioned by @dhiltgen, `OLLAMA_MAX_LOADED_MODELS` will default to 0 soon, meaning Ollama will support loading multiple models. This will allow you to run both a language model and an embedding model side by side.

To use this today, set `OLLAMA_MAX_LOADED_MODELS`:

```
OLLAMA_MAX_LOADED_MODELS=0 ollama serve
```
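As a sketch of the RAG flow this enables (model names are placeholders; `/api/embeddings` is the embeddings endpoint in Ollama releases of this era), both calls can then be served without one model evicting the other:

```
# embed the query for the local vector store lookup
curl http://localhost:11434/api/embeddings -d '{"model": "nomic-embed-text", "prompt": "user question"}'
# generate the answer with the chat model, which stays loaded alongside
curl http://localhost:11434/api/generate -d '{"model": "llama3", "prompt": "retrieved context + question"}'
```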
Reference: github-starred/ollama#2542