[GH-ISSUE #9975] In Dify, do different model context-length parameters cause separate model instances to be loaded into VRAM? #6534

Closed
opened 2026-04-12 18:08:58 -05:00 by GiteaMirror · 2 comments
Owner

Originally created by @jaybom on GitHub (Mar 25, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9975

What is the issue?

When running QwQ on 2x3090 GPUs, I've observed three different memory sizes when executing 'ollama ps': 23GB, 40GB, and 60GB. At 23GB and 40GB, it shows 100% GPU. At 60GB, however, it shows 20% CPU and 80% GPU. On other platforms like Dify, when a model is loaded with different parameters before the old instance has been successfully unloaded, a new model instance may be created, causing the old model to be offloaded to the CPU, which can result in a 1000x slowdown in inference speed.

Relevant log output


OS

Windows

GPU

Nvidia

CPU

No response

Ollama version

0.5.7

GiteaMirror added the bug label 2026-04-12 18:08:58 -05:00
Author
Owner

@rick-github commented on GitHub (Mar 25, 2025):

> However, with 60GB, it shows 20% CPU and 80% GPU utilization

If the model weights, context cache and model graph don't fit on a GPU, some of it will be off-loaded to the CPU. So when the model requires 40G, it fits on a GPU. When the model requires 60G, 80% (60 * .8 = 48) runs on the GPU, and 20% (60 * .2 = 12) runs on the CPU.
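The arithmetic above can be reproduced with a small helper. This is only a sketch of the calculation in this comment; the function name `split_footprint` is made up for illustration and is not part of Ollama.

```python
def split_footprint(total_gb: float, gpu_fraction: float) -> tuple[float, float]:
    """Split a reported model footprint into (gpu_gb, cpu_gb).

    total_gb     -- the size column reported by `ollama ps`
    gpu_fraction -- the GPU share reported by `ollama ps`, as a value in [0, 1]
    """
    if not 0.0 <= gpu_fraction <= 1.0:
        raise ValueError("gpu_fraction must be between 0 and 1")
    gpu_gb = total_gb * gpu_fraction
    return gpu_gb, total_gb - gpu_gb


# The 60 GB case from the issue: 80% on the GPU, 20% on the CPU.
gpu_gb, cpu_gb = split_footprint(60, 0.8)
print(f"GPU: {gpu_gb:.0f} GB, CPU: {cpu_gb:.0f} GB")
```

The 40 GB case is the same call with `gpu_fraction=1.0`, which leaves nothing on the CPU side.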

Author
Owner

@NGC13009 commented on GitHub (Mar 26, 2025):

I'm guessing you mean: does calling the model with a different ctx size cause it to be restarted?

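I ran into a similar problem before. Ollama generally does not restart a model unless the context configured for the currently running instance is smaller than the context size of your request, in which case the model is reloaded.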

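You need to make sure that the context length (labelled "context length", "ctx size", or something similar) requested by every external front end (for example Dify, or tools like Cherry Studio, Chatbox, Page Assist, Continue) is less than or equal to the context length specified in the model's Modelfile. Alternatively, let a front end with the larger context setting call and load the model first, and only then use the clients that request a smaller context. Otherwise a reload is triggered. Ollama's default ctx size is 2048, so the model's context size has to be set by rewriting the Modelfile. If your front end needs more context, Ollama will restart an instance with a longer context. My guess is that this is due to KV cache preallocation.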
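As an illustration of keeping the requested context consistent across clients, here is a minimal Python sketch that sends the same `num_ctx` with every request through the `options` field of Ollama's `/api/generate` endpoint. The constant `SHARED_NUM_CTX`, its value, and the helper names are assumptions made for this example; whether a given front end such as Dify exposes this setting depends on that front end.

```python
import json
import urllib.request

# Default local Ollama endpoint; adjust if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"

# Hypothetical shared value: pick one context size and use it in every client.
SHARED_NUM_CTX = 8192


def build_generate_request(model: str, prompt: str, num_ctx: int = SHARED_NUM_CTX) -> dict:
    """Build a non-streaming /api/generate payload with an explicit context size."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }


def generate(model: str, prompt: str, num_ctx: int = SHARED_NUM_CTX) -> str:
    """Send the request to a running Ollama server and return the generated text."""
    body = json.dumps(build_generate_request(model, prompt, num_ctx)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With a server running, `generate("qwq", "Hello")` would use the shared context size. The Modelfile route mentioned above sets the same parameter on the model itself with a `PARAMETER num_ctx` line.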

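As for part of the model running on the CPU: when the context size combined with the number of parallel requests is too large, so that the required VRAM exceeds what the GPU has (probably because of KV cache preallocation; with a large context, VRAM is mostly taken up by the context rather than by the model parameters themselves), Ollama stops the current instance and relaunches it, automatically placing part of the model in system RAM and then running inference on the GPU and CPU together. This is a feature that keeps Ollama from running out of VRAM and failing to run inference at all, not a bug.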
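To get a feel for how quickly the context can dominate VRAM, here is a rough back-of-the-envelope estimate of KV cache size. It is a sketch under stated assumptions, not Ollama's actual allocation logic: it assumes a standard transformer KV cache with one key and one value tensor per layer, 2 bytes per element (f16), and a cache sized for `num_ctx` tokens per parallel slot. The architecture numbers in the example (64 layers, 8 KV heads, head dimension 128) are illustrative values for a 32B-class model with grouped-query attention; check your model's metadata for the real ones.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    num_ctx: int,
    num_parallel: int = 1,
    bytes_per_elem: int = 2,
) -> int:
    """Estimate KV cache size in bytes.

    The leading factor of 2 accounts for one K and one V tensor per layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * num_ctx * num_parallel * bytes_per_elem


# Illustrative (assumed) architecture values, not read from any real model file.
for ctx in (2048, 8192, 32768):
    size_gib = kv_cache_bytes(64, 8, 128, ctx) / 2**30
    print(f"num_ctx={ctx:>6}: ~{size_gib:.1f} GiB KV cache per parallel slot")
```

Under these assumptions the cache grows linearly with both the context length and the number of parallel slots, which is consistent with the same model showing very different sizes in `ollama ps`.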

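If only a small part needs to be loaded onto the CPU and you still want the GPU to do all of the inference (using shared GPU memory; this works on Windows, I don't know about Linux), see #8509 (https://github.com/ollama/ollama/issues/8509#issuecomment-2604552357). That keeps the CPU out of the inference computation, although the CPU still has to move data between system RAM and VRAM, so its usage won't drop to zero. Also, if the portion loaded onto the CPU is large, GPU/CPU hybrid inference should be considerably faster than a single GPU with shared memory.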

Reference: github-starred/ollama#6534