Model size and GPU memory usage do not match #6967

Closed
opened 2025-11-12 13:51:25 -06:00 by GiteaMirror · 12 comments

Originally created by @Dudu0831 on GitHub (May 6, 2025).

What is the issue?

This results in the GPU appearing to still have several GB of memory free, yet ollama falls back to CPU inference.
![Image](https://github.com/user-attachments/assets/711ed9da-d14a-4f1b-a658-703f6dd36867)

Relevant log output


OS

No response

GPU

No response

CPU

No response

Ollama version

No response

GiteaMirror added the bug label 2025-11-12 13:51:25 -06:00

@rick-github commented on GitHub (May 6, 2025):

Ollama's memory estimation sometimes over-estimates, causing unnecessary layer spilling. You can override it by specifying the number of layers in [`num_gpu`](https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650). Note that, depending on OS/drivers, this can cause OOMs or [performance issues](https://github.com/ollama/ollama/issues/7584#issuecomment-2466715900).

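(Editor's note: for readers looking for a concrete way to apply this override, below is a minimal sketch using the ollama Python client, the same client the reporter uses later in this thread. The model name and layer count are placeholders; the only point is passing `num_gpu` through `options`.)

```python
from ollama import chat

# Override ollama's layer estimate by passing num_gpu explicitly.
# 49 is only an example value; use your model's actual layer count
# (or higher) to request a full offload to the GPU.
response = chat(
    model="qwen3:30b",
    messages=[{"role": "user", "content": "hello"}],
    options={"num_gpu": 49},
)
print(response["message"]["content"])
```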

@Dudu0831 commented on GitHub (May 7, 2025):

@rick-github Thanks for your answer, but I'm not sure how to set this parameter. I configured:

```java
.chatOptions(OllamaOptions.builder()
        .model("qwen3:30b")
        .numCtx(8000)
        .numGPU(1)
        .topK(20)
        .minP(0.0)
        .temperature(0.6)
        .topP(0.95)
        .build())
```

but the result I get is:

```
NAME         ID              SIZE     PROCESSOR          UNTIL
qwen3:30b    2ee832bc15b5    27 GB    80%/20% CPU/GPU    4 minutes from now
```

How should I set this so that everything runs on the GPU?


@rick-github commented on GitHub (May 7, 2025):

I'm not familiar with the framework you are using; try `.numGPU(49)`.


@Dudu0831 commented on GitHub (May 8, 2025):

@rick-github Got it, thank you.


@Dudu0831 commented on GitHub (May 8, 2025):

@rick-github Hi there, I'm not sure whether it's a problem with my code, but I'm still running into the same issue.

```
5月 08 15:42:46 dell-Precision-3680 ollama[1252840]: time=2025-05-08T15:42:46.964+08:00 level=INFO source=server.go:139 msg=offload library=cuda layers.requested=65 layers.model=65 layers.offload=63 layers.split="" memory.available="[23.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="23.8 GiB" memory.required.partial="22.9 GiB" memory.required.kv="2.0 GiB" memory.required.allocations="[22.9 GiB]" memory.weights.total="18.4 GiB" memory.weights.repeating="17.8 GiB" memory.weights.nonrepeating="608.6 MiB" memory.graph.full="2.7 GiB" memory.graph.partial="2.7 GiB"
5月 08 15:42:46 dell-Precision-3680 ollama[1252840]: llama_model_loader: loaded meta data with 27 key-value pairs and 707 tensors from /data/.ollama/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312 (version GGUF V3 (latest))
```

My code:

```python
response = chat(
    model=self.model_name,
    messages=[
        # System prompt (translated): "Do not answer the user's question;
        # just pick the single most suitable tool based on it."
        {"role": "system", "content": "你不要回答用户问题,你只根据用户的问题选择一个最合适的工具即可"},
        {"role": "user", "content": full_question}
    ],
    stream=False,
    options={"top_p": 0.95, "temperature": 0.6, "top_k": 20, "min_p": 0, "num_gpu": 65, "num_ctx": 2048},
    # , "num_ctx": 4000
    tools=_get_tool()
)
```

I set 65 layers to be loaded onto the GPU, so why do two layers still end up on the CPU?
Below is my GPU memory usage:

![Image](https://github.com/user-attachments/assets/96b6b5c3-ae70-40d3-b96f-52380f8db44f)


@rick-github commented on GitHub (May 8, 2025):

```
layers.requested=65 layers.model=65 layers.offload=63
```

The model has 65 layers. Ollama estimated it could offload 63. You requested that it offload 65. Further down in the logs you will see a line with `msg="starting llama server"` and `--n-gpu-layers`. The number after that flag should be 65, indicating that the runner will try to load all 65 layers into VRAM. If it's not, add logs.

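(Editor's note: on a systemd install like the one in these logs, the value actually passed to the runner can be pulled straight out of the journal. A small sketch, assuming the service is named `ollama`:)

```python
import re
import subprocess

# Read recent ollama service logs and report the --n-gpu-layers value passed
# on the most recent "starting llama server" line.
out = subprocess.run(
    ["journalctl", "-u", "ollama", "--no-pager", "-n", "2000"],
    capture_output=True, text=True, check=True,
).stdout

matches = re.findall(r'starting llama server.*?--n-gpu-layers (\d+)', out)
if matches:
    print("runner started with --n-gpu-layers =", matches[-1])
else:
    print("no 'starting llama server' line found in the recent journal")
```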

@Dudu0831 commented on GitHub (May 9, 2025):

```
5月 09 09:41:15 dell-Precision-3680 ollama[1252840]: time=2025-05-09T09:41:15.834+08:00 level=INFO source=server.go:410 msg="starting llama server" cmd="/usr/local/bin/ollama runner --model /data/.ollama/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312 --ctx-size 8192 --batch-size 512 --n-gpu-layers 65 --threads 8 --parallel 4 --port 43219"
```

Yes, I do see this line. What does it mean?


@rick-github commented on GitHub (May 9, 2025):

It means that the runner will load 65 layers into VRAM.


@Dudu0831 commented on GitHub (May 9, 2025):

I see, so does this mean all 65 layers are loaded into VRAM? But then why does `ollama ps` still show the CPU being involved in inference?


@rick-github commented on GitHub (May 9, 2025):

The CPU is the controller; it is busy sending instructions to the GPU during inference. The output of `ollama ps` is incorrect because the layer count was overridden: it shows the original estimate, not the result of setting `num_gpu`.

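(Editor's note: to confirm the layers really are resident in GPU memory, checking the driver directly is more reliable here than `ollama ps`. A minimal sketch, assuming an NVIDIA GPU with `nvidia-smi` on the PATH; the logs above show `library=cuda`, so that assumption seems to hold:)

```python
import subprocess

# Ask the NVIDIA driver for actual GPU memory usage, independent of
# ollama's own (estimate-based) reporting.
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=memory.used,memory.total",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout
print(out.strip())  # e.g. "22950 MiB, 24576 MiB"
```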

@sunhy0316 commented on GitHub (May 9, 2025):

> The CPU is the controller; it is busy sending instructions to the GPU during inference. The output of `ollama ps` is incorrect because the layer count was overridden: it shows the original estimate, not the result of setting `num_gpu`.

Would it be possible for `ollama ps` to reflect the actual GPU layer allocation post-override rather than the initial estimate?


@rick-github commented on GitHub (May 12, 2025):

> Would it be possible for `ollama ps` to reflect the actual GPU layer allocation post-override rather than the initial estimate?

#7597


Reference: github-starred/ollama-ollama#6967