[GH-ISSUE #10617] V100s not Being Used #6985

Closed
opened 2026-04-12 18:52:54 -05:00 by GiteaMirror · 2 comments

Originally created by @sempervictus on GitHub (May 8, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10617

What is the issue?

When running any model on the latest release, including a small llama3.2, atop a 4x32G V100 SXM2 setup, ollama always "elects" to use the CPU regardless of the OLLAMA_MAX_VRAM setting, OLLAMA_SCHED_SPREAD, CUDA_VISIBLE_DEVICES, etc. It detects the 4 available devices and sees their vRAM capacity during scheduling, but it does not offload any layers and appears to grossly over-estimate the vRAM required for a small model.

Relevant docker params currently in use (iterating through toggling/adjusting them seems to do nothing):

  -e "OLLAMA_KEEP_ALIVE=30m" \
  -e "CUDA_VISIBLE_DEVICES=0,1,2,3" \
  -e "OLLAMA_NUM_PARALLEL=4" \
  -e "OLLAMA_SCHED_SPREAD=1" \
  -e "OLLAMA_CONTEXT_LENGTH=32768" \
  -e "OLLAMA_MAX_LOADED_MODELS=8" \
  -e "OLLAMA_MAX_VRAM=137438953472" \
  -e "OLLAMA_NEW_ENGINE=1" \
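
For completeness, a rough sketch of the full docker run these flags slot into (illustrative container name, volume, and port; assumes the official ollama/ollama image and the NVIDIA container toolkit for --gpus all), with the remaining -e flags above appended in the same way:

  docker run -d --gpus all --name ollama \
    -p 11434:11434 \
    -v ollama:/root/.ollama \
    -e "CUDA_VISIBLE_DEVICES=0,1,2,3" \
    -e "OLLAMA_SCHED_SPREAD=1" \
    ollama/ollama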

Relevant log output

A llama3.2 load:


time=2025-05-08T05:43:13.173Z level=INFO source=server.go:138 msg=offload library=cuda layers.requested=4 layers.model=65 layers.offload=0 layers.split="" memory.available="[31.1 GiB 31.4 GiB 31.4 GiB 31.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="82.1 GiB" memory.required.partial="0 B" memory.required.kv="16.0 GiB" memory.required.allocations="[0 B 0 B 0 B 0 B]" memory.weights.total="62.5 GiB" memory.weights.repeating="60.1 GiB" memory.weights.nonrepeating="2.4 GiB" memory.graph.full="32.0 GiB" memory.graph.partial="32.0 GiB"
time=2025-05-08T05:43:13.828Z level=INFO source=server.go:409 msg="starting llama server" cmd="/usr/local/bin/ollama runner --model /root/.ollama/models/blobs/sha256-ffd0081a97182da52ef3c58dcafde851cbd436ce82f71fc5ed9973828bf78a8f --ctx-size 65536 --batch-size 4096 --n-gpu-layers 4 --threads 8 --mlock --parallel 4 --port 41651"

This machine has a total of 128G of vRAM, and with OLLAMA_SCHED_SPREAD one would presume it could be utilized, at least partially, for model inference.
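
Reading the offload log line above against that expectation (numbers copied straight from the log; how ollama combines them internally is not shown there, so this is only a surface-level check):

  # memory.available per device:   31.1 - 31.4 GiB
  # memory.available, 4 devices:  ~125.3 GiB  (> memory.required.full = 82.1 GiB)
  # memory.graph.full / .partial:   32.0 GiB  (> 31.4 GiB free on any single V100)

So the weights would seem to fit when spread across the cards, but the estimated compute graph alone is larger than any individual device, which lines up with layers.offload=0.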

While the llama3.2 task is running, gpustat shows no use other than comfyui in the background:

[0] Tesla V100-SXM2-32GB | 41'C,  ?? %,   0 %,   61 / 300 W |   583 / 32768 MB | root:python3/3964814(306M)
[1] Tesla V100-SXM2-32GB | 35'C,  ?? %,   0 %,   44 / 300 W |   277 / 32768 MB |
[2] Tesla V100-SXM2-32GB | 34'C,  ?? %,   0 %,   44 / 300 W |   277 / 32768 MB |
[3] Tesla V100-SXM2-32GB | 36'C,  ?? %,   0 %,   43 / 300 W |   277 / 32768 MB |

GPUs are detected:

time=2025-05-08T05:42:16.776Z level=INFO source=types.go:130 msg="inference compute" id=GPU-UUID-1 library=cuda variant=v12 compute=7.0 driver=12.8 name="Tesla V100-SXM2-32GB" total="31.7 GiB" available="31.4 GiB"
time=2025-05-08T05:42:16.776Z level=INFO source=types.go:130 msg="inference compute" id=GPU-UUID-2 library=cuda variant=v12 compute=7.0 driver=12.8 name="Tesla V100-SXM2-32GB" total="31.7 GiB" available="31.4 GiB"
time=2025-05-08T05:42:16.776Z level=INFO source=types.go:130 msg="inference compute" id=GPU-UUID-3 library=cuda variant=v12 compute=7.0 driver=12.8 name="Tesla V100-SXM2-32GB" total="31.7 GiB" available="31.4 GiB"
time=2025-05-08T05:42:16.776Z level=INFO source=types.go:130 msg="inference compute" id=GPU-UUID-4 library=cuda variant=v12 compute=7.0 driver=12.8 name="Tesla V100-SXM2-32GB" total="31.7 GiB" available="31.4 GiB"

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.6.7

GiteaMirror added the bug label 2026-04-12 18:52:54 -05:00

@rick-github commented on GitHub (May 8, 2025):

OLLAMA_MAX_VRAM is not an ollama configuration variable.

time=2025-05-08T05:43:13.173Z level=INFO source=server.go:138 msg=offload library=cuda layers.requested=4

The client or the Modelfile has set num_gpu to 4. This is the count of layers, not the count of devices. Don't set this; it is calculated by ollama.
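
A quick way to check whether num_gpu is being pinned in the Modelfile rather than sent by the client (hypothetical model name; adjust to whatever is actually being loaded):

  # Dump the Modelfile and look for a pinned num_gpu parameter
  ollama show llama3.2 --modelfile | grep -i num_gpu

If nothing shows up there, the value is coming from the client (e.g. a per-model "GPU layers" override in the front end) and is best left unset.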

Your config variables and log output don't match: you have set OLLAMA_CONTEXT_LENGTH=32768 and OLLAMA_NUM_PARALLEL=4, which normally means that --ctx-size should be 131072, yet the log line shows 65536.

You indicate that you are loading llama3.2 but the sha256 is from command-a:111b-03-2025-q4_K_M.
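
One way to confirm which model that blob actually belongs to (assuming the default models path inside the container; a blob's digest appears in the owning model's manifest, and the manifest's file path names the model and tag):

  # Find the manifest(s) referencing the blob; the matching path identifies the model
  grep -rl ffd0081a97182da52ef3c58dcafde851cbd436ce82f71fc5ed9973828bf78a8f /root/.ollama/models/manifests/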

num_batch has been increased to 4096. I suspect that this, combined with the large context and parallelism, results in ollama not being able to fit a contiguous data structure on any device, so it falls back to CPU.

A full log may aid in debugging.
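
For the docker setup above, one way to capture that (assuming the container is named ollama; OLLAMA_DEBUG=1 raises the server's log verbosity):

  # Re-create the container with -e "OLLAMA_DEBUG=1" added to the existing flags,
  # reproduce the failed load, then dump the complete server log to a file:
  docker logs ollama > ollama-full.log 2>&1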


@sempervictus commented on GitHub (May 8, 2025):

Thanks - I saw the command-a thing in the hash lookup as well and am trying to figure out why openwebui is loading that when I'm clearly asking other models (lots of them) to do things.

The VRAM thing is a real param (https://github.com/ollama/ollama/blob/fa9973cd7f51a662226490a853b38c2cfa602a80/envconfig/config.go#L213) - pulled that from the sources while trying to figure this out.

Iterating through the various combinations, I did finally get it to run parallel work:

[0] Tesla V100-SXM2-32GB | 42'C,  ?? %,   0 %,   61 / 300 W | 21339 / 32768 MB | root:python3/32402(306M) root:ollama/155396(10362M) root:ollama/159476(5464M) root:ollama/159591(4928M)
[1] Tesla V100-SXM2-32GB | 39'C,  ?? %,   0 %,   58 / 300 W | 20345 / 32768 MB | root:ollama/155396(10084M) root:ollama/159476(4430M) root:ollama/159591(5552M)
[2] Tesla V100-SXM2-32GB | 37'C,  ?? %,   0 %,   58 / 300 W | 15115 / 32768 MB | root:ollama/155396(9632M) root:ollama/159476(4898M) root:ollama/159591(306M)
[3] Tesla V100-SXM2-32GB | 41'C,  ?? %,   0 %,   58 / 300 W | 21109 / 32768 MB | root:ollama/155396(9632M) root:ollama/159476(4434M) root:ollama/159591(6764M)

using

  -e "CUDA_VISIBLE_DEVICES=0,1,2,3" \
  -e "OLLAMA_SCHED_SPREAD=1" \
  -e "OLLAMA_CONTEXT_LENGTH=32768" \
  -e "OLLAMA_MAX_LOADED_MODELS=8" \
  -e "OLLAMA_MAX_VRAM=137438953472" \

I will peel back the last three options on the next restart, as my guess at this point is that it's the visible devices + sched spread which allow those to load.
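
i.e., roughly this reduced set on the next restart, keeping only the two flags suspected to matter:

  -e "CUDA_VISIBLE_DEVICES=0,1,2,3" \
  -e "OLLAMA_SCHED_SPREAD=1" \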

Next I need to figure out why openwebui stalls for ages and fails to return results, even though it loads models and starts processing before faceplanting.
