[GH-ISSUE #4198] Improving the efficiency of using multiple GPU cards. #49123

Closed
opened 2026-04-28 10:46:20 -05:00 by GiteaMirror · 9 comments

Originally created by @zhqfdn on GitHub (May 6, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4198

Originally assigned to: @dhiltgen on GitHub.

Before v0.1.32, a loaded model was distributed evenly across all GPU cards, which made better use of the GPUs. In v0.1.32 and v0.1.33, loading a model automatically uses only one card.

With multiple simultaneous users this is slower. Distributing the model evenly across multiple GPU cards would improve GPU utilization and overall efficiency.
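For reproducing the multi-user case, a rough load-test sketch along these lines can be used (the model name and prompt are placeholders; assumes the `requests` package and the server address from the unit file below):

```python
# Rough concurrency probe: fire several /api/generate requests at once and
# report per-request latency. Assumes Ollama is listening on localhost:11434
# and that the model named below is already pulled.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3"          # placeholder model name
N_CLIENTS = 4             # simulated simultaneous users

def one_request(i: int) -> float:
    start = time.time()
    resp = requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": f"Say hello, client {i}", "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    return time.time() - start

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=N_CLIENTS) as pool:
        latencies = list(pool.map(one_request, range(N_CLIENTS)))
    for i, secs in enumerate(latencies):
        print(f"client {i}: {secs:.1f}s")
```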

Tesla T4 GPU list

localhost.localdomain Mon May 6 18:41:30 2024 550.54.15
[0] Tesla T4 | 54°C, 93 % | 12238 / 15360 MB | ollama(12236M)
[1] Tesla T4 | 36°C, 0 % | 2 / 15360 MB |
[2] Tesla T4 | 30°C, 0 % | 2 / 15360 MB |
[3] Tesla T4 | 33°C, 0 % | 2 / 15360 MB |

ollama.service

[Unit]
Description=Ollama Service
After=network-online.target

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_ORIGINS='*'"
Environment="OLLAMA_MODELS=/ollama/ollama/models"
Environment="OLLAMA_KEEP_ALIVE=10m"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="CUDA_VISIBLE_DEVICES=0,1,2,3"
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/root/.local/bin:/root/bin:/usr/lib64/ccache:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin"

[Install]
WantedBy=default.target
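To watch where a load actually lands across the four cards while a request runs, a small polling loop over nvidia-smi is enough (a sketch; it only assumes `nvidia-smi` is on the PATH):

```python
# Poll per-GPU memory use once a second so the placement of a model load
# is visible over time. Uses nvidia-smi's machine-readable CSV output.
import subprocess
import time

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total,utilization.gpu",
    "--format=csv,noheader,nounits",
]

while True:
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    rows = [line.split(", ") for line in out.strip().splitlines()]
    summary = " | ".join(
        f"GPU{idx}: {used}/{total} MiB, {util}%" for idx, used, total, util in rows
    )
    print(summary)
    time.sleep(1)
```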

GiteaMirror added the feature request label 2026-04-28 10:46:20 -05:00

@zhqfdn commented on GitHub (May 6, 2024):

Environment="OLLAMA_NUM_PARALLEL=5"
Environment="OLLAMA_MAX_LOADED_MODELS=2"

Three people used llama 3 simultaneously, and it was very fast.
A fourth person loaded codegemma, which was very slow.

localhost.localdomain Mon May 6 19:12:05 2024 550.54.15
[0] Tesla T4 | 57°C, 94 % | 13790 / 15360 MB | ollama(13788M) ---> llama 3
[1] Tesla T4 | 52°C, 9 % | 13828 / 15360 MB | ollama(13806M) ---> codegemma
[2] Tesla T4 | 31°C, 0 % | 2 / 15360 MB |
[3] Tesla T4 | 34°C, 0 % | 2 / 15360 MB |
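For a cross-check on which models are resident and how much VRAM each holds, newer Ollama releases (later than the versions discussed in this thread) expose a `GET /api/ps` endpoint; a minimal query might look like:

```python
# List currently loaded models and their VRAM footprint via /api/ps.
# Note: /api/ps only exists in Ollama releases newer than the 0.1.3x builds
# discussed here; field names below follow the current API docs.
import requests

resp = requests.get("http://localhost:11434/api/ps", timeout=10)
resp.raise_for_status()
for m in resp.json().get("models", []):
    vram_gib = m.get("size_vram", 0) / 2**30
    print(f"{m['name']}: {vram_gib:.1f} GiB in VRAM, expires {m.get('expires_at')}")
```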


@kungfu-eric commented on GitHub (May 6, 2024):

Observing this too: https://github.com/ollama/ollama/issues/4212. Horrible performance regression.


@gaborkukucska commented on GitHub (May 7, 2024):

Additionally, it would be awesome if it could also load across networked GPUs like Petals does.

This would allow communities with older GPUs to combine their VRAM, especially where multiple computers share a local network, e.g. schools.


@kungfu-eric commented on GitHub (May 7, 2024):

Confirmed that downgrading to 0.1.31 resolves this issue for me, as per @zhqfdn's suggestion.

It must be the new GPU detection added post 0.1.31. Should it revert to the cudart GPU detection? Here's the comparison in the logs:

0.1.31

time=2024-05-07T06:54:10.510-07:00 level=INFO source=images.go:804 msg="total blobs: 62"
time=2024-05-07T06:54:10.511-07:00 level=INFO source=images.go:811 msg="total unused blobs removed: 0"
time=2024-05-07T06:54:10.512-07:00 level=INFO source=routes.go:1118 msg="Listening on [::]:7200 (version 0.1.31)"
time=2024-05-07T06:54:10.521-07:00 level=INFO source=payload_common.go:113 msg="Extracting dynamic libraries to /tmp/ollama2237025292/runners ..."
time=2024-05-07T06:54:13.192-07:00 level=INFO source=payload_common.go:140 msg="Dynamic LLM libraries [cuda_v11 rocm_v60000 cpu_avx cpu_avx2 cpu]"
time=2024-05-07T06:54:13.192-07:00 level=INFO source=gpu.go:115 msg="Detecting GPU type"
time=2024-05-07T06:54:13.192-07:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library libcudart.so*"
time=2024-05-07T06:54:13.193-07:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [/tmp/ollama2237025292/runners/cuda_v11/libcudart.so.11.0 /usr/local/cuda/lib64/libcudart.so.11.7.60]"
time=2024-05-07T06:54:14.328-07:00 level=INFO source=gpu.go:120 msg="Nvidia GPU detected via cudart"
time=2024-05-07T06:54:14.328-07:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-05-07T06:54:14.969-07:00 level=INFO source=gpu.go:188 msg="[cudart] CUDART CUDA Compute Capability detected: 8.6"

0.1.34rc1

2024/05/07 06:38:53 routes.go:989: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:*] OLLAMA_RUNNERS_DIR: OLLAMA_TMPDIR:]"
time=2024-05-07T06:38:53.154-07:00 level=INFO source=images.go:897 msg="total blobs: 62"
time=2024-05-07T06:38:53.154-07:00 level=INFO source=images.go:904 msg="total unused blobs removed: 0"
time=2024-05-07T06:38:53.155-07:00 level=INFO source=routes.go:1034 msg="Listening on 127.0.0.1:11434 (version 0.1.34-rc1)"
time=2024-05-07T06:38:53.163-07:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama1226851952/runners
time=2024-05-07T06:38:55.816-07:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu_avx2 cuda_v11 rocm_v60002 cpu cpu_avx]"
time=2024-05-07T06:38:55.816-07:00 level=INFO source=gpu.go:122 msg="Detecting GPUs"
time=2024-05-07T06:38:56.036-07:00 level=INFO source=gpu.go:127 msg="detected GPUs" count=3 library=/usr/lib/x86_64-linux-gnu/libcuda.so.515.43.04
time=2024-05-07T06:38:56.036-07:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"

I tested 0.1.33 GA and it had the same issue, as per https://github.com/ollama/ollama/issues/4212.
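Independent of Ollama's discovery code, the device count the CUDA runtime itself reports can be checked directly; a small ctypes sketch, assuming a loadable libcudart.so:

```python
# Ask the CUDA runtime how many devices it can see, bypassing Ollama's
# own GPU discovery. Requires a libcudart.so the dynamic loader can find
# (you may need a full path such as one under /usr/local/cuda/lib64).
import ctypes

cudart = ctypes.CDLL("libcudart.so")
count = ctypes.c_int(0)
err = cudart.cudaGetDeviceCount(ctypes.byref(count))
print(f"cudaGetDeviceCount -> err={err}, devices={count.value}")
```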


@dhiltgen commented on GitHub (May 8, 2024):

> Before v0.1.32, a loaded model was distributed evenly across all GPU cards, which made better use of the GPUs. In v0.1.32 and v0.1.33, loading a model automatically uses only one card.

@zhqfdn this was an intentional design change. Based on our performance testing, if a model can fit in one GPU, we saw better performance loading it onto one card instead of unnecessarily splitting it across several.

I'll put up a PR to support tuning this behavior, but at least based on my initial testing, it doesn't yield a performance benefit. If you can test and validate that you see a performance benefit, that would help justify merging it.
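The behavior described here roughly amounts to "fit on one GPU if possible, otherwise spread"; a simplified illustration of that decision (not Ollama's actual scheduler code) might look like:

```python
# Toy placement decision: put the whole model on a single GPU when it fits,
# otherwise split it across all GPUs in proportion to free VRAM.
# Illustration of the heuristic described above, not Ollama code.
from typing import Dict

def place_model(model_vram_mb: int, free_vram_mb: Dict[int, int]) -> Dict[int, int]:
    """Return a mapping of GPU index -> MB of the model assigned to it."""
    # Prefer a single card that can hold the entire model.
    for gpu, free in sorted(free_vram_mb.items(), key=lambda kv: kv[1], reverse=True):
        if free >= model_vram_mb:
            return {gpu: model_vram_mb}
    # Otherwise spread proportionally to free memory (rounding ignored).
    total_free = sum(free_vram_mb.values())
    if total_free < model_vram_mb:
        raise MemoryError("model does not fit in aggregate GPU memory")
    return {
        gpu: model_vram_mb * free // total_free
        for gpu, free in free_vram_mb.items()
    }

# Example with four ~15 GB T4s, one already mostly occupied:
print(place_model(12000, {0: 2000, 1: 15000, 2: 15000, 3: 15000}))
```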


@zhqfdn commented on GitHub (May 27, 2024):

It is meaningful when OLLAMA_NUM_PARALLEL=N is set.
Multiple cards are better suited than a single card to a configuration such as OLLAMA_NUM_PARALLEL=100.
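One concrete reason parallelism changes the trade-off: each parallel slot needs its own KV cache, so VRAM requirements grow with OLLAMA_NUM_PARALLEL. A back-of-the-envelope estimate (a sketch assuming a llama-3-8B-style model, FP16 KV cache, and a 2048-token context per slot, which may not match the exact builds in this thread):

```python
# Rough KV-cache memory per parallel slot for a GQA transformer:
#   bytes = 2 (K and V) * layers * ctx_len * kv_heads * head_dim * bytes_per_elem
# Values below are assumptions for a llama-3-8B-style model with an FP16 cache.
LAYERS = 32
KV_HEADS = 8
HEAD_DIM = 128
CTX_LEN = 2048
BYTES_PER_ELEM = 2  # FP16

per_slot = 2 * LAYERS * CTX_LEN * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM
for parallel in (1, 4, 16, 100):
    gib = per_slot * parallel / 2**30
    print(f"OLLAMA_NUM_PARALLEL={parallel:>3}: ~{gib:.1f} GiB of KV cache")
```

With these assumptions, 100 parallel slots need on the order of 25 GiB of KV cache alone, which is already more than a single 16 GB T4 can hold.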


@dmatora commented on GitHub (Sep 9, 2024):

Experiencing a similar issue.
How can I downgrade Ollama?


@dhiltgen commented on GitHub (Sep 9, 2024):

@dmatora sorry to hear you're having trouble. If you're running on Linux, instructions are here: https://github.com/ollama/ollama/blob/main/docs/linux.md#installing-specific-versions. If you're running on Mac or Windows, you can download the installers for older versions from the releases page on GitHub. If you're using a container, we tag every image with the release version: https://hub.docker.com/r/ollama/ollama/tags

Please make sure there's an issue tracking your problem so we can investigate.


@dmatora commented on GitHub (Sep 9, 2024):

@dhiltgen I felt like creating one more issue on the same subject would be wrong, so I posted details at #4517.
But I guess downgrading won't help; testing vLLM and exllama instead.
