[GH-ISSUE #6420] Is the speed of a model running in Ollama related to the CUDA version? #66074

Closed
opened 2026-05-03 23:51:39 -05:00 by GiteaMirror · 3 comments

Originally created by @TianWuYuJiangHenShou on GitHub (Aug 19, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6420

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

I deployed qwen2:72B with the latest version of Ollama, but I found that the model loading speed varies greatly across different NVIDIA driver versions.

driver: 535.183.06 | CUDA version: 12.2

ollama version: 0.3.4
Time to load model: 29s

driver: 515.105.01 | CUDA version: 11.7

ollama version: 0.3.6
Time to load model: 659s

GPU: A800
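For reference, a minimal sketch of how the load step alone could be timed, independent of prompt evaluation (assuming the times above were measured end-to-end): the /api/generate response includes a load_duration field in nanoseconds.

```
# Sketch, not from the original report: unload the model, then time a fresh load.
# Sending only "model" with keep_alive:0 unloads it; load_duration in the next
# response (nanoseconds) isolates model loading from generation.
curl -s localhost:11434/api/generate -d '{"model":"qwen2:72b","keep_alive":0}' > /dev/null
curl -s localhost:11434/api/generate -d '{"model":"qwen2:72b","prompt":"hi","stream":false}' \
  | grep -o '"load_duration":[0-9]*'
```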

OS

Linux

GPU

Nvidia

CPU

No response

Ollama version

No response

GiteaMirror added the gpu, bug, nvidia, needs more info labels 2026-05-03 23:51:40 -05:00

@rick-github commented on GitHub (Aug 19, 2024):

If the model is already in the buffer cache, reloading will be a lot faster. Try this:

```
time curl localhost:11434/api/generate -d '{"model":"qwen2:72b","prompt":"hi","options":{"seed":0},"stream":false,"keep_alive":0}'
sleep 5
time curl localhost:11434/api/generate -d '{"model":"qwen2:72b","prompt":"hi","options":{"seed":0},"stream":false,"keep_alive":0}'
```

The second command will load the model from buffer cache and will give you a better measure of how long it takes the cuda driver to load the model into the GPU.
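A possible follow-up, not from the original thread: to compare a true cold load across the two driver versions, the page cache can be dropped between runs so the first timing is not skewed by cached model files.

```
# Assumption: run as root on Linux; this drops the page cache so the next
# load reads the model files from disk rather than memory.
sync
echo 3 > /proc/sys/vm/drop_caches
time curl localhost:11434/api/generate -d '{"model":"qwen2:72b","prompt":"hi","options":{"seed":0},"stream":false,"keep_alive":0}'
```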


@dhiltgen commented on GitHub (Sep 3, 2024):

I wouldn't expect the driver version alone to make that large of a difference.

Can you share your server log for both scenarios so we can see more details?
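For completeness, on a systemd-based Linux install the server log requested here can usually be captured as below (assuming the default `ollama` service name); setting OLLAMA_DEBUG=1 for the next run adds detail about GPU discovery and load timing.

```
# Assumption: systemd install with the default "ollama" service name.
journalctl -u ollama --no-pager > ollama-server.log

# Optional, more verbose logging for the next run (service restart required):
#   sudo systemctl edit ollama    # add: Environment="OLLAMA_DEBUG=1"
#   sudo systemctl restart ollama
```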


@dhiltgen commented on GitHub (Sep 26, 2024):

If you're still seeing performance problems, please share your server log and I'll reopen the issue.


Reference: github-starred/ollama#66074