[GH-ISSUE #3493] [WIN11] Ollama extremely slow with Command-r 35b and 3 RTX 4090 #48662

Closed
opened 2026-04-28 09:03:05 -05:00 by GiteaMirror · 3 comments
Owner

Originally created by @GlobalAIVision on GitHub (Apr 4, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3493

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

Issue: Ollama is really slow (2.70 tokens per second) even though I have 3 RTX 4090s and an i9-14900K CPU.

What did you expect to see?

I expected to see faster token generation for a 35b model on 3 RTX 4090s.

Steps to reproduce

I have CUDA Toolkit 12.4. On startup, I see this:
time=2024-04-04T17:58:24.004+02:00 level=INFO source=payload_common.go:140 msg="Dynamic LLM libraries [cpu_avx rocm_v5.7 cpu_avx2 cpu cuda_v11.3]"

All 41/41 layers are entirely on the GPU.
When generating, I get this:
{"function":"print_timings","level":"INFO","line":286,"msg":"generation eval time = 28903.17 ms / 78 runs ( 370.55 ms per token, 2.70 tokens per second)","n_decoded":78,"n_tokens_second":2.6986661116179373,"slot_id":0,"t_token":370.5534358974359,"t_token_generation":28903.168,"task_id":148,"tid":"11876","timestamp":1712246421}
This is really slow. I'm also using OpenWebUI for generation.
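
As a quick sanity check outside Open WebUI, the same numbers can be reproduced against the Ollama HTTP API directly. A minimal sketch, assuming the default endpoint at http://localhost:11434 and a model tag like `command-r:35b` (adjust to whatever `ollama list` reports):

```python
# Minimal sketch: time generation directly against the Ollama API, bypassing Open WebUI.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "command-r:35b",   # assumed tag; use the name from `ollama list`
        "prompt": "Write one paragraph about GPUs.",
        "stream": False,
    },
    timeout=600,
)
data = resp.json()

# eval_count = generated tokens, eval_duration = generation time in nanoseconds.
tokens = data["eval_count"]
seconds = data["eval_duration"] / 1e9
print(f"{tokens} tokens in {seconds:.2f} s -> {tokens / seconds:.2f} tokens/s")
```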

Could the problem be the difference in CUDA versions? (I have 12.4, and Ollama says it loads the DLL for 11.3.)
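
Whatever the answer on CUDA versions, it is worth confirming that the cards are actually doing the work, since ~2.7 tokens/s on a 35b model is roughly what a CPU fallback looks like. A rough sketch using the nvidia-ml-py (pynvml) bindings, run while a generation is in progress; note that the CUDA version reported here is the one supported by the driver, not the installed toolkit:

```python
# Sketch: check per-GPU memory and utilization while Ollama is generating.
# Requires `pip install nvidia-ml-py`.
import pynvml

pynvml.nvmlInit()
print("Driver version:", pynvml.nvmlSystemGetDriverVersion())
# CUDA runtime version supported by the driver, e.g. 12040 for 12.4.
print("Driver CUDA version:", pynvml.nvmlSystemGetCudaDriverVersion())

for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    print(f"GPU {i}: {mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB used, "
          f"{util.gpu}% utilization")

pynvml.nvmlShutdown()
```

If the model is really split across the three 4090s, each card should show a large chunk of memory in use and non-trivial utilization during generation.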

Are there any recent changes that introduced the issue?

No response

OS

Windows

Architecture

amd64

Platform

No response

Ollama version

0.1.30

GPU

Intel

GPU info

3 X RTX 4090

CPU

Intel

Other software

No response

GiteaMirror added the bug, windows labels 2026-04-28 09:03:41 -05:00
Author
Owner

@GlobalAIVision commented on GitHub (Apr 5, 2024):

UPDATE: Using CUDA 11.3, the problem was solved.
Now I have a CUDA memory problem after the third answer (maybe the context is too big?).
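
If a growing context is the culprit, capping `num_ctx` per request should stop the memory growth. A small sketch against the same `/api/generate` endpoint; the 2048 value and model tag are placeholders:

```python
# Sketch: cap the context window per request to test the "context too big" theory.
# num_ctx is a standard Ollama generation option; 2048 is just an example value.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "command-r:35b",          # assumed tag
        "prompt": "Summarize the previous answer.",
        "options": {"num_ctx": 2048},
        "stream": False,
    },
    timeout=600,
)
print(resp.json()["response"])
```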

Author
Owner

@dhiltgen commented on GitHub (Jun 1, 2024):

@GlobalAIVision can you clarify? Are you getting an OOM crash? Can you share the server log from around the time of the crash? (Please make sure to upgrade to the latest version as we've continued to improve our memory prediction logic which may resolve your OOM crash)

Author
Owner

@dhiltgen commented on GitHub (Jun 22, 2024):

If you're still seeing OOM crashes, please make sure to upgrade to 0.1.45 which has fixes to help our prediction, so this should be resolved. If it still hits an OOM, please share an updated log and I'll re-open the issue.
