[GH-ISSUE #4497] Ollama 0.1.38 has high video memory usage and runs very slowly. #2816

Closed
opened 2026-04-12 13:08:50 -05:00 by GiteaMirror · 4 comments

Originally created by @chenwei0930 on GitHub (May 17, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4497

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

I am using Windows 10 with an NVIDIA 2080Ti graphics card that has 22GB of video memory. I upgraded from version 0.1.32 to 0.1.38 with the goal of supporting loading multiple models and handling multiple concurrent requests. However, I noticed that under version 0.1.38, the video memory usage is very high, and the speed has become much slower.

I am using the "codeqwen:7b-chat-v1.5-q8_0" model. Under version 0.1.32, it used around 8GB of video memory and output approximately 10 tokens per second. However, under version 0.1.38, it is using 18.8GB of video memory, and based on my observation, it is only outputting 1-2 tokens per second.
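For reference, the ~8GB figure under 0.1.32 is roughly what the q8_0 weights alone should occupy. A back-of-the-envelope estimate (a sketch; the ~7B parameter count and q8_0's effective ~8.5 bits per weight are approximations, not figures from this issue):

```python
# Rough estimate of GGUF q8_0 weight size for a ~7B-parameter model.
# q8_0 stores an 8-bit value per weight plus one fp16 scale per
# 32-weight block, i.e. about 8.5 bits per weight on average.
params = 7e9              # assumed parameter count (approximate)
bits_per_weight = 8.5     # effective bits/weight for q8_0
weight_gb = params * bits_per_weight / 8 / 1e9
print(f"~{weight_gb:.1f} GB of weights")  # ~7.4 GB, close to the ~8 GB observed
```

So the jump to 18.8GB suggests something beyond the weights themselves is consuming memory.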

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.1.38

GiteaMirror added the bug label 2026-04-12 13:08:50 -05:00

@oldgithubman commented on GitHub (May 18, 2024):

Can confirm 0.1.38 seems to want more video memory


@dhiltgen commented on GitHub (May 21, 2024):

@chenwei0930 you mention enabling concurrency... what settings are you using? In particular, when you set `OLLAMA_NUM_PARALLEL` we have to multiply the context by that number, and it looks like this model has a default context size of 8192, so if you set a large parallel factor that might explain what you're seeing. I wouldn't expect to see a drop in token rate for a single request though. Perhaps `ollama ps` will help shed some light? Failing that, can you share server logs so we can see what might be going on?
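To make the multiplication concrete, here is a rough KV-cache estimate (a sketch with assumed dimensions for a generic 7B-class model: 32 layers, 32 KV heads, head dimension 128, fp16 cache, and a hypothetical parallel factor of 4; the actual CodeQwen figures may differ):

```python
# Sketch: KV-cache memory grows linearly with both context length and
# OLLAMA_NUM_PARALLEL, since each parallel slot gets its own context.
layers, kv_heads, head_dim = 32, 32, 128   # assumed 7B-class dimensions
bytes_per_elem = 2                         # fp16 cache entries
ctx, parallel = 8192, 4                    # default context x assumed parallel

per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
total_gib = per_token * ctx * parallel / 2**30
print(f"KV cache: ~{total_gib:.0f} GiB")   # ~16 GiB on top of the weights
```

Under these assumptions the cache alone would add on the order of 16 GiB, which would also push layers off the GPU and could explain the throughput drop.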


@Kyncc commented on GitHub (May 27, 2024):

Do you use the context size option? When I lower it, the GPU memory usage drops.
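If the default 8192-token context is the culprit, it can be lowered per model. One way is a Modelfile setting Ollama's standard `num_ctx` parameter (a sketch; the model tag is the one from this issue, and 2048 is just an example value):

```
FROM codeqwen:7b-chat-v1.5-q8_0
PARAMETER num_ctx 2048
```

Build it with `ollama create <name> -f Modelfile`; `num_ctx` can also be passed per request via the API's `options` field.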


@dhiltgen commented on GitHub (Jun 21, 2024):

If you're still seeing unexpected memory usage, please share more details about your setup and I'll re-open the issue.


Reference: github-starred/ollama#2816