[GH-ISSUE #5590] Ollama running requests slow, while not utilizing entire VRAM #3492

Closed
opened 2026-04-12 14:11:06 -05:00 by GiteaMirror · 4 comments

Originally created by @txhno on GitHub (Jul 10, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5590

Originally assigned to: @dhiltgen on GitHub.

I am continuously sending "generate" API requests to the Ollama server, querying the model wizardlm2:7b-q6_K. Each iteration takes approximately 3 seconds to respond. The model is using only 8GB of the 16GB VRAM available on my Tesla V100 GPU.

Is there a way to make it utilize the entire VRAM to speed up request processing? If so, how can I achieve this?
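For reference, a minimal sketch of the request loop described above, assuming a local Ollama server on the default port and the Python `requests` library (the prompts are placeholders):

```python
# Minimal sketch of the sequential "generate" loop described above.
# Assumes a local Ollama server on the default port; prompts are illustrative.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "wizardlm2:7b-q6_K"

def generate(prompt: str) -> str:
    resp = requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

# Each iteration waits for the previous response before sending the next request.
for i in range(10):
    print(generate(f"Summarize item {i} in one sentence."))
```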

GiteaMirror added the question label 2026-04-12 14:11:06 -05:00

@HeroSong666 commented on GitHub (Jul 11, 2024):

I had a similar problem. I used four A30 GPUs to run inference with the qwen2-72b model, but even at peak times each card was never more than 35% utilized, and inference was relatively slow.


@dhiltgen commented on GitHub (Jul 23, 2024):

@txhno that VRAM usage sounds correct for that model. There isn't a way to make a model use more memory to make it go faster. I don't have a V100, but on an RTX 2080 Ti which is a similar generation, I see an initial load take ~3 seconds, and subsequent requests to the already loaded model are ~0.5s for a short prompt with a token rate of ~65 tokens per second.

Concurrency and parallelism will allow you to send multiple requests at the same time, which can improve aggregate throughput, but that only helps if you actually have multiple concurrent client requests.

With concurrency, you can load a second model (we don't support loading the same model twice) but you may bump up against PCI bus throughput, CPU performance limits, etc.

If the performance you're seeing is significantly slower than what I described above, please share more details and your server log, and I'll reopen the issue so we can look into it further.
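To illustrate the concurrency suggestion above, here is a hedged sketch of issuing several client requests at once. It assumes the server is configured to handle parallel requests (e.g. `OLLAMA_NUM_PARALLEL` greater than 1); the prompts and worker count are placeholders:

```python
# Illustrative sketch: keep several requests in flight at once to improve
# aggregate throughput. Individual requests are not faster; only total
# throughput improves when the server handles requests in parallel.
import concurrent.futures
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "wizardlm2:7b-q6_K"

def generate(prompt: str) -> str:
    resp = requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

prompts = [f"Summarize document {i} in one sentence." for i in range(8)]

# Each worker thread holds one in-flight request against the server.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    for answer in pool.map(generate, prompts):
        print(answer)
```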


@txhno commented on GitHub (Jul 24, 2024):

@dhiltgen Can I create a Modelfile for the same model, perhaps increase the context size a bit, and then run the modified copy concurrently alongside the original?
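For concreteness, the kind of derived model the question describes might be built from a Modelfile like the sketch below; the derived model name and the `num_ctx` value are hypothetical (and, as the following reply explains, the result still resolves to the same underlying model):

```
# Hypothetical Modelfile: same base weights, larger context window.
FROM wizardlm2:7b-q6_K
PARAMETER num_ctx 8192
```

It would be built with something like `ollama create wizardlm2-bigctx -f Modelfile`, where `wizardlm2-bigctx` is an illustrative name.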


@dhiltgen commented on GitHub (Jul 26, 2024):

@txhno unfortunately no, parameters do not affect the weight layer(s) in the model, so the uniqueness algorithm will treat them as the same model.

Reference: github-starred/ollama#3492