[GH-ISSUE #1637] Slowness #47425

Closed
opened 2026-04-28 03:46:21 -05:00 by GiteaMirror · 6 comments

Originally created by @Pblmr on GitHub (Dec 20, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1637

Running ollama on a Dell server with two 12-core Intel Xeon Silver 4214R CPUs, 64 GB of RAM, and Ubuntu 22.04, but it generally runs quite slow (nothing like what we can see in the real-time demos).
I don't have a GPU. I tried mainly llama2 (latest/default) with all default parameters (it's using 24 GB of RAM).

What are the ways to make it faster? It seems getting a GPU would be the best way, but is there any other quick win?
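For reference, one CPU-only knob worth probing is the per-request `options` map accepted by Ollama's REST API. The sketch below is a minimal experiment, assuming a local server on the default port 11434 and the Python `requests` package; the thread and context values are only guesses to compare against the defaults, not recommendations.

```python
# Minimal sketch: probe whether thread count / context size affect CPU latency.
# Assumes an Ollama server on localhost:11434 and the `requests` package.
import time
import requests

def timed_generate(options):
    start = time.time()
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama2",
            "prompt": "Explain what a mutex is in one sentence.",
            "stream": False,
            "options": options,   # per-request overrides, no Modelfile change needed
        },
        timeout=600,
    )
    resp.raise_for_status()
    return time.time() - start, resp.json()["response"]

# Try the defaults, then pin threads to the physical core count
# (24 on a dual 12-core 4214R) and shrink the context window.
for opts in [{}, {"num_thread": 24}, {"num_thread": 24, "num_ctx": 1024}]:
    elapsed, _ = timed_generate(opts)
    print(f"options={opts} -> {elapsed:.1f}s")
```

Whether any of this moves the needle depends on the workload; CPU token generation is usually memory-bandwidth-bound, so adding threads beyond the physical core count rarely helps.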

Additionally, I think the model storage location management should be improved. Others have also reported issues with OLLAMA_MODELS. It seems there can be a conflict between the Unix service user and the user starting the server locally. Something most likely needs to be improved there.
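On the storage point, one way to at least make a manually started server explicit about its model directory is to set OLLAMA_MODELS when launching it, so it can't silently diverge from the systemd service's store. A rough illustration; the path is a placeholder, not a recommendation.

```python
# Illustration only: launch a foreground server with an explicit models
# directory. The path below is a hypothetical shared location.
import os
import subprocess

env = os.environ.copy()
env["OLLAMA_MODELS"] = "/srv/ollama/models"   # placeholder path

# `ollama serve` blocks until interrupted; run it in the foreground here.
subprocess.run(["ollama", "serve"], env=env, check=True)
```

For the packaged systemd service, the equivalent is setting the same variable in the unit's environment (e.g. via a drop-in override) so both ways of starting the server agree on one directory.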


@yashchittora commented on GitHub (Dec 20, 2023):

Try running some other models like mistral or orca. I also tried Llama2 models; they seem to have worse performance than other models on the same hardware.


@Pblmr commented on GitHub (Dec 20, 2023):

I am trying dolphin-mixtral:latest. It takes 45 seconds to load, and then simple sample questions take around 30 seconds; simple programming questions take about 2 minutes. I am wondering how to speed this up.
dolphin-mixtral is likely quite big compared to mistral or orca.
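For a sense of scale, the server's `/api/show` endpoint reports each model's parameter size and quantization level, so the difference is easy to check. A small sketch, assuming both tags are already pulled locally and that the field names match the current API docs:

```python
# Compare the size/quantization of two locally pulled models via /api/show.
# Assumes an Ollama server on localhost:11434.
import requests

for name in ["dolphin-mixtral:latest", "mistral:latest"]:
    resp = requests.post("http://localhost:11434/api/show", json={"name": name})
    resp.raise_for_status()
    details = resp.json().get("details", {})
    print(name, details.get("parameter_size"), details.get("quantization_level"))
```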


@yashchittora commented on GitHub (Dec 20, 2023):

Mixtral is a fairly big model. Having more memory and a GPU may improve performance significantly. Considering you already have a high amount of RAM, I guess adding a GPU would help. Try running it on a machine with those specs first to confirm that the slowness really is due to the missing GPU. You could rent a cloud machine with the specs you're considering, test the models you want to run, and only then buy the hardware.


@Pblmr commented on GitHub (Dec 20, 2023):

Good idea, thanks. Perhaps it should be emphasised more how much of a difference a GPU makes and why.
For training it's very clear and widely known/understood, but for running the model, less so. There might be other things to try, like using a different level of quantization, finding where the slowness comes from, etc. There are quite a few tuning parameters; it would be interesting to have more information on them.
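On finding where the slowness comes from: the non-streaming `/api/generate` response carries nanosecond timing fields, so a single request can be split into load, prompt-evaluation, and generation time without any external profiling. A rough sketch, with field names taken from the current API docs and the model tag simply the one mentioned above:

```python
# Break one request into load / prompt-eval / generation phases using the
# timing fields returned by /api/generate (durations are in nanoseconds).
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "dolphin-mixtral:latest",
          "prompt": "Write a Python function that reverses a string.",
          "stream": False},
    timeout=1200,
)
resp.raise_for_status()
r = resp.json()

ns = 1e9
print(f"load:        {r.get('load_duration', 0) / ns:.1f}s")
print(f"prompt eval: {r.get('prompt_eval_duration', 0) / ns:.1f}s "
      f"({r.get('prompt_eval_count', 0)} tokens)")
print(f"generation:  {r.get('eval_duration', 0) / ns:.1f}s "
      f"({r.get('eval_count', 0)} tokens)")
if r.get("eval_duration"):
    print(f"tokens/sec:  {r['eval_count'] / (r['eval_duration'] / ns):.1f}")
```

Different quantization levels of the same model are generally published as separate tags, so comparing them is just a matter of pulling another tag and rerunning the same request.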


@vtrenton commented on GitHub (Dec 21, 2023):

A few things that may be worth noting: with a Xeon system you probably have ECC memory, which is slower (because of error checking). Memory speed is really critical for LLM workloads, and it looks like your CPU supports a maximum of 2400 MHz. Also, looking at the clock speed of the machine: while you have a decent number of threads to parallelize across, I'm willing to bet they are pegged at 100% the entire time. According to Intel's site, it boosts to about 3.5 GHz: https://www.intel.com/content/www/us/en/products/sku/197100/intel-xeon-silver-4214r-processor-16-5m-cache-2-40-ghz/specifications.html. Of course, a GPU is an entirely different type of compute unit and would probably perform far better than the CPU at LLM tasks. I noticed a serious performance gain myself after adding a 3090 for llama2 and mixtral, so I can vouch for it.
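One cheap way to test the pegged-at-100% theory is to sample per-core utilization while a request is in flight. A hedged sketch, assuming the third-party `psutil` and `requests` packages and a local server:

```python
# Sample per-core CPU utilization while a generation request runs.
# Assumes `psutil`, `requests`, and an Ollama server on localhost:11434.
import threading
import psutil
import requests

def generate():
    requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama2", "prompt": "Summarize the plot of Hamlet.",
              "stream": False},
        timeout=1200,
    )

worker = threading.Thread(target=generate)
worker.start()
while worker.is_alive():
    # Average utilization per core over a 2-second window.
    per_core = psutil.cpu_percent(interval=2, percpu=True)
    busy = sum(1 for p in per_core if p > 90)
    print(f"{busy}/{len(per_core)} cores above 90% "
          f"(mean {sum(per_core) / len(per_core):.0f}%)")
```

If only about half the logical cores are busy, that is consistent with the default of one thread per physical core; if they are all busy yet tokens/sec stays low, the memory-bandwidth point above is probably the bottleneck.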


@pdevine commented on GitHub (Dec 21, 2023):

I definitely recommend getting a GPU, although if you're going to add one to an existing server, make certain you get one which can actually fit into your case. The newer 4090s and Radeon 7900 XTXs are a lot bigger than older GPUs that you may be familiar with (and you may also have to deal with the dreaded GPU sag).

With a GeForce 4090 you can expect roughly 120-140 tokens/sec with a 7B 4-bit quantized model. The 7900 XTX can do about 100 tokens/sec with the same model (and is a lot cheaper, and will be supported by Ollama soon). You'll also want to make certain your motherboard has a free PCIe x16 slot to get the best performance with either card.

I'm going to go ahead and close the issue.

Reference: github-starred/ollama#47425