[GH-ISSUE #2269] Recommended Spec For Dolphin Mixtral on AWS #63342

Closed
opened 2026-05-03 13:03:21 -05:00 by GiteaMirror · 11 comments

Originally created by @alkali333 on GitHub (Jan 30, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2269

Hi there,

I have been playing around with various models on Amazon EC2 instances, but I'm not very experienced with AWS and I'm not sure what setup is optimal for running Dolphin Mixtral and other LLMs.

Can anybody recommend an instance that will run it relatively smoothly, or just the specification I need? I've been able to get good performance on some setups but I don't know if I am paying too much.

Thanks

@orlyandico commented on GitHub (Feb 3, 2024):

Unless your company is paying for your AWS spend, may I suggest hyperstack.cloud ?

They are WAY cheaper than AWS. They have the RTX A6000 Ada Generation with 48GB of GPU memory for $1.10/hour on demand.

The (generally) best bang-for-the-buck AWS GPU instances are g4dn and g5g, at $0.526/hour on-demand for a single-GPU instance with 16GB of GPU memory. Based on my own benchmarking, the A6000 is more than double the performance of the Nvidia T4 in the g4dn when using Ollama, so although it's 2x the price, you get 2x the performance and 3x the GPU memory.

Hyperstack also has the cheaper A4000 at $0.43/hour, which is cheaper than the T4-based g4dn.xlarge and faster (although how much faster, I have not measured).

Stay far, far away from AWS g2 and g3 instances (super old), or even the P2/P3. They simply don't have the price-performance.

AWS doesn't have a single-GPU A100 instance, only an 8-GPU one at around $20/hour. Also, A100 and H100 GPU availability is very low.
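
For reference, a minimal sketch of the kind of throughput check behind these comparisons; it assumes a local Ollama server on the default port with the model already pulled, and reads the eval_count/eval_duration fields that /api/generate returns:

```python
# Rough Ollama throughput check: generate once and report tokens per second.
# Assumes `ollama serve` is running locally and `dolphin-mixtral` has been pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "dolphin-mixtral",
        "prompt": "Explain the difference between TCP and UDP in three sentences.",
        "stream": False,
    },
    timeout=600,
)
data = resp.json()
# eval_duration is reported in nanoseconds
tokens_per_sec = data["eval_count"] / (data["eval_duration"] / 1e9)
print(f"generated {data['eval_count']} tokens at {tokens_per_sec:.1f} tokens/s")
```

Run the same prompt on each instance type and compare the tokens/s figure.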

@yvescleuder commented on GitHub (Feb 28, 2024):

Hey @orlyandico,

I have the same question. I want to use the Gemma model, but I have no idea which instance gives the best cost-benefit.
Which one would you recommend? And which provider?

@orlyandico commented on GitHub (Feb 28, 2024):

I can only speak for Hyperstack (as that's what I use personally). The Big Three hyperscalers are more expensive and offer more features, but if you're only going to be running Ollama, they're overkill.

Another option: get an Ethereum mining rig (the mainboard, power supply, and case) and populate it with Tesla P40 GPUs from eBay (around $200 each). They aren't the fastest (a bit faster than an RTX 3060), but they have 24GB of VRAM each. If you can fit three on the mainboard, that's 72GB of VRAM for under $1000.

@yvescleuder commented on GitHub (Feb 28, 2024):

Hi @orlyandico,

I need to host it, as I will use it for my company.
We have a module that we built on top of OpenAI GPT-3.5.

However, we do not use the API in conversation mode, only single questions: every time a customer interacts, it creates a new standalone question.
We wanted to use conversation mode, but with OpenAI that would be very expensive, as they charge for both input and output tokens.
We want to support conversations, and for that I believe a self-hosted model is the best option.

@orlyandico commented on GitHub (Feb 29, 2024):

How large is the model? Will it fit on a single GPU? Most of the smaller hyperscalers offer single-GPU SKUs, which is cheaper, but if your model won't fit... then there is a problem.

Inference time is roughly linear in model size. There are a lot of decent 7B models. But inference is quadratic in the context length, so if your chats get very long, inferences/second will drop. How many users do you expect? That may dictate how many GPUs you need to sustain a given inferences/second.

There is a very nice article here - https://newsletter.pragmaticengineer.com/p/scaling-chatgpt

TL;DR - ChatGPT (and presumably many/all LLMs) is memory-bound, not GPU-compute-bound, which means an A100 is good enough (you may not need an H100). But if your model is too large you will still end up with a multi-GPU configuration. However, Nvidia came up with some optimizations (not sure if Ollama uses them) that give a 2x performance increase on the Ada generation. So you want an Ada GPU (like the RTX A6000 Ada that I referenced above) - https://huggingface.co/blog/optimum-nvidia

You can also try quantizing the model down to 4 bits. Microsoft has some recent research showing good accuracy at 1.58 bits per weight (!): https://arxiv.org/html/2402.17764v1

(probably not available as publicly usable code yet)

If you are going to use a model that fits in a 16GB GPU, then I would look for whoever has the cheapest 16GB VRAM Ada generation GPU around.
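
As a rough sizing aid (back-of-envelope math, not a measurement), you can estimate the VRAM a model needs from its parameter count and bits per weight, plus some headroom for the KV cache and runtime:

```python
# Back-of-envelope VRAM estimate: weights take params * bits / 8 bytes,
# plus headroom for KV cache and runtime overhead. Real usage varies by
# runtime and context length, so treat this as a sizing hint only.
def estimate_vram_gb(params_billion: float, bits_per_weight: float, overhead_gb: float = 2.0) -> float:
    weights_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bits ~= 1 GB
    return weights_gb + overhead_gb

# Mixtral 8x7B keeps all experts resident, ~47B parameters in total.
print(f"Mixtral 8x7B @ 4-bit: ~{estimate_vram_gb(47, 4):.0f} GB")  # ~26 GB -> won't fit a 16GB card
print(f"7B model @ 4-bit:     ~{estimate_vram_gb(7, 4):.0f} GB")   # ~6 GB  -> fits a 16GB GPU easily
```

That's why the Dolphin Mixtral from the original question needs something in the 48GB class (or multiple GPUs), while a quantized 7B fits comfortably on a 16GB card.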

Finally - why Gemma? It is not the highest-scoring 7B-class model on the HF LLM leaderboard.

@yvescleuder commented on GitHub (Feb 29, 2024):

Alright,

Maybe I was hasty when I mentioned Gemma; I haven't actually chosen which model is best for my scenario, especially because I still need to study to understand which one to choose. I'm just getting into the world of AI. So far I've only used OpenAI, so I don't know much about the models or how they work.
I don't have many simultaneous users, only a few, but they will use it in conversation mode, and the chats could get long.
I don't know exactly which model to use; how can I find out which is best for my scenario?
OpenAI works very well for us, but the cost is quite high if we use conversation mode with GPT-4.

@orlyandico commented on GitHub (Feb 29, 2024):

Well.. OpenAI's GPT-4 is the best model, hands down. None of the open-source ones come close. The one that comes closest (today) is Smaug-70B, which has been added to the Ollama repo. It is a huge model, however; you would probably need 2x A100 to self-host it (or 2x RTX A6000 Ada). The question is - do you need the accuracy of Smaug-70B? I have had good experience with DolphinPhi (which is in the Ollama model gallery), which is a 1.6B model. Pretty much any model with RAG (and a corpus of your data stored in, say, OpenSearch or PostgreSQL) would probably perform acceptably.

Have you looked at this? It uses a smaller local LLM to reduce the token count sent to OpenAI, thus reducing cost - https://github.com/microsoft/LLMLingua/

Basically it removes tokens from the input that it thinks (based on the local 3B or 7B LLM) are not needed. In my experience... it didn't work so great (it prints out what it thinks is the savings on OpenAI API calls). If you are using RAG, the prompts can get long really fast, so something like LLMLingua would help.
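
For reference, this is roughly how LLMLingua is invoked, based on the project's README at the time (the exact API, defaults, and local scoring model may have changed, and the prompt text here is just an illustration):

```python
# Minimal LLMLingua sketch: compress a long RAG context with a small local LLM
# before sending it to a paid API. PromptCompressor loads a local model
# (a Llama-family model by default) to score and drop low-information tokens.
from llmlingua import PromptCompressor

compressor = PromptCompressor()  # model_name=... / device_map=... select the local scoring model

long_context = "...retrieved documents go here..."  # placeholder
result = compressor.compress_prompt(
    long_context,
    instruction="Answer the question using only the context.",
    question="What does the regulation say about permit renewals?",
    target_token=300,
)
print(result["compressed_prompt"])  # send this, not the full context, to OpenAI
print(result["origin_tokens"], "->", result["compressed_tokens"])
```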

@yvescleuder commented on GitHub (Feb 29, 2024):

I probably don't need something at that level. I can run experiments within my application and start with small models to understand what I actually need.
What would be your recommendation? And what type of machine do I need?

@orlyandico commented on GitHub (Feb 29, 2024):

Here are some steps for doing RAG and Ollama locally on a Linux box - https://github.com/marklysze/LangChain-RAG-Linux

The key issue is that, to answer the questions, you need an indexed corpus of the documents. Say it's a bunch of local government laws, guidelines, regulations, etc.; the LLM does not know anything about these and would hallucinate. To avoid hallucinations and to "ground" the answer in the existing document corpus, you need RAG (and you need a database to hold the document corpus). The link above steps through this.
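
Not from that repo, but as a rough sketch of the same loop (assuming a local Ollama server with an embedding model such as nomic-embed-text and a small chat model already pulled; model names and document text are illustrative), RAG boils down to embed, retrieve, then generate with the retrieved text in the prompt:

```python
# Tiny in-memory RAG sketch against a local Ollama server: embed a few
# documents, pick the closest one to the question by cosine similarity,
# and ground the answer in it. A real setup would use a vector database
# (Chroma, OpenSearch, pgvector) instead of a Python list.
import requests
import numpy as np

OLLAMA = "http://localhost:11434"

def embed(text: str) -> np.ndarray:
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text})
    return np.array(r.json()["embedding"])

docs = [
    "Permits must be renewed every two years before March 31.",
    "Noise ordinances apply between 22:00 and 07:00 on weekdays.",
]
doc_vecs = [embed(d) for d in docs]

question = "How often do permits need to be renewed?"
q = embed(question)
best = max(range(len(docs)),
           key=lambda i: float(q @ doc_vecs[i]) /
                         (np.linalg.norm(q) * np.linalg.norm(doc_vecs[i])))

prompt = f"Answer using only this context:\n{docs[best]}\n\nQuestion: {question}"
answer = requests.post(f"{OLLAMA}/api/generate",
                       json={"model": "dolphin-phi", "prompt": prompt, "stream": False})
print(answer.json()["response"])
```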

If you use something like DolphinPhi 1.6B as the LLM, then pretty much any GPU will work. I personally use an RTX 3060 (non-Ti), which has 12GB of VRAM. It can handle 7B models at decent inference rates (certainly enough for your prototyping). So a PC with Linux, an RTX 3060 or 4060, and 64GB of RAM should be plenty.

@bmizerany commented on GitHub (Mar 11, 2024):

This is a great question, and I hope you found your answer! I'm closing this only because it doesn't fall into the category of an "issue".

For general questions/help/support please join us in Discord or Reddit:

* https://discord.com/invite/ollama
* https://www.reddit.com/r/ollama

@gnumoksha commented on GitHub (Mar 18, 2024):

I've successfully run Ollama (llama2) on a g5.xlarge instance running Ubuntu 22.04. The CUDA library didn't work on Amazon Linux 2023.
