[GH-ISSUE #1240] The DeepSeek-Coder AI model is not loading entirely into RAM, causing the model responses to be very slow. #62668

Closed
opened 2026-05-03 09:55:49 -05:00 by GiteaMirror · 5 comments

Originally created by @jveeru on GitHub (Nov 22, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1240

Hi,

I am using Ollama on a Mac Studio M1 Max with 64GB RAM. I have experimented with different models such as DeepSeek Coder 33b, WizardCoder Python 13b, and Mistral 7b text. Most of these models are loaded entirely into RAM, except for DeepSeek Coder: the 33b model uses less than 4GB of RAM, while WizardCoder uses a little over 13GB. I am not sure how I can increase the memory limit for a specific model. I've tried different versions of the DeepSeek Coder model, but all of the 33b variants run into the same problem.

Is there any parameter I need to include in the `Modelfile` or on the command line when running the model?

@easp commented on GitHub (Nov 23, 2023):

On macOS the model data is memory mapped, so it doesn't show up in ollama-runner's memory allocation. It will swell the size of the file cache, though.

In short, it doesn't work the way you think it does, but it works the way it should; the model is in RAM during inference/text generation.
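
To illustrate the point, here is a minimal, self-contained Python sketch (not Ollama's actual code, which is written in Go) of how a memory-mapped file behaves: the mapping itself allocates nothing on the process heap, and the pages touched during reads are accounted to the OS file cache rather than to the process's own allocation.

```python
# Minimal sketch: memory-mapping a large file does not show up as process
# allocation; pages faulted in while reading are owned by the OS file cache.
import mmap
import os

path = "model.bin"  # hypothetical file standing in for the model weights

# Create a 1 GiB sparse file so the example is self-contained.
with open(path, "wb") as f:
    f.truncate(1 << 30)

with open(path, "rb") as f:
    # mmap only reserves address space; no heap allocation happens here.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    # Touching the pages (as inference does when it reads the weights)
    # faults them in; they land in the file cache, not in malloc'd memory.
    total = 0
    for offset in range(0, len(mm), 4096):
        total += mm[offset]

    mm.close()

os.remove(path)
```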

@orlyandico commented on GitHub (Nov 29, 2023):

I tried running the goliath model on an Intel Mac with 128GB of RAM and no GPU. I see around 2GB of memory usage for ollama-runner and 79GB of cached files, which tracks with the above. Inference is extremely slow, however. Is this a function of the comparatively large model? (e.g. codellama on the same hardware, doing inference on the CPU, is much faster and actually usable)

@easp commented on GitHub (Nov 29, 2023):

@orlyandico Text generation is highly dependent on memory bandwidth, and the whole model is traversed for every token, so there is an inverse relationship between model size and text-generation rate: a 2x-sized model will be ~1/2 as fast. Is that what you are seeing?
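
To make the inverse relationship concrete, here is a rough back-of-envelope sketch in Python. The model sizes and bandwidth figures are assumptions for illustration (the ~79GB figure comes from the comment above), not measurements:

```python
# Rough upper bound on generation speed when memory bandwidth is the
# bottleneck: every generated token reads the full set of weights, so
# tokens/s <= bandwidth / model size.
def max_tokens_per_second(model_size_gb: float, bandwidth_gb_s: float) -> float:
    return bandwidth_gb_s / model_size_gb

# Assumed figures: ~19 GB for a 4-bit 33b model, ~79 GB for goliath (as
# reported above), ~400 GB/s unified memory on an M1 Max, ~50 GB/s for a
# typical dual-channel Intel desktop.
models = [("33b @ 4-bit (~19 GB)", 19), ("goliath (~79 GB)", 79)]
machines = [("M1 Max (~400 GB/s)", 400), ("Intel CPU (~50 GB/s)", 50)]

for model_name, size_gb in models:
    for machine_name, bw in machines:
        limit = max_tokens_per_second(size_gb, bw)
        print(f"{model_name} on {machine_name}: <= {limit:.1f} tok/s")
```

Even as a loose upper bound, this shows why a ~79GB model on a ~50 GB/s machine can't do much better than ~0.6 tokens/s, while a ~19GB model on the same hardware could still reach a few tokens per second.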

@orlyandico commented on GitHub (Nov 29, 2023):

@easp that makes complete sense (I realised after posting the above that all of the weights have to be evaluated for every token). It seems there's no way to evaluate large models at decent speeds without having an H100 or A100 handy...

@technovangelist commented on GitHub (Dec 19, 2023):

It looks like @easp's answers solved this issue. I will go ahead and close it now. If you think there is anything we left out, reopen it and we can address it. Thanks for being part of this great community.

Reference: github-starred/ollama#62668