[GH-ISSUE #1416] Attempting to load a model smaller than 10GiB into 12.2GiB GPU results in failing over to load into the host RAM. #47267

Closed
opened 2026-04-28 03:29:25 -05:00 by GiteaMirror · 7 comments

Originally created by @phalexo on GitHub (Dec 7, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1416

Originally assigned to: @dhiltgen on GitHub.

I have converted losslessmegacoder-llama2-13b-min.Q6_K.model to ollama format.

When I attempt to load it, the model size is reported as under 10GiB, but when I do `ollama run losslessmegacoder-llama2-13b-min.Q6_K` it attempts to load it into a GPU, apparently runs out of VRAM, and loads into the host instead.

If the model is smaller than 10GiB, why is it using an additional 2.2GiB, and is there anything I can do to mitigate this?
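For reference, free VRAM just before the load can be checked with the standard NVIDIA tool (the exact free amount will vary with whatever else the desktop session is holding):

```
nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv
```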


@BruceMacD commented on GitHub (Dec 7, 2023):

Hi @phalexo, aside from using a smaller model there are a couple of things you could try.

  1. Make sure you have as much VRAM free as possible. Ollama will only be able to use the VRAM not being used by other programs.
  2. Check the context size of the model you are trying to run; a large context can make the total footprint too large for your VRAM (see the rough arithmetic after this list).
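As a rough back-of-the-envelope for point 2, assuming stock llama2-13b geometry (40 layers, 5120-wide hidden state, f16 KV cache; the exact numbers for this particular file may differ):

```
KV cache per token ≈ 2 (K and V) × 40 layers × 5120 × 2 bytes ≈ 0.78 MiB
at a 2048-token context → ≈ 1.6 GiB
at a 4096-token context → ≈ 3.1 GiB
```

That sits on top of the ~10GiB of weights, plus compute buffers, which is how a sub-10GiB model can overflow a 12.2GiB card.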

If you don't want to load the context into VRAM, you can set the `num_gpu` parameter to the number of model layers to see if that helps. Here is an example of how I would do that for a model with 32 layers.

```
ollama run some_model
>>> /set parameter num_gpu 32
Set parameter 'num_gpu' to '32'
```

@phalexo commented on GitHub (Dec 7, 2023):

> Hi @phalexo, aside from using a smaller model there are a couple of things you could try.
>
> 1. Make sure you have as much VRAM free as possible. Ollama will only be able to use the VRAM not being used by other programs.
>
> 2. Check the context size of the model you are trying to run; a large context can make the total footprint too large for your VRAM.
>
> If you don't want to load the context into VRAM, you can set the `num_gpu` parameter to the number of model layers to see if that helps. Here is an example of how I would do that for a model with 32 layers.
>
> ```
> ollama run some_model
> >>> /set parameter num_gpu 32
> Set parameter 'num_gpu' to '32'
> ```

Is it possible to set this in the model file?

I will experiment with this parameter, but can you clarify something else for me?

```
OLLAMA_HOST:0.0.0.0: 424242 litellm --model ollama/model_name
```

Would the above line force litellm to attach to ollama using the specified port, or would it still default to 11434?

And if this does not change the default port that ollama presents, is there a way to do it?


@mxyng commented on GitHub (Dec 7, 2023):

> Is it possible to set this in the model file?

Yes, you can set `PARAMETER num_gpu 32` in a Modelfile to achieve the same thing.
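For example, a minimal Modelfile along these lines (the model name is just a placeholder), created with `ollama create some_model-gpu32 -f Modelfile`:

```
FROM some_model
PARAMETER num_gpu 32
```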

> OLLAMA_HOST:0.0.0.0: 424242 litellm --model ollama/model_name
> Would the above line force litellm to attach to ollama using the specified port or would it still default to 11434?

That's a question better asked in the litellm repo. Ollama uses `OLLAMA_HOST` to configure the host and port, so the best answer I can give is that if you start ollama manually with those settings, it'll use the specified port.

Note: it looks like there's a space between the host (0.0.0.0) and the port (424242). I'm not sure if that's intentional or a typo; as is, it's likely an invalid shell command.
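For example (the port number here is only illustrative), starting the server with:

```
OLLAMA_HOST=0.0.0.0:11435 ollama serve
```

and then pointing clients at `http://localhost:11435`.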


@phalexo commented on GitHub (Dec 7, 2023):

> > Is it possible to set this in the model file?
>
> Yes, you can set `PARAMETER num_gpu 32` in a Modelfile to achieve the same thing.
>
> > OLLAMA_HOST:0.0.0.0: 424242 litellm --model ollama/model_name
> > Would the above line force litellm to attach to ollama using the specified port or would it still default to 11434?
>
> That's a question better asked in the litellm repo. Ollama uses `OLLAMA_HOST` to configure the host and port, so the best answer I can give is that if you start ollama manually with those settings, it'll use the specified port.
>
> Note: it looks like there's a space between the host (0.0.0.0) and the port (424242). I'm not sure if that's intentional or a typo; as is, it's likely an invalid shell command.

Turns out I have to do this:

```
OLLAMA_HOST=127.0.0.1:11435 litellm --model ollama/deepseek-coder-6.7b-instruct.Q6_K --api_base http://localhost:11435 --port 8001
```

Unfortunately I am not getting the async behavior I hoped for. It runs using one GPU at a time, i.e. lower utilization than just swapping models in and out and running each model on all GPUs.

It's still a puzzle how to run AutoGen with async processing.
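One direction would be a separate ollama server per model, each on its own port and pinned to a subset of GPUs; the ports, device IDs, and the second model name below are only illustrative:

```
# one ollama server per model, each pinned to its own GPUs
CUDA_VISIBLE_DEVICES=0,1 OLLAMA_HOST=127.0.0.1:11435 ollama serve &
CUDA_VISIBLE_DEVICES=2,3 OLLAMA_HOST=127.0.0.1:11436 ollama serve &

# one litellm proxy per backend
litellm --model ollama/deepseek-coder-6.7b-instruct.Q6_K --api_base http://localhost:11435 --port 8001
litellm --model ollama/some-other-model --api_base http://localhost:11436 --port 8002
```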


@phalexo commented on GitHub (Dec 8, 2023):

> It seems this is a question about litellm. Their repo or Discord may be a better place to ask since they will have experience with their product.

Well, the fact that ollama loads and unloads models in their entirety, instead of holding multiple models' weights in VRAM (subject to VRAM availability), is very much an ollama question. Even if I have two small models that should fit at the same time, it does not happen.

The higher level async behavior is likely a question for AutoGen people.


@dhiltgen commented on GitHub (Mar 12, 2024):

@phalexo at this time we only load a single model. It sounds like this issue has morphed to track that, which makes it a dup of #2109, which is something we're working on.


@phalexo commented on GitHub (Mar 12, 2024):

I made that comment a long time ago, so I don't quite remember what I was doing.

That said, I was able to successfully load multiple models into the same 4 GPUs, with each model spread over multiple GPUs.


Reference: github-starred/ollama#47267