[GH-ISSUE #4251] Ollama using minimal GPU on Windows #49163

Closed
opened 2026-04-28 10:52:02 -05:00 by GiteaMirror · 13 comments

Originally created by @Freffles on GitHub (May 8, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4251

Originally assigned to: @dhiltgen on GitHub.

Had this problem for a while and thought it was fixed in 1.34. I have a GTX 1050 Ti that is not being used by Ollama. Here is a partial log file.

[ollama_logsextract.txt](https://github.com/ollama/ollama/files/15244475/ollama_logsextract.txt)

GiteaMirror added the needs more info label 2026-04-28 10:52:02 -05:00

@dhiltgen commented on GitHub (May 8, 2024):

My suspicion is your client may be setting num_gpu to 1 in the request. Is that possible?


@Freffles commented on GitHub (May 8, 2024):

I guess it's possible; I had just been using the model_name, and hence the rest was at default settings. I will try giving it a tweak and see if there is any improvement. If it shouldn't be 1, what should it be?


@dhiltgen commented on GitHub (May 8, 2024):

This setting defines how many layers to load into the GPU. If it is unspecified or set to -1, the system will load as many layers as possible. I believe we had a bug in prior versions where a "1" was treated the same as -1 and resulted in the system loading as many layers as possible, which is likely what you want for best performance.
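
For illustration, here's a minimal sketch of setting num_gpu explicitly through the REST API's options field, using the Python `requests` library (the model name is just a placeholder, not something from this thread):

```python
import requests

# Sketch: ask Ollama to offload as many layers as possible.
# "llama3" is a placeholder model name; num_gpu=-1 (or leaving the
# option out entirely) lets the server load as many layers as will fit.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Why is the sky blue?",
        "stream": False,
        "options": {"num_gpu": -1},  # -1 = load as many layers as possible
    },
)
print(resp.json()["response"])
```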


@Freffles commented on GitHub (May 8, 2024):

This is how I'm setting up:

```python
def get_ollama_settings(embeddings_model):
    # See parameters comment below for more info. Current at 8 May 2024.
    return {
        "model": embeddings_model,
        "embed_instruction": "text: ",
        "query_instruction": "query: ",
        "num_ctx": 1024,
        "num_gpu": -1,
        "num_thread": 1,
        "repeat_last_n": 64,
        "repeat_penalty": 1.1,
        "temperature": 0.8,
        "tfs_z": 1.0,
        "top_k": 40,
        "top_p": 0.95,
        "show_progress": True
    }
```
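
For context, a hedged sketch of how a dict like this might be consumed. The embed_instruction/query_instruction keys suggest it feeds LangChain's OllamaEmbeddings (whose fields match these names), but the actual wiring in the original code is not shown in the thread:

```python
from langchain_community.embeddings import OllamaEmbeddings

# Sketch, assuming the settings dict above is unpacked into
# OllamaEmbeddings; "nomic-embed-text" is taken from later comments.
embedder = OllamaEmbeddings(**get_ollama_settings("nomic-embed-text"))
vectors = embedder.embed_documents(["first chunk", "second chunk"])
print(len(vectors), len(vectors[0]))  # number of texts, embedding dimension
```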

Screenshot of progress below. This speed is unchanged despite multiple Ollama and CUDA removals, reinstalls, and tweaks.
![image](https://github.com/ollama/ollama/assets/122719748/9b653221-ae2c-4dea-b727-2b2875778879)

[ollama_logs.txt](https://github.com/ollama/ollama/files/15255775/ollama_logs.txt)


@Freffles commented on GitHub (May 9, 2024):

FYI, I tried using Anything LLM and it seemed to work much better.

[LLMollama_logs.txt](https://github.com/ollama/ollama/files/15256953/LLMollama_logs.txt)


@dhiltgen commented on GitHub (May 10, 2024):

The latest logs you shared show it loading nearly all the layers into the GPU.

ollama_logs.txt:

```
llm_load_tensors: offloaded 13/13 layers to GPU
```

LLMollama_logs.txt:

```
llm_load_tensors: offloaded 32/33 layers to GPU
```

This is in comparison to the log you shared in the opening comment of the issue, which only loaded 1 of 13 layers.

I'm not sure what changed between those runs, but can you clarify the performance you're seeing when it loads ~all the layers?


@Freffles commented on GitHub (May 10, 2024):

I will have another look at this when I'm in front of the computer. Meanwhile, the log file reflects one session where I was using my own Python code and then Anything LLM with no other changes. When I use the latter, more GPU is used.

**OK, I have done another run with Anything LLM.** The "real" story is that the GPU is hardly used during embedding with nomic-embed-text (the only Ollama embedding model I have tried), but it is used when chatting. Log file attached.

[ollama_logs.txt](https://github.com/ollama/ollama/files/15281264/ollama_logs.txt)


@Freffles commented on GitHub (May 13, 2024):

FYI, I just managed to complete embeddings of a GitHub repo using Ollama's nomic-embed-text via Anything LLM with Chroma in about 3 minutes; that had been taking over 20 minutes previously. Here is the log file. I will try this again outside of Anything LLM and see what happens.

[ollama_logs.txt](https://github.com/ollama/ollama/files/15290928/ollama_logs.txt)


@dhiltgen commented on GitHub (Aug 1, 2024):

As of v0.2.6 (and newer) the embeddings API now takes a list of inputs and will parallelize the processing. Let us know if that addresses the performance problem you were facing.

https://github.com/ollama/ollama/blob/main/docs/api.md#generate-embeddings
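
A minimal sketch of the batched call, assuming the /api/embed endpoint described in the linked docs (the model name is taken from earlier in this thread; the input strings are placeholders):

```python
import requests

# Sketch: the batched endpoint accepts a list of inputs in one request
# and embeds them in parallel, instead of one prompt per call.
resp = requests.post(
    "http://localhost:11434/api/embed",
    json={
        "model": "nomic-embed-text",
        "input": ["first chunk of text", "second chunk of text"],
    },
)
embeddings = resp.json()["embeddings"]  # one vector per input string
print(len(embeddings))
```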


@ashokgelal commented on GitHub (Aug 6, 2024):

If -1 is "auto", is there a magic number for "max"? For an app that provides an interface to Ollama, what should it use so that the user can say "I want to use max GPU layers"? Or is -1 more like max?


@dhiltgen commented on GitHub (Aug 6, 2024):

@ashokgelal don't set any override environment variables or parameters, and Ollama will strive to maximize VRAM utilization.


@ashokgelal commented on GitHub (Aug 6, 2024):

Thanks @dhiltgen. Is that the same as setting it to -1? If not, what does it do?


@dhiltgen commented on GitHub (Aug 9, 2024):

@ashokgelal I don't know what operating system you're on, so it varies, but on Windows you'd delete the entry from the environment variables dialog, on Linux you'd remove the entry from your systemd configuration, and on macOS you'd use launchctl. Instructions can be found here: https://github.com/ollama/ollama/blob/main/docs/faq.md#how-do-i-configure-ollama-server.

I'll go ahead and close this one now, as it should work much better/faster. @Freffles, if you're still seeing performance problems with the new batch support, please provide some updated information and I'll reopen.

Reference: github-starred/ollama#49163