[GH-ISSUE #14041] Unable to Force Ollama to Use the First GPU on Ubuntu 24.04.3 LTS #55686

Closed
opened 2026-04-29 09:34:48 -05:00 by GiteaMirror · 4 comments

Originally created by @GreenMap-chan on GitHub (Feb 3, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14041

What is the issue?

Hello,

I am using Ubuntu 24.04.3 LTS and I am unable to force Ollama to use the first GPU (I have two GPUs in total). I wrote this script to test:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from langchain_ollama import OllamaLLM
from langchain_core.prompts import PromptTemplate

llm = OllamaLLM(
    model="yandex/YandexGPT-5-Lite-8B-instruct-GGUF:latest",
    base_url="localhost:11434",
    temperature=0.0,
    keep_alive=1
)

prompt = "Hello. What is the sense of life?"
chain = PromptTemplate.from_template(prompt) | llm
response = chain.invoke({})
print(response)

However, despite setting CUDA_VISIBLE_DEVICES to "0", the second GPU is being used instead of the first one:

[Image: screenshot showing the model running on the second GPU instead of the first]

Is this a bug, or am I doing something wrong?

Relevant log output


OS

Linux

GPU

Nvidia

CPU

No response

Ollama version

0.12.10

GiteaMirror added the bug label 2026-04-29 09:34:48 -05:00

@rick-github commented on GitHub (Feb 3, 2026):

You have to set CUDA_VISIBLE_DEVICES in the server environment.

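For context, a stock Ubuntu install runs Ollama as a systemd service, so the variable has to be set on that service rather than in the client process. A minimal sketch, assuming the default ollama.service unit:

```shell
# Open an override file for the Ollama service.
sudo systemctl edit ollama.service

# In the override that opens, add:
#   [Service]
#   Environment="CUDA_VISIBLE_DEVICES=0"

# Reload units and restart the server so it picks up the new environment.
sudo systemctl daemon-reload
sudo systemctl restart ollama
```

This pins everything that server loads to GPU 0, which leads to the follow-up question below.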

@GreenMap-chan commented on GitHub (Feb 3, 2026):

> You have to set CUDA_VISIBLE_DEVICES in the server environment.

However, this server is used by many apps, some of which need to use the second GPU. How can I use only the first one for my script?


@rick-github commented on GitHub (Feb 3, 2026):

The client cannot choose which GPU is used to run the model. If you want to be able to select which GPU to use, run two ollama servers and use CUDA_VISIBLE_DEVICES to bind a GPU to each server, and have the client call the appropriate server.

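A minimal sketch of that setup, assuming two otherwise default servers distinguished only by port (the second port, 11435, is an arbitrary choice):

```shell
# Server A: sees only the first GPU, listens on the default port.
CUDA_VISIBLE_DEVICES=0 OLLAMA_HOST=127.0.0.1:11434 ollama serve &

# Server B: sees only the second GPU, listens on a separate port.
CUDA_VISIBLE_DEVICES=1 OLLAMA_HOST=127.0.0.1:11435 ollama serve &
```

A client then selects a GPU by pointing its base_url at the matching port, e.g. http://localhost:11434 for the first GPU and http://localhost:11435 for the second.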

@GreenMap-chan commented on GitHub (Feb 3, 2026):

> The client cannot choose which GPU is used to run the model. If you want to be able to select which GPU to use, run two ollama servers and use CUDA_VISIBLE_DEVICES to bind a GPU to each server, and have the client call the appropriate server.

Thank you for the suggestion. Running two Ollama servers and binding each server to a specific GPU using CUDA_VISIBLE_DEVICES seems like a good solution. I'll try this approach and see if it works for my use case.

Thanks again for your help!

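One quick way to confirm which server (and hence which GPU) handled a request, assuming the two-port layout sketched above, is to query that server's running-models endpoint afterwards:

```shell
# Send a request to the GPU-0 server, then list the models it currently has loaded.
curl http://127.0.0.1:11434/api/generate -d '{"model": "yandex/YandexGPT-5-Lite-8B-instruct-GGUF:latest", "prompt": "Hello"}'
curl http://127.0.0.1:11434/api/ps
```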
Reference: github-starred/ollama#55686