[GH-ISSUE #10707] Add option to disable CPU fallback when GPU memory is insufficient #69095

Open
opened 2026-05-04 17:08:30 -05:00 by GiteaMirror · 2 comments

Originally created by @seanthegeek on GitHub (May 14, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10707

Hi Ollama team,

Thanks for the awesome project — it's very fun to run local LLMs efficiently.

Feature Request

I'd like to request a configuration option (environment variable, CLI flag, or config file setting) to disable automatic CPU fallback when the GPU does not have enough memory to load a model.

Why?

In certain environments (e.g., real-time systems, performance-sensitive applications, or limited CPU access), silently falling back to CPU:

  • Results in drastically reduced performance.
  • Violates expected hardware usage.
  • Can lead to hidden failures or long processing delays.

For example, I use the same server and GPU to run LLMs and Stable Diffusion. I share access to OpenWebUI on the server to let friends try it out, but if I'm rendering an image at the same time, 100% of the CPU is used and response times are painfully slow. I'd rather have it error out so they can try again when the GPU is free.

Desired Behavior

Instead of falling back to CPU, Ollama should:

  • Return an error or fail fast when the model cannot fit in available GPU memory.
  • Optionally include a flag like --gpu-only or an environment variable like OLLAMA_GPU_ONLY=true.

Example

OLLAMA_GPU_ONLY=true ollama run llama3:70b
# => Error: insufficient GPU memory, model cannot be loaded

Current Workarounds

While some users restrict CPU usage externally (e.g., via Docker), it's a brittle workaround and doesn't clearly signal a failure due to GPU memory limits.
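For reference, a rough sketch of that Docker-based restriction, assuming the official ollama/ollama image (the --cpus value is just an example). It only caps how much CPU a fallback can consume; it does not make the load fail, which is exactly why it feels brittle.

```bash
# Brittle workaround: cap the container at 2 CPUs so an unwanted CPU fallback
# can't saturate the host. Note this does NOT fail fast; it only limits damage.
docker run -d --gpus=all --cpus=2 \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama
```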

Summary

This feature would give better control to users who need strict GPU enforcement and would help avoid unexpected performance degradation in critical systems.

Thanks again for all the great work you're doing!

GiteaMirror added the feature request label 2026-05-04 17:08:30 -05:00

@rick-github commented on GitHub (May 14, 2025):

As a workaround, use this in the service file.

ExecStart=bash -c 'exec prlimit --data=$[500 * 1024 * 1024] /usr/local/bin/ollama serve'

Windows and macOS have similar OS-level controls.
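If it helps, here is one way to apply that on a systemd-based Linux install (a sketch; the drop-in file name and the roughly 500 MiB data-segment limit are just examples, adjust to taste). The prlimit call caps the process's data segment (RLIMIT_DATA), so large host-memory allocations should fail instead of silently spilling the model into system RAM.

```bash
# Sketch: override ExecStart via a systemd drop-in instead of editing the unit file.
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/gpu-only.conf >/dev/null <<'EOF'
[Service]
# Clear the packaged ExecStart, then wrap ollama in prlimit (~500 MiB data segment).
ExecStart=
ExecStart=bash -c 'exec prlimit --data=$[500 * 1024 * 1024] /usr/local/bin/ollama serve'
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama
```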


@Crystal-Spider commented on GitHub (Mar 24, 2026):

I second this: having an environment variable would be very handy for me as well.
I run a server that handles other workloads in addition to LLMs. I want to prevent LLMs from running on the CPU when the GPUs are full and instead just fail the requests, so users know to try again later because the server is busy, rather than leaving them waiting on painfully slow CPU-loaded models.
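In the meantime, one application-level approximation (a sketch, not an Ollama feature): the /api/ps endpoint reports size and size_vram for each loaded model, so a front end could check after a load whether the model is fully resident in VRAM and refuse to serve it otherwise. The helper name and jq usage below are illustrative, and this only detects a fallback after it has happened rather than preventing it.

```bash
# Sketch: detect whether a loaded model spilled to CPU/system RAM.
# Assumes jq is installed and Ollama is listening on localhost:11434.
check_gpu_only() {
  local model="$1"
  curl -s http://localhost:11434/api/ps |
    jq -e --arg m "$model" \
      '.models[] | select(.name == $m) | .size_vram >= .size' >/dev/null ||
    { echo "Error: $model is not fully loaded on the GPU" >&2; return 1; }
}

check_gpu_only "llama3:70b" && echo "OK to serve requests"
```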
