[GH-ISSUE #5754] OLLAMA_MAX_VRAM is ignored #65621

Closed
opened 2026-05-03 21:55:51 -05:00 by GiteaMirror · 5 comments

Originally created by @BartWillems on GitHub (Jul 17, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5754

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

I'm trying to limit the GPU memory usage, so I set the OLLAMA_MAX_VRAM env var.
I see it is correctly parsed in the logs, but the limit itself is ignored.

When I set the limit to 5000000000 (5GB), the llama3:8b model uses 6172MiB according to nvidia-smi.
Even when I set it to an absurdly low value like 5, it still uses more than 6GB of memory.
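
As a quick sanity check on units (this arithmetic is mine, not part of the original report): 5000000000 bytes is roughly 4.66 GiB, while the 6172MiB reported by nvidia-smi is roughly 6.03 GiB, so the overshoot is not just a GB-vs-GiB mix-up; the limit really is being ignored. A minimal Go sketch of that arithmetic:

```go
package main

import "fmt"

func main() {
	const limitBytes = 5_000_000_000              // OLLAMA_MAX_VRAM=5000000000, i.e. 5 GB decimal
	const reportedMiB = 6172                      // usage reported by nvidia-smi
	const reportedBytes = reportedMiB * 1024 * 1024 // MiB -> bytes

	fmt.Printf("requested limit: %.2f GiB\n", float64(limitBytes)/(1<<30))    // ~4.66 GiB
	fmt.Printf("reported usage:  %.2f GiB\n", float64(reportedBytes)/(1<<30)) // ~6.03 GiB
	// The reported usage exceeds the limit in either unit, so this is not a
	// GB-vs-GiB confusion; the limit is simply not applied.
}
```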

OS

Linux, Docker

GPU

Nvidia

CPU

AMD

Ollama version

0.2.5

GiteaMirror added the bug label 2026-05-03 21:55:51 -05:00

@ProjectMoon commented on GitHub (Jul 19, 2024):

I also ran into this. As far as I can tell, the config option isn't actually used anywhere in the codebase at the moment?


@brodieferguson commented on GitHub (Jul 23, 2024):

I was setting it manually since I'm on Docker under WSL2, and the autodetection always seemed to push me over into "shared" memory because of the new NVIDIA memory handling. That was slower than manually setting a limit about 1GB lower and having the extra layers offloaded to the CPU.


@ixenion commented on GitHub (Oct 2, 2025):

same issue, got any solution?


@stephanschielke commented on GitHub (Dec 20, 2025):

Just to spell it out because I landed here as well:

OLLAMA_MAX_VRAM was deprecated and is no longer in use.
It was removed in cc269ba: https://github.com/ollama/ollama/pull/5855/changes/cc269ba0943ee1fa0bddcce8027d0a6d1b86fec5#diff-8e494a434a8037b6c0b888e25b2baae7618fe65e792d4a155dadd096e9350667L1347

Apparently, the OLLAMA_MAX_VRAM environment variable was intended as a quick workaround to avoid Out Of Memory errors when working with multiple GPUs.

Instead of setting a fixed VRAM limit, you can try setting OLLAMA_GPU_OVERHEAD, which, as far as I understand, reserves a portion of VRAM per GPU rather than greedily allocating as much as possible.
https://github.com/dhiltgen/ollama/blob/main/envconfig/config.go#L262

Note: The OLLAMA_GPU_OVERHEAD var is expecting the "reserved" VRAM memory in bytes!
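
To illustrate the unit (this example is mine, not from the comment above): reserving roughly 2 GiB per GPU would mean OLLAMA_GPU_OVERHEAD=2147483648, not 2. A small Go sketch of the conversion; the gibToBytes helper is purely illustrative and not part of Ollama:

```go
package main

import "fmt"

// gibToBytes is a purely illustrative helper (not part of Ollama): it turns a
// desired per-GPU reservation in GiB into the byte value that
// OLLAMA_GPU_OVERHEAD expects.
func gibToBytes(gib float64) uint64 {
	return uint64(gib * (1 << 30))
}

func main() {
	// The variable is read by the ollama server process, so in practice it
	// would be exported (or set in the Docker / systemd environment) before
	// `ollama serve` starts; this just prints the values to use.
	fmt.Printf("OLLAMA_GPU_OVERHEAD=%d   # reserve ~2 GiB per GPU\n", gibToBytes(2))
	fmt.Printf("OLLAMA_GPU_OVERHEAD=%d   # reserve ~1.5 GiB per GPU\n", gibToBytes(1.5))
}
```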


@machineonamission commented on GitHub (Feb 20, 2026):

OLLAMA_GPU_OVERHEAD is such a bizarre solution, because when the model is loaded it always makes sure that much VRAM is free, which feels odd to me. I have a server that runs a bunch of other VRAM-using things, and I want to say "Ollama can only use 2GB, figure out how best to use it"; the overhead approach instead changes the effective limit depending on the current free memory, which will obviously vary.
It's odd that the config option was removed, and num_gpu isn't a great substitute because it requires per-model configuration and isn't a GB limit but an arbitrary layer count.
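
For what it's worth, num_gpu can at least be set per request through the API options instead of baking it into a Modelfile. A hedged Go sketch against a local server (model name, prompt, and layer count are example values, and num_gpu remains a layer count, not a byte limit):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Example request against a local Ollama server; model name, prompt and
	// the num_gpu value are placeholders. num_gpu caps how many layers are
	// offloaded to the GPU for this request; it is not a memory limit.
	body, err := json.Marshal(map[string]any{
		"model":  "llama3:8b",
		"prompt": "Why is the sky blue?",
		"stream": false,
		"options": map[string]any{
			"num_gpu": 20,
		},
	})
	if err != nil {
		panic(err)
	}

	resp, err := http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}
```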

Reference: github-starred/ollama#65621