[GH-ISSUE #1341] MultiGPU: not splitting model to multiple GPUs - CUDA out of memory #62735

Closed
opened 2026-05-03 10:08:53 -05:00 by GiteaMirror · 10 comments
Owner

Originally created by @chymian on GitHub (Dec 1, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1341

Originally assigned to: @mxyng on GitHub.

Trying to load a model (deepseek-coder) onto 2 GPUs fails with an OOM error.

the setup:
Linux: Ubuntu 22.04
HW: i5-7400 (AVX, AVX2), 32GB
GPU: 4 x 3070 8GB
ollama: 0.1.12, running in docker
nvidia-smi from within the container shows 2 x 3070.

Because of the large context size, I want to load the model across 2 GPUs, but it never uses the second one and fails after hitting OOM on the first GPU.

modelfile:

ollama show --modelfile coder-16k
# Modelfile generated by "ollama show"
# To build a new Modelfile based on this one, replace the FROM line with:
# FROM coder-16k:latest

FROM deepseek-coder:6.7b-base-q5_0
TEMPLATE """{{ .Prompt }}"""
PARAMETER num_ctx 16384
PARAMETER num_gpu 128
PARAMETER num_predict 756
PARAMETER seed 42
PARAMETER temperature 0.1
PARAMETER top_k 22
PARAMETER top_p 0.5

AVX:
it does not recognize/report AVX2, as you can see in the log.

HINT:
num_gpu, described as "layers to offload", is most misleading.
A parameter named num_gpus is used by all other loaders (fastchat, oobabooga's, vllm, etc.) to describe the number of GPUs to use, so the name is very misleading here.

IMHO, parameter names like these would be more telling:

  • tensor_split: number of GPUs to use
  • offload_layers: number of layers to offload
  • gpus: which GPUs to use, like CUDA_VISIBLE_DEVICES
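The last item can already be approximated at the container level: Docker's `--gpus` device selection (or the standard `CUDA_VISIBLE_DEVICES` variable) controls which GPUs the container sees at all. A minimal sketch, reusing the `docker run` invocation posted later in this thread; the device list `0,1` is illustrative:

```shell
# Expose only GPUs 0 and 1 to the container; inside it they appear as devices 0 and 1.
docker run --gpus '"device=0,1"' -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# Equivalent approach via the CUDA runtime's own variable:
docker run --gpus=all -e CUDA_VISIBLE_DEVICES=0,1 -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```

This only limits visibility; it does not change how ollama splits layers among the visible GPUs.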

here is the log of the failure.
this is the part where it OOM-errors on GPU0 and starts loading to CPU:

...
ollama-GPU23  | llm_load_print_meta: LF token  = 126 'Ä'                                                                            
ollama-GPU23  | llm_load_tensors: ggml ctx size =    0.11 MiB                                                                       
ollama-GPU23  | llm_load_tensors: using CUDA for GPU acceleration                                                                   
ollama-GPU23  | ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3070) as main device                                  
ollama-GPU23  | llm_load_tensors: mem required  =   86.73 MiB                                                                       
ollama-GPU23  | llm_load_tensors: offloading 32 repeating layers to GPU                                                    
ollama-GPU23  | llm_load_tensors: offloading non-repeating layers to GPU                                                   
ollama-GPU23  | llm_load_tensors: offloaded 35/35 layers to GPU                                                                     
ollama-GPU23  | llm_load_tensors: VRAM used: 4350.38 MiB                                                                            
ollama-GPU23  | ..................................................................................................                  
ollama-GPU23  | llama_new_context_with_model: n_ctx      = 16384                                                                    
ollama-GPU23  | llama_new_context_with_model: freq_base  = 100000.0                                                                 
ollama-GPU23  | llama_new_context_with_model: freq_scale = 0.25                                                                     
ollama-GPU23  | llama_kv_cache_init: offloading v cache to GPU                                                                      
ollama-GPU23  |                                                                                                                     
ollama-GPU23  | CUDA error 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:7957: out of memory             
ollama-GPU23  | current device: 0                                                                                                   
ollama-GPU23  | 2023/12/01 07:27:54 llama.go:436: 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:7957: out of memory
ollama-GPU23  | current device: 0                                                                                                   
ollama-GPU23  | 2023/12/01 07:27:54 llama.go:444: error starting llama runner: llama runner process has terminated         
ollama-GPU23  | 2023/12/01 07:27:54 llama.go:510: llama runner stopped successfully                                        
ollama-GPU23  | 2023/12/01 07:27:54 llama.go:421: starting llama runner                                                             
ollama-GPU23  | 2023/12/01 07:27:54 llama.go:479: waiting for llama runner to start responding                             
ollama-GPU23  | {"timestamp":1701415674,"level":"WARNING","function":"server_params_parse","line":2035,"message":"Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support","n_gpu_layers":-1}
ollama-GPU23  | {"timestamp":1701415674,"level":"INFO","function":"main","line":2534,"message":"build info","build":375,"commit":"9656026"}
ollama-GPU23  | {"timestamp":1701415674,"level":"INFO","function":"main","line":2537,"message":"system info","n_threads":4,"n_threads_batch":-1,"total_threads":4,"system_info":"AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | "}
ollama-GPU23  | llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /root/.ollama/models/blobs/sha256:5d80d0c539a5c90b360fbb2bc49261f3e28fae0e937452aea3948788c40cbba7 (version GGUF V2)
ollama-GPU23  | 
...
GiteaMirror added the bug, nvidia labels 2026-05-03 10:08:55 -05:00

@mlewis1973 commented on GitHub (Dec 1, 2023):

we have the Docker version installed on a 3 GPU system with nvidia-container-toolkit, and while the CLI doesn't work we can use the API interface. I see ollama-runner spawned on all three GPUs.


@chymian commented on GitHub (Dec 2, 2023):

we have the Docker version installed on a 3 GPU system with nvidia-container-toolkit, and while the CLI doesn't work we can use the API interface. I see ollama-runner spawned on all three GPUs.

and does it split models across different GPUs depending on VRAM? that doesn't work for me.
if you have that running, can you please post your modelfile, docker-compose, version, etc.


@anuradhawick commented on GitHub (Dec 3, 2023):

It seems to me like the tasks are divided among the GPUs. I could not find any documentation to support this though.

This is for model: llama2:13b

ollama-runner has two processes on each GPU. In the logs I read the following:

2023/12/03 18:30:11 llama.go:292: 46054 MB VRAM available, loading up to 196 GPU layers
2023/12/03 18:30:11 llama.go:421: starting llama runner
2023/12/03 18:30:11 llama.go:479: waiting for llama runner to start responding
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6
...........................LAYER INFO.............................
llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llm_load_tensors: mem required  =   88.02 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 43/43 layers to GPU
llm_load_tensors: VRAM used: 6936.01 MiB
...................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 3200.00 MiB
llama_new_context_with_model: kv self size  = 3200.00 MiB
llama_build_graph: non-view tensors processed: 924/924
llama_new_context_with_model: compute buffer total size = 361.07 MiB
llama_new_context_with_model: VRAM scratch buffer: 358.00 MiB
llama_new_context_with_model: total VRAM used: 10494.01 MiB (model: 6936.01 MiB, context: 3558.00 MiB)

Division of memory seems to be asymmetric, probably for good reasons.


@mlewis1973 commented on GitHub (Dec 4, 2023):

we have the Docker version installed on a 3 GPU system with nvidia-container-toolkit, and while the CLI doesn't work we can use the API interface. I see ollama-runner spawned on all three GPUs.

and does it split models to different GPU's depending on VRAM? that doesn't work for me. if you have that running, can you pls. post your modelfile, docker-compose, version etc.

no docker-compose......
Ubuntu 20
$ docker info
Client: Docker Engine - Community
Version: 24.0.7
....
Runtimes: io.containerd.runc.v2 nvidia runc
.....

$ docker run --gpus=all -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

$ curl http://localhost:11434/api/generate -d '{"model": "llama2","prompt": "Who was the most famous person of all time?","stream":false}'
{"model":"llama2","created_at":"2023-12-04T17:52:38.826266993Z","response":" Determining the most famous person of all time is a difficult task, as it depends on various factors such as cultural context, historical period, and personal opinions. However, here are some of the most renowned individuals throughout history who have had a significant impact on human civilization:\n\n1. Jesus Christ: Known as the central figure of Christianity, Jesus is considered by many to be the most famous person in history. His teachings, life, death, and resurrection have had a profound impact on billions of people around the world.\n2. Muhammad: As the prophet of Islam, Muhammad is revered by over 1.8 billion Muslims globally. His teachings and example have shaped the lives of millions of people for centuries,.....

Mon Dec 4 11:53:13 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.223.02 Driver Version: 470.223.02 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA TITAN Xp Off | 00000000:05:00.0 Off | N/A |
| 27% 46C P8 10W / 250W | 5997MiB / 12192MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA TITAN Xp Off | 00000000:09:00.0 Off | N/A |
| 23% 42C P8 10W / 250W | 1925MiB / 12196MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA TITAN Xp Off | 00000000:0B:00.0 Off | N/A |
| 23% 39C P8 10W / 250W | 1991MiB / 12196MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2316 G /usr/lib/xorg/Xorg 46MiB |
| 0 N/A N/A 2382 G /usr/bin/gnome-shell 13MiB |
| 0 N/A N/A 162891 C ...ffice/program/soffice.bin 145MiB |
| 0 N/A N/A 182102 C python 2461MiB |
| 0 N/A N/A 477037 C ...ld/cuda/bin/ollama-runner 3325MiB |
| 1 N/A N/A 2316 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 477037 C ...ld/cuda/bin/ollama-runner 1915MiB |
| 2 N/A N/A 2316 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 477037 C ...ld/cuda/bin/ollama-runner 1981MiB |
+-----------------------------------------------------------------------------+


@Stampede commented on GitHub (Dec 10, 2023):

trying to load a model (deepseek-coder) to 2 GPUs fails with OOM-error.

the setup: Linux: ubu 22.04 HW: i5-7400 (AVX, AVX2), 32GB GPU: 4 x 3070 8GB ollama: 0.1.12, running in docker nvidia-smi from within the container shows 2 x 3070.

For what it's worth, I have similar system specs as you do, and I am getting the same error log messages.

Out of memory errors and also:

"Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support"

I am not running the docker version of ollama; I downloaded the binary to my ~/bin directory and run it from there.

ollama version 0.1.13

ollama was working fine until I put in the 2nd GPU.


@dhiltgen commented on GitHub (Mar 12, 2024):

We've been making improvements to our memory prediction algorithm, but it still isn't perfect yet. In general, there's a chunk of memory that gets allocated on the first GPU, then the remainder is spread evenly across the GPUs.

In the next release (0.1.29) we'll be adding a new setting, OLLAMA_MAX_VRAM=<bytes>, that lets you set a lower VRAM limit to work around this type of crash until we get the prediction logic fixed. For example, you could start with 30G and experiment until you find a setting that loads as many layers as possible without hitting the OOM crash: OLLAMA_MAX_VRAM=32212254720
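For reference, the quoted value is just 30 GiB expressed in bytes; a quick shell check (the serve invocation at the end is illustrative, not from this thread):

```shell
# 30 GiB in bytes: 30 * 1024^3
bytes=$((30 * 1024 * 1024 * 1024))
echo "$bytes"   # prints 32212254720

# illustrative usage with the workaround variable described above:
# OLLAMA_MAX_VRAM="$bytes" ollama serve
```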


@insunaa commented on GitHub (Mar 14, 2024):

Can you make the parameter parse units? For example OLLAMA_MAX_VRAM=30G or OLLAMA_MAX_VRAM=1T etc.
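Until (and unless) the variable parses units itself, GNU coreutils `numfmt` can do the conversion in a wrapper script; a small sketch (the serve invocation is illustrative):

```shell
# numfmt understands IEC suffixes (K, M, G, T = powers of 1024)
numfmt --from=iec 30G    # prints 32212254720

# illustrative wrapper usage:
# OLLAMA_MAX_VRAM="$(numfmt --from=iec 30G)" ollama serve
```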


@dhiltgen commented on GitHub (Mar 21, 2024):

@insunaa this variable is only meant to be a temporary workaround until we get the memory prediction fixed so OOM crashes no longer happen.


@dhiltgen commented on GitHub (May 2, 2024):

The latest release 0.1.33 further refines our handling of multi-GPU setups, and our memory prediction algorithms. Please give it a try and let us know if you're still seeing problems.

https://github.com/ollama/ollama/releases


@jmorganca commented on GitHub (May 9, 2024):

This should be fixed now – however we are still working on multi-gpu memory allocation so please do share any issues you're hitting!


Reference: github-starred/ollama#62735