[GH-ISSUE #1341] MultiGPU: not splitting model to multiple GPUs - CUDA out of memory #62735

Closed
opened 2026-05-03 10:08:53 -05:00 by GiteaMirror · 10 comments
Owner

Originally created by @chymian on GitHub (Dec 1, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1341

Originally assigned to: @mxyng on GitHub.

Trying to load a model (deepseek-coder) onto 2 GPUs fails with an OOM error.

the setup:
Linux: Ubuntu 22.04
HW: i5-7400 (AVX, AVX2), 32GB
GPU: 4 x 3070 8GB
ollama: 0.1.12, running in docker
nvidia-smi from within the container shows 2 x 3070.

Because of the large context size, I want to load the model across 2 GPUs, but it never uses the second one and fails after hitting OOM on the first GPU.

modelfile:

ollama show --modelfile coder-16k
# Modelfile generated by "ollama show"
# To build a new Modelfile based on this one, replace the FROM line with:
# FROM coder-16k:latest

FROM deepseek-coder:6.7b-base-q5_0
TEMPLATE """{{ .Prompt }}"""
PARAMETER num_ctx 16384
PARAMETER num_gpu 128
PARAMETER num_predict 756
PARAMETER seed 42
PARAMETER temperature 0.1
PARAMETER top_k 22
PARAMETER top_p 0.5

AVX:
it does not recognize/report AVX2, as you can see in the log.

HINT:
num_gpu, described as "layers to offload", is most misleading.
A parameter named num_gpus is used by all other loaders (fastchat, oobabooga's, vllm, etc.) to describe the number of GPUs to use, so the name is very misleading here.

IMHO, parameter names like these would be more telling:

  • tensor_split: number of GPUs to use
  • offload_layers: number of layers to offload
  • gpus: which GPUs to use, like CUDA_VISIBLE_DEVICES
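The last item can already be approximated at the container level: Docker's `--gpus` device selection (or the standard `CUDA_VISIBLE_DEVICES` variable) controls which GPUs the container sees at all. A minimal sketch, reusing the `docker run` invocation posted later in this thread; the device list `0,1` is illustrative:

```shell
# Expose only GPUs 0 and 1 to the container; inside it they appear as devices 0 and 1.
docker run --gpus '"device=0,1"' -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# Equivalent approach via the CUDA runtime's own variable:
docker run --gpus=all -e CUDA_VISIBLE_DEVICES=0,1 -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```

This only limits visibility; it does not change how ollama splits layers among the visible GPUs.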

here is the log of the failure.
this is the part where it OOM-errors on GPU0 and starts loading to CPU:

...
ollama-GPU23  | llm_load_print_meta: LF token  = 126 'Ä'                                                                            
ollama-GPU23  | llm_load_tensors: ggml ctx size =    0.11 MiB                                                                       
ollama-GPU23  | llm_load_tensors: using CUDA for GPU acceleration                                                                   
ollama-GPU23  | ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3070) as main device                                  
ollama-GPU23  | llm_load_tensors: mem required  =   86.73 MiB                                                                       
ollama-GPU23  | llm_load_tensors: offloading 32 repeating layers to GPU                                                    
ollama-GPU23  | llm_load_tensors: offloading non-repeating layers to GPU                                                   
ollama-GPU23  | llm_load_tensors: offloaded 35/35 layers to GPU                                                                     
ollama-GPU23  | llm_load_tensors: VRAM used: 4350.38 MiB                                                                            
ollama-GPU23  | ..................................................................................................                  
ollama-GPU23  | llama_new_context_with_model: n_ctx      = 16384                                                                    
ollama-GPU23  | llama_new_context_with_model: freq_base  = 100000.0                                                                 
ollama-GPU23  | llama_new_context_with_model: freq_scale = 0.25                                                                     
ollama-GPU23  | llama_kv_cache_init: offloading v cache to GPU                                                                      
ollama-GPU23  |                                                                                                                     
ollama-GPU23  | CUDA error 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:7957: out of memory             
ollama-GPU23  | current device: 0                                                                                                   
ollama-GPU23  | 2023/12/01 07:27:54 llama.go:436: 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:7957: out of memory
ollama-GPU23  | current device: 0                                                                                                   
ollama-GPU23  | 2023/12/01 07:27:54 llama.go:444: error starting llama runner: llama runner process has terminated         
ollama-GPU23  | 2023/12/01 07:27:54 llama.go:510: llama runner stopped successfully                                        
ollama-GPU23  | 2023/12/01 07:27:54 llama.go:421: starting llama runner                                                             
ollama-GPU23  | 2023/12/01 07:27:54 llama.go:479: waiting for llama runner to start responding                             
ollama-GPU23  | {"timestamp":1701415674,"level":"WARNING","function":"server_params_parse","line":2035,"message":"Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support","n_gpu_layers":-1}
ollama-GPU23  | {"timestamp":1701415674,"level":"INFO","function":"main","line":2534,"message":"build info","build":375,"commit":"9656026"}
ollama-GPU23  | {"timestamp":1701415674,"level":"INFO","function":"main","line":2537,"message":"system info","n_threads":4,"n_threads_batch":-1,"total_threads":4,"system_info":"AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | "}
ollama-GPU23  | llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /root/.ollama/models/blobs/sha256:5d80d0c539a5c90b360fbb2bc49261f3e28fae0e937452aea3948788c40cbba7 (version GGUF V2)
ollama-GPU23  | 
...
GiteaMirror added the bug, nvidia labels 2026-05-03 10:08:55 -05:00

@mlewis1973 commented on GitHub (Dec 1, 2023):

we have the Docker version installed on a 3 GPU system with nvidia-container-toolkit, and while the CLI doesn't work we can use the API interface. I see ollama-runner spawned on all three GPUs.


@chymian commented on GitHub (Dec 2, 2023):

we have the Docker version installed on a 3 GPU system with nvidia-container-toolkit, and while the CLI doesn't work we can use the API interface. I see ollama-runner spawned on all three GPUs.

and does it split models across different GPUs depending on VRAM? that doesn't work for me.
if you have that running, can you please post your modelfile, docker-compose, version, etc.


@anuradhawick commented on GitHub (Dec 3, 2023):

It seems to me like the tasks are divided among the GPUs. I could not find any documentation to support this though.

This is for model: llama2:13b

ollama-runner has two processes on each GPU. In the logs I read the following:

2023/12/03 18:30:11 llama.go:292: 46054 MB VRAM available, loading up to 196 GPU layers
2023/12/03 18:30:11 llama.go:421: starting llama runner
2023/12/03 18:30:11 llama.go:479: waiting for llama runner to start responding
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6
...........................LAYER INFO.............................
llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llm_load_tensors: mem required  =   88.02 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 43/43 layers to GPU
llm_load_tensors: VRAM used: 6936.01 MiB
...................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 3200.00 MiB
llama_new_context_with_model: kv self size  = 3200.00 MiB
llama_build_graph: non-view tensors processed: 924/924
llama_new_context_with_model: compute buffer total size = 361.07 MiB
llama_new_context_with_model: VRAM scratch buffer: 358.00 MiB
llama_new_context_with_model: total VRAM used: 10494.01 MiB (model: 6936.01 MiB, context: 3558.00 MiB)

Division of memory seems to be asymmetric, probably for good reasons.


@mlewis1973 commented on GitHub (Dec 4, 2023):

we have the Docker version installed on a 3 GPU system with nvidia-container-toolkit, and while the CLI doesn't work we can use the API interface. I see ollama-runner spawned on all three GPUs.

and does it split models to different GPU's depending on VRAM? that doesn't work for me. if you have that running, can you pls. post your modelfile, docker-compose, version etc.

no docker-compose......
Ubuntu 20
$ docker info
Client: Docker Engine - Community
Version: 24.0.7
....
Runtimes: io.containerd.runc.v2 nvidia runc
.....

$ docker run --gpus=all -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

$ curl http://localhost:11434/api/generate -d '{"model": "llama2","prompt": "Who was the most famous person of all time?","stream":false}'
{"model":"llama2","created_at":"2023-12-04T17:52:38.826266993Z","response":" Determining the most famous person of all time is a difficult task, as it depends on various factors such as cultural context, historical period, and personal opinions. However, here are some of the most renowned individuals throughout history who have had a significant impact on human civilization:\n\n1. Jesus Christ: Known as the central figure of Christianity, Jesus is considered by many to be the most famous person in history. His teachings, life, death, and resurrection have had a profound impact on billions of people around the world.\n2. Muhammad: As the prophet of Islam, Muhammad is revered by over 1.8 billion Muslims globally. His teachings and example have shaped the lives of millions of people for centuries,.....

Mon Dec 4 11:53:13 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.223.02 Driver Version: 470.223.02 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA TITAN Xp Off | 00000000:05:00.0 Off | N/A |
| 27% 46C P8 10W / 250W | 5997MiB / 12192MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA TITAN Xp Off | 00000000:09:00.0 Off | N/A |
| 23% 42C P8 10W / 250W | 1925MiB / 12196MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA TITAN Xp Off | 00000000:0B:00.0 Off | N/A |
| 23% 39C P8 10W / 250W | 1991MiB / 12196MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2316 G /usr/lib/xorg/Xorg 46MiB |
| 0 N/A N/A 2382 G /usr/bin/gnome-shell 13MiB |
| 0 N/A N/A 162891 C ...ffice/program/soffice.bin 145MiB |
| 0 N/A N/A 182102 C python 2461MiB |
| 0 N/A N/A 477037 C ...ld/cuda/bin/ollama-runner 3325MiB |
| 1 N/A N/A 2316 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 477037 C ...ld/cuda/bin/ollama-runner 1915MiB |
| 2 N/A N/A 2316 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 477037 C ...ld/cuda/bin/ollama-runner 1981MiB |
+-----------------------------------------------------------------------------+


@Stampede commented on GitHub (Dec 10, 2023):

trying to load a model (deepseek-coder) to 2 GPUs fails with OOM-error.

the setup: Linux: ubu 22.04 HW: i5-7400 (AVX, AVX2), 32GB GPU: 4 x 3070 8GB ollama: 0.1.12, running in docker nvidia-smi from within the container shows 2 x 3070.

For what it's worth, I have similar system specs as you do, and I am getting the same error log messages.

Out of memory errors and also:

"Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support"

I am not running the docker version of ollama; I downloaded the binary to my ~/bin directory and run it from there.

ollama version 0.1.13

ollama was working fine until I put in the 2nd GPU.


@dhiltgen commented on GitHub (Mar 12, 2024):

We've been making improvements to our memory prediction algorithm, but it still isn't perfect yet. In general, there's a chunk of memory that gets allocated on the first GPU, then the remainder is spread evenly across the GPUs.

In the next release (0.1.29) we'll be adding a new setting, OLLAMA_MAX_VRAM=<bytes>, that lets you set a lower VRAM limit to work around this type of crash until we get the prediction logic fixed. For example, you could start with 30G and experiment until you find a setting that loads as many layers as possible without hitting the OOM crash: OLLAMA_MAX_VRAM=32212254720
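For reference, the quoted value is just 30 GiB expressed in bytes; a quick shell check (the serve invocation at the end is illustrative, not from this thread):

```shell
# 30 GiB in bytes: 30 * 1024^3
bytes=$((30 * 1024 * 1024 * 1024))
echo "$bytes"   # prints 32212254720

# illustrative usage with the workaround variable described above:
# OLLAMA_MAX_VRAM="$bytes" ollama serve
```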


@insunaa commented on GitHub (Mar 14, 2024):

Can you make the parameter parse units? For example OLLAMA_MAX_VRAM=30G or OLLAMA_MAX_VRAM=1T etc.
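Until (and unless) the variable parses units itself, GNU coreutils `numfmt` can do the conversion in a wrapper script; a small sketch (the serve invocation is illustrative):

```shell
# numfmt understands IEC suffixes (K, M, G, T = powers of 1024)
numfmt --from=iec 30G    # prints 32212254720

# illustrative wrapper usage:
# OLLAMA_MAX_VRAM="$(numfmt --from=iec 30G)" ollama serve
```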


@dhiltgen commented on GitHub (Mar 21, 2024):

@insunaa this variable is only meant to be a temporary workaround until we get the memory prediction fixed so OOM crashes no longer happen.


@dhiltgen commented on GitHub (May 2, 2024):

The latest release 0.1.33 further refines our handling of multi-GPU setups, and our memory prediction algorithms. Please give it a try and let us know if you're still seeing problems.

https://github.com/ollama/ollama/releases


@jmorganca commented on GitHub (May 9, 2024):

This should be fixed now – however we are still working on multi-gpu memory allocation so please do share any issues you're hitting!


Reference: github-starred/ollama#62735