[GH-ISSUE #1947] CUDA out of memory error with multi-GPU of different sizes #1121

Closed
opened 2026-04-12 10:52:01 -05:00 by GiteaMirror · 5 comments

Originally created by @m0wer on GitHub (Jan 12, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/1947

Originally assigned to: @mxyng on GitHub.

With two GPUs (RTX 2060 6GB + RTX 3090 24GB) and ollama 0.1.20 I get an OOM error and an ollama crash. In previous versions it would only try to fit 28/33 layers in VRAM, and that worked. This could be related to https://github.com/jmorganca/ollama/issues/1385

llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 8
llm_load_print_meta: n_expert_used    = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 46.70 B
llm_load_print_meta: model size       = 24.62 GiB (4.53 BPW)
llm_load_print_meta: general.name     = cognitivecomputations
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|im_end|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.38 MiB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  =  955.85 MiB
llm_load_tensors: offloading 31 repeating layers to GPU
llm_load_tensors: offloaded 31/33 layers to GPU
llm_load_tensors: VRAM used: 24260.41 MiB
.............................................................................................
CUDA error 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/ggml-cuda.cu:9007: out of memory
current device: 1
GGML_ASSERT: /go/src/github.com/jmorganca/ollama/llm/llama.cpp/ggml-cuda.cu:9007: !"CUDA error"
SIGABRT: abort
PC=0x7f59828cb9fc m=7 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 11 [syscall]:
runtime.cgocall(0x9c0710, 0xc0004de608)
        /usr/local/go/src/runtime/cgocall.go:157 +0x4b fp=0xc0004de5e0 sp=0xc0004de5a8 pc=0x4266ab
github.com/jmorganca/ollama/llm._Cfunc_dynamic_shim_llama_server_init({0x7f591c001280, 0x7f58c7d4b7b0, 0x7f58c7d3ed90, 0x7f58c7d41150, 0x7f58c7d58680, 0x7f58c7d48ca0, 0x7f58c7d40ff0, 0x7f58c7d3ee30, 0x7f58c7d587b0, 0x7f58c7d58b50, ...}, ...)
        _cgo_gotypes.go:291 +0x45 fp=0xc0004de608 sp=0xc0004de5e0 pc=0x7cce45
github.com/jmorganca/ollama/llm.(*shimExtServer).llama_server_init.func1(0x456c1b?, 0x80?, 0x80?)
        /go/src/github.com/jmorganca/ollama/llm/shim_ext_server.go:40 +0xec fp=0xc0004de6f8 sp=0xc0004de608 pc=0x7d220c
github.com/jmorganca/ollama/llm.(*shimExtServer).llama_server_init(0xc0000942d0?, 0x0?, 0x4377c8?)
        /go/src/github.com/jmorganca/ollama/llm/shim_ext_server.go:40 +0x13 fp=0xc0004de720 sp=0xc0004de6f8 pc=0x7d20f3
github.com/jmorganca/ollama/llm.newExtServer({0x2b39d1d8, 0xc0004d4120}, {0xc0004ce150, _}, {_, _, _}, {0x0, 0x0, 0x0}, ...)
        /go/src/github.com/jmorganca/ollama/llm/ext_server_common.go:139 +0x70e fp=0xc0004de8e0 sp=0xc0004de720 pc=0x7ce38e
GiteaMirror added the bug, nvidia labels 2026-04-12 10:52:01 -05:00

@jmorganca commented on GitHub (Jan 12, 2024):

Hi there! Thanks for the issue. Would it be possible to share the output of nvidia-smi? This will help me debug why it might be happening.

That said, I think I know what it is: the work for Ollama to schedule across GPUs of different sizes is still in progress (sorry!). Right now it will allocate most of the memory equally across all cards, which may be what's leading to a crash here, since half of the memory required for the model alone wouldn't fit on the 6GB card.
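For a rough sense of the arithmetic behind that explanation, here is a minimal back-of-the-envelope sketch (illustrative only, not Ollama's actual scheduler code) using the numbers from the log above: splitting the ~24.62 GiB of offloaded weights evenly across both cards would put far more on the RTX 2060 than its 6 GiB can hold.

```go
// Minimal sketch: an even split of the model weights across the two cards,
// compared against each card's total VRAM. Numbers taken from the
// llm_load_print_meta and nvidia-smi output in this thread.
package main

import "fmt"

func main() {
	modelGiB := 24.62                         // "model size" reported by llm_load_print_meta
	vramGiB := []float64{6.0, 24.0}           // RTX 2060, RTX 3090
	share := modelGiB / float64(len(vramGiB)) // naive even split per card

	for i, v := range vramGiB {
		fmt.Printf("GPU %d: ~%.1f GiB assigned, %.1f GiB available -> fits: %v\n",
			i, share, v, share <= v)
	}
}
```

An even split assigns roughly 12.3 GiB per card, about double the RTX 2060's total VRAM.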


@m0wer commented on GitHub (Jan 12, 2024):

Sure! The latest working version is 0.1.18 with CUDA_VISIBLE_DEVICES=0,1, which looks like:

08:51:04 root@sgn:~# nvidia-smi
Fri Jan 12 08:51:08 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2060        On  | 00000000:06:00.0 Off |                  N/A |
| 34%   27C    P8              14W / 128W |   5719MiB /  6144MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  | 00000000:07:00.0 Off |                  N/A |
| 30%   38C    P8              26W / 280W |  20389MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    134219      C   /bin/ollama                                5714MiB |
|    1   N/A  N/A    134219      C   /bin/ollama                               20378MiB |
+---------------------------------------------------------------------------------------+

Even in 0.1.18, if I change the order of the cards to 1,0 (the large-VRAM one first), it also crashes. With 0.1.19 and 0.1.20 it always crashes, for both possible orders of the GPUs.


@m0wer commented on GitHub (Jan 12, 2024):

Even in 0.1.18 it crashes from time to time after some use or with a larger context. In the state shown above, the memory details for the RTX 2060 6GB are:

    FB Memory Usage
        Total                             : 6144 MiB
        Reserved                          : 217 MiB
        Used                              : 5719 MiB
        Free                              : 206 MiB

So it's already pretty tight there (while the other one has plenty of free space).
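If it helps to watch how close that card gets to overflowing while a model is loaded, here is a minimal sketch that just shells out to nvidia-smi (assuming it is on PATH; the query flags are standard nvidia-smi options):

```go
// Minimal sketch: print per-GPU memory headroom by querying nvidia-smi.
// Useful for watching how tight the 6 GiB card gets under load.
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	out, err := exec.Command("nvidia-smi",
		"--query-gpu=index,name,memory.total,memory.used,memory.free",
		"--format=csv,noheader,nounits").Output()
	if err != nil {
		panic(err)
	}
	// Each line looks like: "0, NVIDIA GeForce RTX 2060, 6144, 5719, 206"
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		fmt.Println(line)
	}
}
```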


@pdevine commented on GitHub (May 17, 2024):

@m0wer do you know if this is still an issue? I don't have asymmetric cards to test this on.


@m0wer commented on GitHub (May 20, 2024):

> @m0wer do you know if this is still an issue? I don't have asymmetric cards to test this on.

Tested with the same setup on version v0.1.38 and it works perfectly! Thanks all :-)

Reference: github-starred/ollama#1121