[GH-ISSUE #2752] CUDA error: out of memory #48170

Closed
opened 2026-04-28 06:58:53 -05:00 by GiteaMirror · 3 comments

Originally created by @kennethwork101 on GitHub (Feb 25, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2752

CUDA error: out of memory

ollama version is 0.1.27
Windows 11, WSL2, Ubuntu 22.04
RTX 4070 Ti

Running a set of tests, with each test loading a different model via Ollama.
Partway through testing, we ran into the CUDA error: out of memory three times.
Note that each model being loaded is under 10 GB in size, and the RTX 4070 Ti has 12 GB of VRAM.

Is this an issue with Ollama, or should I reduce the number of tests?
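
One way to narrow this down is to sample VRAM usage between test runs and check whether used memory creeps upward across model loads. A rough diagnostic sketch in Python (nothing here is part of Ollama; the helper name is hypothetical, and it assumes nvidia-smi is on PATH inside WSL2):

import subprocess

def gpu_mem_used_mib(device_index: int = 0) -> int:
    """Return used VRAM in MiB for one GPU, as reported by nvidia-smi."""
    out = subprocess.check_output(
        [
            "nvidia-smi",
            "-i", str(device_index),
            "--query-gpu=memory.used",
            "--format=csv,noheader,nounits",
        ],
        text=True,
    )
    return int(out.strip())

# Sample before each test: if the baseline keeps rising after every
# model load/unload, memory is not being released between tests.
print(f"VRAM used before test: {gpu_mem_used_mib()} MiB")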

..................................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: yes
ggml_init_cublas: CUDA_USE_TENSOR_CORES: no
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 Ti, compute capability 8.9, VMM: yes
llama_kv_cache_init: CUDA0 KV buffer size = 2048.00 MiB
llama_new_context_with_model: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 17.04 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 296.02 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 8.00 MiB
llama_new_context_with_model: graph splits (measure): 3
CUDA error: out of memory
current device: 0, in function ggml_cuda_pool_malloc_vmm at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/ggml-cuda.cu:7976
cuMemAddressReserve(&g_cuda_pool_addr[device], CUDA_POOL_VMM_MAX_SIZE, 0, 0, 0)
GGML_ASSERT: /go/src/github.com/jmorganca/ollama/llm/llama.cpp/ggml-cuda.cu:243: !"CUDA error"
SIGABRT: abort
PC=0x7fbf6e5f79fc m=26 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 296 [syscall, 13 minutes]:
runtime.cgocall(0x9bcdd0, 0xc000520748)
/usr/local/go/src/runtime/cgocall.go:157 +0x4b fp=0xc000520720 sp=0xc0005206e8 pc=0x409b0b
github.com/jmorganca/ollama/llm._Cfunc_dyn_llama_server_init({0x7fbed8001270, 0x7fbecc44c250, 0x7fbecc43cca0, 0x7fbecc43ff20, 0x7fbecc44fc00, 0x7fbecc449840, 0x7fbecc43fba0, 0x7fbecc43cd20, 0x7fbecc450500, 0x7fbecc44f7a0, ...}, ...)

The second error is similar, but some of the buffer sizes are different:

ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 Ti, compute capability 8.9, VMM: yes
llama_kv_cache_init: CUDA0 KV buffer size = 768.00 MiB
llama_new_context_with_model: KV self size = 768.00 MiB, K (f16): 384.00 MiB, V (f16): 384.00 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 17.04 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 296.02 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 8.00 MiB
llama_new_context_with_model: graph splits (measure): 3
CUDA error: out of memory

And the third occurrence:

ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 Ti, compute capability 8.9, VMM: yes
llama_kv_cache_init: CUDA0 KV buffer size = 768.00 MiB
llama_new_context_with_model: KV self size = 768.00 MiB, K (f16): 384.00 MiB, V (f16): 384.00 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 17.04 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 296.02 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 8.00 MiB
llama_new_context_with_model: graph splits (measure): 3
[GIN] 2024/02/25 - 14:24:49 | 200 | 421.392µs | 127.0.0.1 | GET "/api/version"
CUDA error: out of memory


@jmorganca commented on GitHub (Feb 25, 2024):

Hi there, sorry this happened. Any OOM error is definitely an issue we will fix in Ollama directly. In this case, is the error happening immediately or after several runs (or model loads/unloads)? Thanks!
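
One way to answer this empirically is to load and unload a single model in a loop and log VRAM after each cycle. A rough sketch against the local Ollama API (the model name is a placeholder; keep_alive: 0, which requests an immediate unload, is assumed to be supported by this Ollama version):

import subprocess
import time

import requests

OLLAMA_URL = "http://127.0.0.1:11434"  # default Ollama listen address

def gpu_mem_used_mib() -> int:
    """Return used VRAM in MiB, as reported by nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return int(out.strip())

for cycle in range(20):
    # Load the model and run a trivial prompt.
    requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={"model": "llama2", "prompt": "hi", "stream": False},
        timeout=600,
    ).raise_for_status()
    used_loaded = gpu_mem_used_mib()

    # keep_alive: 0 asks the server to unload the model right away
    # (assumption: available in this version of the API).
    requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={"model": "llama2", "keep_alive": 0},
        timeout=60,
    ).raise_for_status()
    time.sleep(2)  # give the runner process a moment to exit
    print(f"cycle {cycle}: loaded={used_loaded} MiB, "
          f"after unload={gpu_mem_used_mib()} MiB")

If used memory keeps climbing across cycles, that points at memory not being released on unload rather than any single model being too large for the 12 GB card.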


@jmorganca commented on GitHub (Feb 25, 2024):

Will merge with https://github.com/ollama/ollama/issues/1952 since it's most likely the root issue


@kennethwork101 commented on GitHub (Feb 25, 2024):

It took about 50 minutes of testing before this happened.
Note that different models are being loaded during testing, but they are all under 10 GB in size.
