[GH-ISSUE #1105] Out of memory when using multiple GPUs #47064

Closed
opened 2026-04-28 02:57:17 -05:00 by GiteaMirror · 5 comments

Originally created by @BruceMacD on GitHub (Nov 12, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1105

Originally assigned to: @BruceMacD on GitHub.

When a system has multiple GPUs, generation (e.g. `ollama run ...`) may fail with an `out of memory` error.

```
Nov 05 22:41:50 example.com ollama[943528]: 2023/11/05 22:41:50 llama.go:259: 7197 MB VRAM available, loading up to 47 GPU layers
Nov 05 22:41:50 example.com ollama[943528]: 2023/11/05 22:41:50 llama.go:370: starting llama runner
Nov 05 22:41:50 example.com ollama[943528]: 2023/11/05 22:41:50 llama.go:428: waiting for llama runner to start responding
Nov 05 22:41:50 example.com ollama[943528]: ggml_init_cublas: found 2 CUDA devices:
Nov 05 22:41:50 example.com ollama[943528]:   Device 0: NVIDIA GeForce RTX 3060 Ti, compute capability 8.6
Nov 05 22:41:50 example.com ollama[943528]:   Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6
Nov 05 22:41:52 example.com ollama[1418565]: {"timestamp":1699245712,"level":"INFO","function":"main","line":1323,"message":"build info","build":219,"commit":"9e70cc0"}
Nov 05 22:41:52 example.com ollama[1418565]: {"timestamp":1699245712,"level":"INFO","function":"main","line":1325,"message":"system info","n_threads":8,"n_threads_batch":-1,"total_threads":16,"system_info":"AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | "}
Nov 05 22:41:52 example.com ollama[943528]: llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /usr/share/ollama/.ollama/models/blobs/sha256:22f7f8ef5f4c791c1b03d7eb414399294764d7cc82c7e94aa81a1feb80a983a2 (version GGUF V2 (latest))
Nov 05 22:41:52 example.com ollama[943528]: llama_model_loader: - tensor    0:                token_embd.weight q4_0     [  4096, 32000,     1,     1 ]
...
Nov 05 22:41:52 example.com ollama[943528]: llm_load_tensors: ggml ctx size =    0.10 MB
Nov 05 22:41:52 example.com ollama[943528]: llm_load_tensors: using CUDA for GPU acceleration
Nov 05 22:41:52 example.com ollama[943528]: ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3060 Ti) as main device
Nov 05 22:41:52 example.com ollama[943528]: llm_load_tensors: mem required  =   70.41 MB
Nov 05 22:41:52 example.com ollama[943528]: llm_load_tensors: offloading 32 repeating layers to GPU
Nov 05 22:41:52 example.com ollama[943528]: llm_load_tensors: offloading non-repeating layers to GPU
Nov 05 22:41:52 example.com ollama[943528]: llm_load_tensors: offloaded 35/35 layers to GPU
Nov 05 22:41:52 example.com ollama[943528]: llm_load_tensors: VRAM used: 3577.55 MB
Nov 05 22:41:53 example.com ollama[943528]: ....................................................................
Nov 05 22:41:53 example.com ollama[943528]: CUDA error 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:7233: out of memory
Nov 05 22:41:53 example.com ollama[943528]: current device: 0
Nov 05 22:41:53 example.com ollama[943528]: 2023/11/05 22:41:53 llama.go:385: 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:7233: out of memory
Nov 05 22:41:53 example.com ollama[943528]: current device: 0
Nov 05 22:41:53 example.com ollama[943528]: 2023/11/05 22:41:53 llama.go:393: error starting llama runner: llama runner process has terminated
Nov 05 22:41:53 example.com ollama[943528]: 2023/11/05 22:41:53 llama.go:459: llama runner stopped successfully
```
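
A quick way to see what each device actually has free while reproducing this is to watch per-GPU memory alongside the runner startup. This is just a sketch; the query fields are standard `nvidia-smi` options:

```bash
# One-off per-device VRAM snapshot (index, name, total/used/free memory).
nvidia-smi --query-gpu=index,name,memory.total,memory.used,memory.free --format=csv

# Or poll while the model loads (1-second interval is arbitrary).
watch -n 1 nvidia-smi --query-gpu=index,memory.used,memory.free --format=csv,noheader
```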

Possibly related:
https://github.com/ggerganov/llama.cpp/issues/1866
https://github.com/ggerganov/llama.cpp/issues/2432

GiteaMirror added the bug label 2026-04-28 02:57:17 -05:00

@phalexo commented on GitHub (Dec 4, 2023):

I get a similar error using multiple GPUs or a single GPU, even when the model is really too small to cause an OOM. The same models appear to work on the host, so if I set `CUDA_VISIBLE_DEVICE=''` it runs OK on the host.


@farhanhubble commented on GitHub (Dec 7, 2023):

> I get a similar error using multiple or a single GPU when the model is really too small for an OOM. The same models appear to work on the host. So if I set CUDA_VISIBLE_DEVICE='' it runs ok on the host.

Nitpick: The environment variable should be `CUDA_VISIBLE_DEVICES=''`
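
For anyone wanting to apply that workaround, here is a minimal sketch for a systemd-managed install; the `ollama` service name and the device index are assumptions, adjust them to your setup:

```bash
# Hide all GPUs from the ollama service to force CPU-only inference (sketch).
sudo systemctl edit ollama
# In the override that opens, add:
#   [Service]
#   Environment="CUDA_VISIBLE_DEVICES="
sudo systemctl restart ollama

# Or, for a manual run, pin the runner to a single GPU instead:
CUDA_VISIBLE_DEVICES=0 ollama serve
```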


@phalexo commented on GitHub (Dec 16, 2023):

```bash
git clone --recursive https://github.com/jmorganca/ollama.git
cd ollama/llm/llama.cpp
vi generate_linux.go
```

```go
//go:generate cmake -S ggml -B ggml/build/cuda -DLLAMA_CUBLAS=on -DLLAMA_ACCELERATE=on -DLLAMA_K_QUANTS=on -DLLAMA_CUDA_FORCE_MMQ=on
//go:generate cmake --build ggml/build/cuda --target server --config Release
//go:generate mv ggml/build/cuda/bin/server ggml/build/cuda/bin/ollama-runner
//go:generate cmake -S gguf -B gguf/build/cuda -DLLAMA_CUBLAS=on -DLLAMA_ACCELERATE=on -DLLAMA_K_QUANTS=on -DLLAMA_NATIVE=off -DLLAMA_AVX=on -DLLAMA_AVX2=off -DLLAMA_AVX512=off -DLLAMA_FMA=off -DLLAMA_F16C=off -DLLAMA_CUDA_PEER_MAX_BATCH_SIZE=0 -DLLAMA_CUDA_FORCE_MMQ=on
//go:generate cmake --build gguf/build/cuda --target server --config Release
//go:generate mv gguf/build/cuda/bin/server gguf/build/cuda/bin/ollama-runner
```

```bash
cd ../..
go generate ./...
go build .
```
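
The flags doing the work here appear to be `-DLLAMA_CUDA_FORCE_MMQ=on` and `-DLLAMA_CUDA_PEER_MAX_BATCH_SIZE=0` on the gguf build. As I understand the llama.cpp build options, forcing MMQ avoids the extra cuBLAS dequantization buffers (less VRAM), and a peer max batch size of 0 effectively keeps GPU peer-to-peer access disabled. A sketch of running the rebuilt binary instead of the packaged service (the model name is just an example):

```bash
# Run the locally built binary rather than the packaged service (sketch).
sudo systemctl stop ollama   # only if the packaged service is running
./ollama serve &
./ollama run llama2          # llama2 is just an example; use any pulled model
```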

@ex3ndr commented on GitHub (Dec 22, 2023):

I just finished a 2x4090 build and I'm getting the same errors.


@jmorganca commented on GitHub (Jan 10, 2024):

This should be fixed with https://github.com/jmorganca/ollama/pull/1850, but feel free to re-open an issue if not
