[GH-ISSUE #618] Trying to load too many layers, vram oom, reverts to cpu only. #26034

Closed
opened 2026-04-22 01:55:10 -05:00 by GiteaMirror · 7 comments

Originally created by @aaroncoffey on GitHub (Sep 27, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/618

Hi there,

Based on the logs, it appears that Ollama is trying to load too many layers and crashing with an out-of-memory (OOM) error, which causes it to fall back to CPU-only mode. That's not desirable.

Logs:

2023/09/26 21:40:42 llama.go:310: starting llama runner
2023/09/26 21:40:42 llama.go:346: waiting for llama runner to start responding
ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6
  Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6
{"timestamp":1695789642,"level":"INFO","function":"main","line":1190,"message":"build info","build":1009,"commit":"9e232f0"}
{"timestamp":1695789642,"level":"INFO","function":"main","line":1192,"message":"system info","n_threads":6,"total_threads":12,"system_info":"AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | "}
llama.cpp: loading model from /home/user/.ollama/models/blobs/sha256:476d7ab8503b020bfee1e3c63403690f48422bb29c988ae74647c0c81b99e2a4
llama_model_load_internal: warning: assuming 70B model based on GQA == 8
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 7168
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_head_kv  = 8
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 8
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 28672
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 10 (mostly Q2_K)
llama_model_load_internal: model size = 70B
llama_model_load_internal: ggml ctx size =    0.21 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3060) as main device
llama_model_load_internal: mem required  = 4459.58 MB (+  640.00 MB per state)
llama_model_load_internal: allocating batch_size x (1280 kB + n_ctx x 256 B) = 896 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 71 repeating layers to GPU
llama_model_load_internal: offloaded 71/83 layers to GPU
llama_model_load_internal: total VRAM used: 24837 MB
CUDA error 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/ggml/ggml-cuda.cu:6184: out of memory
2023/09/26 21:41:02 llama.go:320: llama runner exited with error: exit status 1
2023/09/26 21:41:02 llama.go:327: error starting llama runner: llama runner process has terminated
2023/09/26 21:41:02 llama.go:310: starting llama runner
2023/09/26 21:41:02 llama.go:346: waiting for llama runner to start responding
{"timestamp":1695789662,"level":"WARNING","function":"server_params_parse","line":845,"message":"Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support","n_gpu_layers":0}
{"timestamp":1695789662,"level":"INFO","function":"main","line":1190,"message":"build info","build":1009,"commit":"9e232f0"}
{"timestamp":1695789662,"level":"INFO","function":"main","line":1192,"message":"system info","n_threads":6,"total_threads":12,"system_info":"AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | "}
llama.cpp: loading model from /home/user/.ollama/models/blobs/sha256:476d7ab8503b020bfee1e3c63403690f48422bb29c988ae74647c0c81b99e2a4
llama_model_load_internal: warning: assuming 70B model based on GQA == 8
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 7168
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_head_kv  = 8
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 8
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 28672
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 10 (mostly Q2_K)
llama_model_load_internal: model size = 70B
llama_model_load_internal: ggml ctx size =    0.21 MB
llama_model_load_internal: mem required  = 27615.90 MB (+  640.00 MB per state)
llama_new_context_with_model: kv self size  =  640.00 MB
llama_new_context_with_model: compute buffer total size =  305.35 MB

llama server listening at http://127.0.0.1:49467

Exposing some model-level options to define how much VRAM to use from each video card, or even a percentage split, would be helpful.
In my experience with oobabooga, the proper number of layers to offload varies by model, but with careful tuning I can get each video card nearly maxed out.
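
For concreteness, the kind of per-card calculation being asked for could look roughly like the sketch below. It reads each GPU's free memory with nvidia-smi and divides by an assumed per-layer cost (~350 MB here, derived from the 24837 MB / 71 offloaded layers figure in the log above); the real cost varies with model, quantization, and context size, so treat this as an illustration only.

```
# Rough sketch: estimate how many layers might fit on each GPU, assuming
# ~350 MB per layer (24837 MB / 71 layers from the log above; varies by model).
nvidia-smi --query-gpu=index,memory.free --format=csv,noheader,nounits |
while IFS=, read -r idx free_mib; do
  echo "GPU $idx: ${free_mib} MiB free, roughly $(( free_mib / 350 )) layers would fit"
done
```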

Thanks!


@BruceMacD commented on GitHub (Sep 27, 2023):

Right now Ollama estimates how many model layers can fit in VRAM by comparing the size of the model to the available memory; there are still improvements to be made here, for sure. In the meantime, the number of model layers to load into VRAM can be manually specified in the Modelfile.

Here is what that looks like:

  1. Create the Modelfile from whatever model you wish to use. For example, let's load 50 layers into VRAM.
     FROM llama2
     PARAMETER num_gpu 50
  2. Create the model runner with the specified settings.
     ollama create llama2:vram-50 -f path/to/Modelfile
  3. Run the customized model.
     ollama run llama2:vram-50
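
As a possible shortcut, and assuming the running Ollama version passes Modelfile parameters through the API's "options" field (see docs/api.md in the repository), the same setting can be tried per request without creating a new tag. A sketch:

```
# Hedged sketch: only works if this Ollama build accepts Modelfile parameters
# such as num_gpu in the API "options" object (check docs/api.md for your version).
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Hello",
  "options": { "num_gpu": 50 }
}'
```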

@geekodour commented on GitHub (Sep 28, 2023):

From the docs (https://github.com/jmorganca/ollama/blob/main/docs/modelfile.md), num_gpu is described as the number of GPUs to use, not as a number of model layers:

> The number of GPUs to use. On macOS it defaults to 1 to enable metal support, 0 to disable.

Also, is there a way to ensure the GPU is being picked up?
https://github.com/ggerganov/llama.cpp/blob/master/docs/token_generation_performance_tips.md

I am not able to see any offloading-related logs in the ollama logs, and nvtop is not showing any significant usage either.
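
As a general sanity check (common practice, not Ollama-specific guidance), GPU pickup is usually visible in two places: GPU memory while a prompt runs, and llama.cpp's offload messages, like the "offloaded 71/83 layers to GPU" line in the log above. A rough sketch, assuming a systemd-managed install for the log query:

```
# Watch per-GPU memory and utilization while a prompt is running:
watch -n 1 nvidia-smi

# Grep the server output for llama.cpp's offload lines (seen in the log above);
# where the log lives depends on how the server is started, e.g. with systemd:
journalctl -u ollama --no-pager | grep -i offload
```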


@aaroncoffey commented on GitHub (Sep 28, 2023):

Thank you Bruce.

I can confirm that setting num_gpu helps, though, as noted by @geekodour, the docs do not reflect this.

On multi-GPU systems, it would be very helpful to be able to define how many layers or how much VRAM can be used by each GPU.

In my testing of the above, 50 layers used only ~17 GB of VRAM out of the combined 24 GB available, but the split was uneven, resulting in one GPU hitting OOM while the other was only about half used. 45 layers gave ~11.5 GB to one GPU and ~8 GB to the other; this didn't OOM, but the cards could be better utilized.
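
A crude way to tune this by hand is to build one tag per candidate num_gpu value and run each while watching GPU memory and the server log for the "out of memory" error seen above. This is only a sketch, and the model tag shown is a placeholder for whatever model is actually in use:

```
# Hedged sketch of manual tuning: one tag per num_gpu value; try each and keep
# the largest value that does not trigger a CUDA out-of-memory error.
# "llama2:70b-q2_K" is only a placeholder for the actual model tag.
for n in 42 44 46 48 50; do
  printf 'FROM llama2:70b-q2_K\nPARAMETER num_gpu %s\n' "$n" > Modelfile.tune
  ollama create "tune-$n" -f Modelfile.tune
done
ollama run tune-46 "hello"   # repeat for each tag while watching nvidia-smi
```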


@jtoy commented on GitHub (Sep 29, 2023):

Is there a way to force CPU-only mode for testing purposes?


@BruceMacD commented on GitHub (Sep 29, 2023):

@jtoy Yup, you can set num_gpu to 0, then no layers will be loaded onto the GPU.

  1. Create the Modelfile from whatever model you wish to use and set num_gpu to 0.
     FROM llama2
     PARAMETER num_gpu 0
  2. Create the model runner with the specified settings.
     ollama create llama2:cpu -f path/to/Modelfile
  3. Run the customized model.
     ollama run llama2:cpu
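
Another option for quick testing, assuming the server is launched manually rather than as a managed service, is the standard CUDA environment variable that hides GPUs from a process; this is generic CUDA behaviour, not an Ollama-specific flag:

```
# Standard CUDA environment variable (not an Ollama flag): hide all GPUs from
# the process. Only applies when you start the server yourself like this:
CUDA_VISIBLE_DEVICES="" ollama serve
```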

@technovangelist commented on GitHub (Dec 4, 2023):

It looks like Bruce has resolved this issue so I will go ahead and close it now. If you think there is anything we left out, reopen and we can address. Thanks for being part of this great community.


@BananaAcid commented on GitHub (Dec 16, 2023):

> @jtoy Yup, you can set num_gpu to 0, then no layers will be loaded onto the GPU.
>
>   1. Create the Modelfile from whatever model you wish to use and set num_gpu to 0.
>      FROM llama2
>      PARAMETER num_gpu 0
>   2. Create the model runner with the specified settings.
>      ollama create llama2:cpu -f path/to/Modelfile
>   3. Run the customized model.
>      ollama run llama2:cpu

I believe it would be helpful to mention this in the README.md.
