[GH-ISSUE #3301] Question: GPU not fully utilized when not all layers are offloaded #2032

Closed
opened 2026-04-12 12:15:25 -05:00 by GiteaMirror · 13 comments

Originally created by @TomTom101 on GitHub (Mar 22, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3301

I am running Mixtral 8x7B Q4 on a RTX 3090 with 24GB VRAM. 23/33 layers are offloaded to the GPU:

llm_load_tensors: offloading 23 repeating layers to GPU
llm_load_tensors: offloaded 23/33 layers to GPU
llm_load_tensors:        CPU buffer size = 25215.87 MiB
llm_load_tensors:      CUDA0 buffer size = 17999.66 MiB

Now during inference, the GPU utilization never exceeds 15%. I get ~15 tokens/s and mostly see this:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        Off | 00000000:08:00.0 Off |                  N/A |
| 30%   44C    P2             150W / 370W |  22070MiB / 24576MiB |     15%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

Utilization is always >90% when I load Mistral 7B, which is fully offloaded to the GPU (and is pretty fast at ~110 tokens/s).

Questions

  • Is such a low GPU utilization normal when "only" 70% of layers are offloaded?
  • What options do I have to increase GPU utilization? That thing is too expensive to have it sit idle ;)

Thanks!

Here is the full ollama startup log:
time=2024-03-22T20:48:20.367Z level=INFO source=routes.go:76 msg="changing loaded model"
time=2024-03-22T20:48:20.624Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-22T20:48:20.624Z level=INFO source=gpu.go:119 msg="CUDA Compute Capability detected: 8.6"
time=2024-03-22T20:48:20.624Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-22T20:48:20.624Z level=INFO source=gpu.go:119 msg="CUDA Compute Capability detected: 8.6"
time=2024-03-22T20:48:20.624Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-22T20:48:20.624Z level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: /root/.ollama/assets/0.1.28/cuda_v11/libext_server.so"
time=2024-03-22T20:48:20.624Z level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
llama_model_loader: loaded meta data with 26 key-value pairs and 995 tensors from /root/.ollama/models/blobs/sha256:3a17f7cde150070bbc815645693fb93c311cc42e7deaf198364acadcf08458f8 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:                         llama.expert_count u32              = 8
llama_model_loader: - kv  10:                    llama.expert_used_count u32              = 2
llama_model_loader: - kv  11:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  12:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:                          general.file_type u32              = 15
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  18:                      tokenizer.ggml.merges arr[str,58980]   = ["▁ t", "i n", "e r", "▁ a", "h e...
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  21:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:   32 tensors
llama_model_loader: - type q8_0:   64 tensors
llama_model_loader: - type q4_K:  833 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 8
llm_load_print_meta: n_expert_used    = 2
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 46.70 B
llm_load_print_meta: model size       = 24.62 GiB (4.53 BPW) 
llm_load_print_meta: general.name     = mistralai
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.76 MiB
llm_load_tensors: offloading 23 repeating layers to GPU
llm_load_tensors: offloaded 23/33 layers to GPU
llm_load_tensors:        CPU buffer size = 25215.87 MiB
llm_load_tensors:      CUDA0 buffer size = 17999.66 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   yes
ggml_init_cublas: CUDA_USE_TENSOR_CORES: no
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llama_kv_cache_init:  CUDA_Host KV buffer size =    72.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =   184.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:  CUDA_Host input buffer size   =    13.02 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   192.01 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   188.03 MiB
llama_new_context_with_model: graph splits (measure): 3
time=2024-03-22T20:48:22.828Z level=INFO source=dyn_ext_server.go:162 msg="Starting llama main loop"
GiteaMirror added the question label 2026-04-12 12:15:25 -05:00

@remy415 commented on GitHub (Mar 25, 2024):

llm_load_print_meta: model size = 24.62 GiB (4.53 BPW)

@TomTom101 It's likely because the model is too large to fit into your GPU memory, so it is split up and you get <100% of the layers offloaded, resulting in lower performance.

You might need to switch to a smaller model; the Mistral 7B you mentioned before is really good.
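For a rough sense of why the split lands at 23/33, here is a back-of-envelope sketch in Python (my own illustration, not how Ollama actually decides the split) using the CUDA0 buffer size from the log above and treating every repeating layer as the same size:

```python
# Rough estimate of why only 23/33 layers of Mixtral 8x7B Q4 fit on a 24 GB RTX 3090.
# Numbers come from the llm_load_tensors lines above; every repeating layer is
# treated as the same size, which is an approximation.
cuda_buffer_mib = 17999.66        # CUDA0 buffer size reported for 23 repeating layers
offloaded_layers = 23
total_repeating_layers = 32       # llama.block_count for Mixtral 8x7B

per_layer_mib = cuda_buffer_mib / offloaded_layers          # ~783 MiB per layer
full_offload_mib = per_layer_mib * total_repeating_layers   # ~25,000 MiB

vram_mib = 24 * 1024              # 24 GB card, ignoring OS/driver reservations
print(f"~{per_layer_mib:.0f} MiB/layer, ~{full_offload_mib:.0f} MiB for all repeating layers, "
      f"{vram_mib} MiB of VRAM")
# Even before the output layer, KV cache and compute buffers are added, all 32
# repeating layers would overflow the card, so the remaining layers stay on the CPU.
```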

@TomTom101 commented on GitHub (Mar 26, 2024):

Thanks @remy415! Sure, the model is too big to fully fit in VRAM. I just wonder why the GPU is not working at full load on the layers it chose to load. What would GPU utilization look like if 32 of 33 layers fit in VRAM? Would it still be around 15%, or more like 50% or 95%?

My questions are still unanswered :)

Disclaimer: while I'm not a newbie at using generative AI and its models, I am a newbie at running them locally.

@remy415 commented on GitHub (Mar 26, 2024):

@TomTom101 I'm by no means an expert in how this works, but from what I understand:

This is an oversimplification, but if you have to offload processing to the CPU, and the portion the CPU does takes 85% of your time, then your GPU will be at 15% utilization tops.

As for the "chunking" of the work, it doesn't exactly work in a way that the GPU can "cycle through" the parts of work that need to be done as it would involve loading and unloading portions of the model; this takes a lot of time and would likely be much slower than just having the CPU take a portion of the work. I hope that makes more sense! If someone else has a better example, please help because I too would like to understand this better if I have it wrong.

@MarkWard0110 commented on GitHub (Apr 19, 2024):

@remy415

> This is an oversimplification, but if you have to offload processing to the CPU, and the portion the CPU does takes 85% of your time, then your GPU will be at 15% utilization tops.
>
> As for the "chunking" of the work, it doesn't exactly work in a way that the GPU can "cycle through" the parts of work that need to be done as it would involve loading and unloading portions of the model; this takes a lot of time and would likely be much slower than just having the CPU take a portion of the work. I hope that makes more sense! If someone else has a better example, please help because I too would like to understand this better if I have it wrong.

Is this a correct understanding? The GPU utilization is low because it is "waiting" for the CPU. It is like two motors of different speeds (faster/slower) on the same output: the slower motor limits the maximum speed of the faster motor, and the faster motor shows lower utilization while the slower motor is fully utilized.
I imagine Ollama is sending commands to both the GPU and the CPU. The GPU processes faster than the CPU, and Ollama can't send the next command until the CPU has completed its task. The GPU will not process any instructions while the CPU is finishing, and that brings down the GPU utilization.

@remy415 commented on GitHub (Apr 19, 2024):

@MarkWard0110 That more or less covers it

llm_load_tensors: offloaded 23/33 layers to GPU

In this case, 10 layers are not offloaded to the GPU. Say (arbitrarily, just to give an example) it takes the GPU 5 seconds to process its 23 layers; some of those layers may be waiting on inputs from the 10 CPU layers, or waiting to hand their outputs to those layers. It really depends on how the layers are divided, but you essentially never want to split them if you have GPU acceleration.

@MarkWard0110 commented on GitHub (Apr 19, 2024):

@remy415 ,
I'm now curious why the GPU's RAM does not seem fully utilized when loading large models like llama3:70b-instruct, given the following hardware: NVIDIA RTX 4070 (16 GB), Intel i9-14900K, 96 GB RAM.
How should I interpret the utilization that I see? GPU memory utilization is 0-20%, with Ollama running the model on the CPU.
Is there a size limit to what Ollama can even split among available resources?

Here is the log of when it loaded the model


Apr 19 16:54:15 quorra ollama[1180]: time=2024-04-19T16:54:15.973Z level=INFO source=routes.go:97 msg="changing loaded model"
Apr 19 16:54:16 quorra ollama[1180]: time=2024-04-19T16:54:16.636Z level=INFO source=gpu.go:121 msg="Detecting GPU type"
Apr 19 16:54:16 quorra ollama[1180]: time=2024-04-19T16:54:16.636Z level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*"
Apr 19 16:54:16 quorra ollama[1180]: time=2024-04-19T16:54:16.636Z level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama799514660/runners/cuda_v11/libcudart.so.11.0]"
Apr 19 16:54:16 quorra ollama[1180]: time=2024-04-19T16:54:16.636Z level=INFO source=gpu.go:126 msg="Nvidia GPU detected via cudart"
Apr 19 16:54:16 quorra ollama[1180]: time=2024-04-19T16:54:16.636Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Apr 19 16:54:16 quorra ollama[1180]: time=2024-04-19T16:54:16.666Z level=INFO source=gpu.go:202 msg="[cudart] CUDART CUDA Compute Capability detected: 8.9"
Apr 19 16:54:16 quorra ollama[1180]: time=2024-04-19T16:54:16.681Z level=INFO source=gpu.go:121 msg="Detecting GPU type"
Apr 19 16:54:16 quorra ollama[1180]: time=2024-04-19T16:54:16.681Z level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*"
Apr 19 16:54:16 quorra ollama[1180]: time=2024-04-19T16:54:16.681Z level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama799514660/runners/cuda_v11/libcudart.so.11.0]"
Apr 19 16:54:16 quorra ollama[1180]: time=2024-04-19T16:54:16.681Z level=INFO source=gpu.go:126 msg="Nvidia GPU detected via cudart"
Apr 19 16:54:16 quorra ollama[1180]: time=2024-04-19T16:54:16.681Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Apr 19 16:54:16 quorra ollama[1180]: time=2024-04-19T16:54:16.706Z level=INFO source=gpu.go:202 msg="[cudart] CUDART CUDA Compute Capability detected: 8.9"
Apr 19 16:54:16 quorra ollama[1180]: time=2024-04-19T16:54:16.721Z level=INFO source=server.go:127 msg="offload to gpu" reallayers=30 layers=30 required="38968.0 MiB" used="15573.3 MiB" available="15857.2 MiB" kv="640.0 MiB" fulloffload="324.0 MiB" partialoffload="1104.5 MiB"
Apr 19 16:54:16 quorra ollama[1180]: time=2024-04-19T16:54:16.721Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Apr 19 16:54:16 quorra ollama[1180]: time=2024-04-19T16:54:16.721Z level=INFO source=server.go:264 msg="starting llama server" cmd="/tmp/ollama799514660/runners/cuda_v11/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-4fe022a8902336d3c452c88f7aca5590f5b5b02ccfd06320fdefab02412e1f0b --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 30 --port 45133"
Apr 19 16:54:16 quorra ollama[1180]: time=2024-04-19T16:54:16.721Z level=INFO source=server.go:389 msg="waiting for llama runner to start responding"
Apr 19 16:54:16 quorra ollama[3928462]: {"function":"server_params_parse","level":"INFO","line":2603,"msg":"logging to file is disabled.","tid":"140347710894080","timestamp":1713545656}
Apr 19 16:54:16 quorra ollama[3928462]: {"build":1,"commit":"7593639","function":"main","level":"INFO","line":2819,"msg":"build info","tid":"140347710894080","timestamp":1713545656}
Apr 19 16:54:16 quorra ollama[3928462]: {"function":"main","level":"INFO","line":2822,"msg":"system info","n_threads":16,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | ","tid":"140347710894080","timestamp":1713545656,"total_threads":32}
Apr 19 16:54:16 quorra ollama[1180]: llama_model_loader: loaded meta data with 21 key-value pairs and 723 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-4fe022a8902336d3c452c88f7aca5590f5b5b02ccfd06320fdefab02412e1f0b (version GGUF V3 (latest))
Apr 19 16:54:16 quorra ollama[1180]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Apr 19 16:54:16 quorra ollama[1180]: llama_model_loader: - kv   0:                       general.architecture str              = llama
Apr 19 16:54:16 quorra ollama[1180]: llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-70B-Instruct
Apr 19 16:54:16 quorra ollama[1180]: llama_model_loader: - kv   2:                          llama.block_count u32              = 80
Apr 19 16:54:16 quorra ollama[1180]: llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
Apr 19 16:54:16 quorra ollama[1180]: llama_model_loader: - kv   4:                     llama.embedding_length u32              = 8192
Apr 19 16:54:16 quorra ollama[1180]: llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 28672
Apr 19 16:54:16 quorra ollama[1180]: llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 64
Apr 19 16:54:16 quorra ollama[1180]: llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
Apr 19 16:54:16 quorra ollama[1180]: llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
Apr 19 16:54:16 quorra ollama[1180]: llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
Apr 19 16:54:16 quorra ollama[1180]: llama_model_loader: - kv  10:                          general.file_type u32              = 2
Apr 19 16:54:16 quorra ollama[1180]: llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
Apr 19 16:54:16 quorra ollama[1180]: llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
Apr 19 16:54:16 quorra ollama[1180]: llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
Apr 19 16:54:16 quorra ollama[1180]: llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
Apr 19 16:54:16 quorra ollama[1180]: llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Apr 19 16:54:16 quorra ollama[1180]: llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
Apr 19 16:54:16 quorra ollama[1180]: llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 128000
Apr 19 16:54:16 quorra ollama[1180]: llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 128001
Apr 19 16:54:16 quorra ollama[1180]: llama_model_loader: - kv  19:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
Apr 19 16:54:16 quorra ollama[1180]: llama_model_loader: - kv  20:               general.quantization_version u32              = 2
Apr 19 16:54:16 quorra ollama[1180]: llama_model_loader: - type  f32:  161 tensors
Apr 19 16:54:16 quorra ollama[1180]: llama_model_loader: - type q4_0:  561 tensors
Apr 19 16:54:16 quorra ollama[1180]: llama_model_loader: - type q6_K:    1 tensors
Apr 19 16:54:16 quorra ollama[1180]: llm_load_vocab: special tokens definition check successful ( 256/128256 ).
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: format           = GGUF V3 (latest)
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: arch             = llama
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: vocab type       = BPE
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: n_vocab          = 128256
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: n_merges         = 280147
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: n_ctx_train      = 8192
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: n_embd           = 8192
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: n_head           = 64
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: n_head_kv        = 8
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: n_layer          = 80
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: n_rot            = 128
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: n_embd_head_k    = 128
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: n_embd_head_v    = 128
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: n_gqa            = 8
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: n_embd_k_gqa     = 1024
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: n_embd_v_gqa     = 1024
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: n_ff             = 28672
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: n_expert         = 0
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: n_expert_used    = 0
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: causal attn      = 1
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: pooling type     = 0
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: rope type        = 0
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: rope scaling     = linear
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: freq_base_train  = 500000.0
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: freq_scale_train = 1
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: n_yarn_orig_ctx  = 8192
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: rope_finetuned   = unknown
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: ssm_d_conv       = 0
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: ssm_d_inner      = 0
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: ssm_d_state      = 0
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: ssm_dt_rank      = 0
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: model type       = 70B
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: model ftype      = Q4_0
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: model params     = 70.55 B
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: model size       = 37.22 GiB (4.53 BPW)
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: general.name     = Meta-Llama-3-70B-Instruct
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: LF token         = 128 'Ä'
Apr 19 16:54:17 quorra ollama[1180]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
Apr 19 16:54:17 quorra ollama[1180]: ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
Apr 19 16:54:17 quorra ollama[1180]: ggml_cuda_init: found 1 CUDA devices:
Apr 19 16:54:17 quorra ollama[1180]:   Device 0: NVIDIA GeForce RTX 4070 Ti SUPER, compute capability 8.9, VMM: yes
Apr 19 16:54:17 quorra ollama[1180]: llm_load_tensors: ggml ctx size =    0.55 MiB
Apr 19 16:54:30 quorra ollama[1180]: llm_load_tensors: offloading 30 repeating layers to GPU
Apr 19 16:54:30 quorra ollama[1180]: llm_load_tensors: offloaded 30/81 layers to GPU
Apr 19 16:54:30 quorra ollama[1180]: llm_load_tensors:        CPU buffer size = 38110.61 MiB
Apr 19 16:54:30 quorra ollama[1180]: llm_load_tensors:      CUDA0 buffer size = 13771.88 MiB
Apr 19 16:54:31 quorra ollama[1180]: ...................................................................................................
Apr 19 16:54:31 quorra ollama[1180]: llama_new_context_with_model: n_ctx      = 2048
Apr 19 16:54:31 quorra ollama[1180]: llama_new_context_with_model: n_batch    = 512
Apr 19 16:54:31 quorra ollama[1180]: llama_new_context_with_model: n_ubatch   = 512
Apr 19 16:54:31 quorra ollama[1180]: llama_new_context_with_model: freq_base  = 500000.0
Apr 19 16:54:31 quorra ollama[1180]: llama_new_context_with_model: freq_scale = 1
Apr 19 16:54:31 quorra ollama[1180]: llama_kv_cache_init:  CUDA_Host KV buffer size =   400.00 MiB
Apr 19 16:54:31 quorra ollama[1180]: llama_kv_cache_init:      CUDA0 KV buffer size =   240.00 MiB
Apr 19 16:54:31 quorra ollama[1180]: llama_new_context_with_model: KV self size  =  640.00 MiB, K (f16):  320.00 MiB, V (f16):  320.00 MiB
Apr 19 16:54:31 quorra ollama[1180]: llama_new_context_with_model:  CUDA_Host  output buffer size =     0.52 MiB
Apr 19 16:54:31 quorra ollama[1180]: llama_new_context_with_model:      CUDA0 compute buffer size =  1104.45 MiB
Apr 19 16:54:31 quorra ollama[1180]: llama_new_context_with_model:  CUDA_Host compute buffer size =    20.01 MiB
Apr 19 16:54:31 quorra ollama[1180]: llama_new_context_with_model: graph nodes  = 2566
Apr 19 16:54:31 quorra ollama[1180]: llama_new_context_with_model: graph splits = 554
Apr 19 16:54:32 quorra ollama[3928462]: {"function":"initialize","level":"INFO","line":448,"msg":"initializing slots","n_slots":1,"tid":"140347710894080","timestamp":1713545672}

@remy415 commented on GitHub (Apr 19, 2024):

Yes, the model needs to fit entirely in GPU memory, plus approximately 10% for an overhead buffer, for it to offload all the layers. The 70B is a huge model, and in 16 GB even a 13B would have a hard time. Definitely keep models below GPU VRAM if you want acceleration.
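As a hedged sketch of that rule of thumb, using the model sizes quoted elsewhere in this thread (the ~10% overhead factor is this thread's approximation, not an exact Ollama formula, and `fits_in_vram` is just an illustrative helper):

```python
# "Will it fully offload?" check based on the rule of thumb above: model size plus
# roughly 10% for overhead must fit in VRAM. The 10% factor and fits_in_vram are
# this thread's approximation / my illustration, not Ollama's actual logic.
def fits_in_vram(model_size_gib: float, vram_gib: float, overhead: float = 0.10) -> bool:
    return model_size_gib * (1 + overhead) <= vram_gib

print(fits_in_vram(37.22, 16))   # llama3:70b Q4_0 on a 16 GB card   -> False
print(fits_in_vram(6.86, 16))    # codellama:13b Q4_0 on 16 GB       -> True
print(fits_in_vram(24.62, 24))   # Mixtral 8x7B Q4_K_M on 24 GB      -> False
```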

@MarkWard0110 commented on GitHub (Apr 19, 2024):

@remy415 ,
Are you saying Ollama will only run a model on the CPU if it does not fit in GPU memory? I thought Ollama splits models among the available resources, with priority on the GPU.
For example, if I load llama3:70b, Ollama would load some of it into GPU memory and the rest into CPU memory. That is why I am asking why the GPU RAM does not appear to be fully utilized when loading the model.

I may not understand what these parts of the log mean

Apr 19 16:54:30 quorra ollama[1180]: llm_load_tensors: offloading 30 repeating layers to GPU
Apr 19 16:54:30 quorra ollama[1180]: llm_load_tensors: offloaded 30/81 layers to GPU
Apr 19 16:54:30 quorra ollama[1180]: llm_load_tensors:        CPU buffer size = 38110.61 MiB
Apr 19 16:54:30 quorra ollama[1180]: llm_load_tensors:      CUDA0 buffer size = 13771.88 MiB

I am thinking that means I should see about 13 GiB of the GPU's memory used. I'm not seeing that amount of VRAM used on the GPU. It sits at 0% and maybe blips up to 15% while this model is executing.

@remy415 commented on GitHub (Apr 19, 2024):

I think the misunderstanding is this: CPU and GPU cannot efficiently and effectively work together to run inference on a model. That's not to say that the CPU doesn't do things when all layers are offloaded; I'm saying that you can't really say that "use 100% gpu = fast, use 100% gpu and 100% cpu = faster" as it doesn't really work like that. The "layers" are extremely intertwined and there is a lot of context switching between CPU and GPU when both are used, which is a huge drag on performance. That is why you want 100% of the layers in the GPU: context switching is very expensive in terms of adding latency.

Additionally, maybe this chart will help put this into perspective:
![image](https://github.com/ollama/ollama/assets/105550370/af139080-d41e-4aab-b78e-ba0cea43b4ca)

Most of us use the (int4s) version of a model as referenced here in your log:

Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: model type       = 70B
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: model ftype      = Q4_0 <-- 4 bit quantization
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: model params     = 70.55 B <-- 70B parameters
Apr 19 16:54:16 quorra ollama[1180]: llm_load_print_meta: model size       = 37.22 GiB (4.53 BPW) <-- needs 37GiB of RAM

Your 4070ti has ~14 GiB of memory free (16 GiB - OS reserved memory - "10% overhead buffer" = approx 14 GiB). In order to go maximum speed with your GPU, you need to select a model that fits in the ~14GiB memory space (preferably with a little wiggle room). According to the chart, the 7B and 13B models will fit just fine and anything larger will split it across CPU and GPU. Splitting it should be avoided at all costs as context switching between CPU and GPU is one of the most time-consuming processes, and your inference will likely run slower splitting it than it would by simply going 100% CPU (with AVX2 or something).

Unless you have a GPU with a premium amount of RAM (4090ti, A100, Jetson AGX Orin), your optimal model size for LLMs is between 7B and 13B parameters, and even then you may still find better results with the smaller models.
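The "model size ... (x.xx BPW)" lines in these logs follow directly from parameter count times bits per weight. A quick sanity check (approximate, since rounding and a few non-quantized tensors shift it slightly; `model_size_gib` is just an illustrative helper):

```python
# Estimate a quantized model's size from its parameter count and bits per weight
# (BPW), matching the "model size = ... GiB (x.xx BPW)" log lines.
# model_size_gib is an illustrative helper, not part of Ollama or llama.cpp.
def model_size_gib(params_billion: float, bpw: float) -> float:
    return params_billion * 1e9 * bpw / 8 / 2**30

print(f"{model_size_gib(70.55, 4.53):.2f} GiB")   # ~37.2 GiB, as logged for llama3:70b
print(f"{model_size_gib(46.70, 4.53):.2f} GiB")   # ~24.6 GiB, as logged for Mixtral 8x7B
# Against the ~14 GiB of usable VRAM estimated above for a 16 GB card, that points
# to the 7B-13B Q4 models recommended in this thread rather than anything 70B-class.
```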

@MarkWard0110 commented on GitHub (Apr 19, 2024):

@remy415 ,
I understand that I must select a model appropriately sized for GPU acceleration. Knowing that larger models on the CPU are painfully slow compared to the GPU, I would like to better understand what is going on when I run models larger than the available GPU memory.

For example, when I load `codellama:13b-instruct`

llm_load_print_meta: model type       = 13B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 13.02 B
llm_load_print_meta: model size       = 6.86 GiB (4.53 BPW)

It logs

llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors:        CPU buffer size =    87.93 MiB
llm_load_tensors:      CUDA0 buffer size =  6936.07 MiB

I find that it loads into GPU and is accelerated.

When I ignore how long it will take to execute and load a large model like `llama3:70b-instruct`

llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 70.55 B
llm_load_print_meta: model size       = 37.22 GiB (4.53 BPW)

It logs the following

llm_load_tensors: offloading 30 repeating layers to GPU
llm_load_tensors: offloaded 30/81 layers to GPU
llm_load_tensors:        CPU buffer size = 38110.61 MiB
llm_load_tensors:      CUDA0 buffer size = 13771.88 MiB

What does it mean that it offloaded 30/81 layers to the GPU?
Why is the CUDA0 buffer size what it is? Why is it different from when I loaded the smaller model?
It makes me think the model has 30 layers on the GPU and 51 on the CPU. However, that conflicts with your saying it would be worse than 100% CPU if that were the case.

I don't have an intuition for what it's like when a model executes. Perhaps if I did, I might understand the splitting better. I don't know how much of it runs in parallel, or how much state must be shared while running.

What about situations where someone has multiple GPUs?


@remy415 commented on GitHub (Apr 19, 2024):

So what I’m about to say is a very oversimplified explanation based on a rudimentary understanding of AI.

Think of the model as a web of numbers that can be used to calculate the missing pieces of a data set. In terms of an LLM, words are turned into numbers and their associations are “trained” into the model, so when I ask why the sky is blue, it references its trained associations and gives an answer.

In terms of memory, reading and writing work best on uninterrupted, contiguous chunks. Unfortunately, models are a mix of data: your answer consists of data points that live in different parts of the model.

When you split-load the model, some of your answer is on the GPU and some is on the CPU. The program has to jump back and forth between reading CPU and GPU memory (this is the context switch I referenced earlier). That switching back and forth is extremely slow in computer terms, relatively speaking. That’s why working purely on the CPU can be faster than a split load: you spend more time jumping between CPU and GPU than you do processing.

The reason the layers loaded the way they did is that each layer is a fixed size and represents a semi-logical chunk of the model. 30 layers take 13,771 MiB, and adding any more layers would cause an out-of-memory error. The smaller models have a different number and size of layers, so they fit entirely into your available VRAM.
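
A rough way to see where the 30/81 split comes from (this is only approximate: the non-repeating tensors such as the embeddings and output layer are not the same size as a repeating layer, so the real numbers differ slightly):

```python
# Approximate per-layer arithmetic behind "offloaded 30/81 layers to GPU".
model_size_mib = 37.22 * 1024                   # 70B Q4_0 from the log, ~38,113 MiB
total_layers = 81
per_layer_mib = model_size_mib / total_layers   # ~470 MiB per layer

offloaded = 30
print(offloaded * per_layer_mib)                 # ~14,100 MiB, same ballpark as the
                                                 # reported CUDA0 buffer of 13,771.88 MiB
print((total_layers - offloaded) * per_layer_mib)  # ~24,000 MiB worth of layers stays on the CPU side
```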

I hope that helps!


@remy415 commented on GitHub (Apr 22, 2024):

@MarkWard0110 Sorry, I forgot to talk about multiple GPUs.

Most systems that can leverage multiple GPUs, whether additional GPUs in the same computer or multiple computers with GPUs, usually use custom models and processors designed to be split up. The reason you don't see this often in things like llama.cpp (they're actually working on it at the moment; keep an eye on their GitHub) is that it's extremely hard to optimize the splitting so that it actually ends up faster. Remember the context switching I spoke of before? When you split over multiple computers, the context switch now happens over your network instead of the GPU-to-RAM bus, which is orders of magnitude slower than the previously mentioned context switch. What ends up happening is that the network (typically 1 Gbps minus overhead in most home settings, so ~950 Mbps) becomes a huge bottleneck. The end result is a mess where each system waits for the others to produce their part(s), and instead of answering the question in ~20 seconds it takes 4 minutes or longer.

The way companies like Amazon are able to do multi-GPU processing is:

  1. They have accelerated interconnects between their nodes (40 Gbps on the low end; I'm sure their top-line hardware goes much faster).
  2. They have teams of engineers building the overall system so that it can benefit from multiple GPUs, with the data processing optimized for their hardware.

As for multiple GPUs on a single system, the issue is similar to the context switching before: to send data from GPU1 to GPU2, it has to go over the PCIe bus, which again is much slower than just jumping to the next memory block.
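
Rough peak-bandwidth figures help show the gap being described. These are approximate, order-of-magnitude numbers, not measurements from this system:

```python
# Approximate peak bandwidths, in GB/s, for the paths data can take.
vram_gbps = 900.0         # GDDR6X on a high-end card, roughly in the ~1 TB/s range
pcie4_x16_gbps = 32.0     # host <-> GPU (or GPU <-> GPU) over PCIe 4.0 x16
gigabit_lan_gbps = 0.119  # ~950 Mbps of usable throughput on home 1 GbE

print(vram_gbps / pcie4_x16_gbps)         # staying in VRAM is ~28x faster than PCIe
print(pcie4_x16_gbps / gigabit_lan_gbps)  # PCIe is ~270x faster than gigabit Ethernet
```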

Remember that this is a very oversimplified response based on a very basic understanding of how AI/ML works; I'm not an expert by any means, I'm just a technology fan.


@MarkWard0110 commented on GitHub (Apr 23, 2024):

@remy415 ,
My local build of Ollama failed to include GPU support, so I have a CPU-only test run and can compare it to my split runs.

For the same prompt.

The following runs are CPU only:

"Model","Duration","TokensPerSecond"
"command-r:35b","00:02:46.5118881","2.6933559948852084"
"command-r:35b","00:02:16.7373528","2.7163071426924623"
"command-r:35b","00:03:03.3570712","2.6933685904112536"
"command-r:35b","00:00:09.5622642","2.8190653388868925"
"command-r:35b","00:02:32.7151065","2.74271767216686"

These are from runs where the model was split between the CPU and my GPU:

"Model","Duration","TokensPerSecond"
"command-r:35b","00:01:02.6563793","5.893486133334735"
"command-r:35b","00:01:15.8155590","5.819720988225828"
"command-r:35b","00:01:36.1705502","5.801072029718944"
"command-r:35b","00:00:03.8362888","6.404858811292363"
"command-r:35b","00:00:54.0697454","5.885712599742418"

CPU only averages ~2.7 TPS.
CPU+GPU averages ~6 TPS.
Even with the additional overhead the split may add, the GPU has clearly provided some help.
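
For anyone who wants to reproduce those averages from the rows above, a quick sketch (standard library only; the values are copied from the TokensPerSecond column):

```python
# Summarize the benchmark rows posted above.
import statistics

cpu_only_tps = [2.6934, 2.7163, 2.6934, 2.8191, 2.7427]
split_tps    = [5.8935, 5.8197, 5.8011, 6.4049, 5.8857]

print(statistics.mean(cpu_only_tps))  # ~2.73 tok/s
print(statistics.mean(split_tps))     # ~5.96 tok/s
print(statistics.mean(split_tps) / statistics.mean(cpu_only_tps))  # ~2.2x faster with partial offload
```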
