[GH-ISSUE #10858] GPU Layer Loading Issue with Unsloth Dynamic 2.0 Quantized Models #32892

Closed
opened 2026-04-22 14:48:42 -05:00 by GiteaMirror · 1 comment

Originally created by @Andy365-365 on GitHub (May 26, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10858

What is the issue?

Environment

  • OS: Linux (Ubuntu/Debian based)
  • Ollama Version: 0.7.1
  • GPU: 2x RTX 3080 (10GB VRAM each)
  • CUDA Version: 12.4
  • Driver Version: 550.54.15
  • Model: hf.co/unsloth/Qwen3-30B-A3B-GGUF:UD-Q4_K_XL

Issue Description

Ollama significantly under-utilizes GPU resources when loading Unsloth Dynamic 2.0 quantized models from Hugging Face: it offloads too few layers to the GPUs, resulting in poor inference performance.

Expected Behavior

  • GPU layers should be automatically optimized based on available VRAM
  • Similar to official Ollama models, GPU layer loading should be maximized within memory constraints

Actual Behavior

  • Only 16 out of 49 layers are loaded to GPU despite having ~20GB total VRAM available
  • Most computation happens on CPU, severely impacting inference speed
  • Manual OLLAMA_GPU_LAYERS override is required for optimal performance
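
For reference, the override mentioned above maps onto Ollama's documented num_gpu parameter (the number of layers to send to the GPU(s)). A minimal sketch via a Modelfile; the name qwen3-49 is made up for illustration, and whether all 49 layers actually fit alongside the 40k-token KV cache is exactly what is in question here:

    # Modelfile
    FROM hf.co/unsloth/Qwen3-30B-A3B-GGUF:UD-Q4_K_XL
    PARAMETER num_gpu 49

    ollama create qwen3-49 -f Modelfile
    ollama run qwen3-49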

Reproduction Steps

  1. Pull the model:

    ollama pull hf.co/unsloth/Qwen3-30B-A3B-GGUF:UD-Q4_K_XL
    
  2. Run the model:

    ollama run hf.co/unsloth/Qwen3-30B-A3B-GGUF:UD-Q4_K_XL
    
  3. Check GPU utilization:

    nvidia-smi
    journalctl -u ollama --since "5 minutes ago" | grep -E "(layer|gpu|load)"
    
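For a quick per-GPU memory snapshot while the model is loaded, nvidia-smi's standard query flags give a compact view:

    nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv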

Log Output

root@ai-tm:/data/ollama# journalctl -u ollama --since "5 minutes ago" | grep -E "(layer|gpu|load|offload)"
5月 26 10:58:23 ai-tm ollama[40463]: time=2025-05-26T10:58:23.170+08:00 level=INFO source=server.go:168 msg=offload library=cuda layers.requested=-1 layers.model=49 layers.offload=16 layers.split=8,8 memory.available="[9.5 GiB 9.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="31.8 GiB" memory.required.partial="18.4 GiB" memory.required.kv="3.8 GiB" memory.required.allocations="[9.2 GiB 9.2 GiB]" memory.weights.total="16.3 GiB" memory.weights.repeating="16.1 GiB" memory.weights.nonrepeating="243.4 MiB" memory.graph.full="5.0 GiB" memory.graph.partial="5.0 GiB"
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: loaded meta data with 35 key-value pairs and 579 tensors from /data/ollama/models/blobs/sha256-263346cdf8c9824cc332d2b00a84100a5be231ac600e7a875c6a2b47c9802f57 (version GGUF V3 (latest))
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv   1:                               general.type str              = model
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv   2:                               general.name str              = Qwen3-30B-A3B
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv   3:                           general.basename str              = Qwen3-30B-A3B
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv   4:                       general.quantized_by str              = Unsloth
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv   5:                         general.size_label str              = 30B-A3B
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv   6:                           general.repo_url str              = https://huggingface.co/unsloth

...

5月 26 10:58:24 ai-tm ollama[40463]: print_info: n_layer          = 48
5月 26 10:58:24 ai-tm ollama[40463]: load_tensors: loading model tensors, this can take a while... (mmap = true)
5月 26 10:58:25 ai-tm ollama[40463]: load_tensors: offloading 16 repeating layers to GPU
5月 26 10:58:25 ai-tm ollama[40463]: load_tensors: offloaded 16/49 layers to GPU
5月 26 10:58:25 ai-tm ollama[40463]: load_tensors:        CUDA0 model buffer size =  2705.13 MiB
5月 26 10:58:25 ai-tm ollama[40463]: load_tensors:        CUDA1 model buffer size =  2866.45 MiB
5月 26 10:58:25 ai-tm ollama[40463]: load_tensors:   CPU_Mapped model buffer size = 11317.70 MiB
5月 26 10:58:26 ai-tm ollama[40463]: llama_kv_cache_unified: kv_size = 40960, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1, padding = 32
root@ai-tm:/data/ollama# 

Relevant log output

root@ai-tm:/data/ollama# journalctl -u ollama --since "5 minutes ago" | grep -E "(layer|gpu|load|offload)"
5月 26 10:58:23 ai-tm ollama[40463]: time=2025-05-26T10:58:23.170+08:00 level=INFO source=server.go:168 msg=offload library=cuda layers.requested=-1 layers.model=49 layers.offload=16 layers.split=8,8 memory.available="[9.5 GiB 9.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="31.8 GiB" memory.required.partial="18.4 GiB" memory.required.kv="3.8 GiB" memory.required.allocations="[9.2 GiB 9.2 GiB]" memory.weights.total="16.3 GiB" memory.weights.repeating="16.1 GiB" memory.weights.nonrepeating="243.4 MiB" memory.graph.full="5.0 GiB" memory.graph.partial="5.0 GiB"
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: loaded meta data with 35 key-value pairs and 579 tensors from /data/ollama/models/blobs/sha256-263346cdf8c9824cc332d2b00a84100a5be231ac600e7a875c6a2b47c9802f57 (version GGUF V3 (latest))
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv   1:                               general.type str              = model
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv   2:                               general.name str              = Qwen3-30B-A3B
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv   3:                           general.basename str              = Qwen3-30B-A3B
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv   4:                       general.quantized_by str              = Unsloth
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv   5:                         general.size_label str              = 30B-A3B
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv   6:                           general.repo_url str              = https://huggingface.co/unsloth
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv   7:                       qwen3moe.block_count u32              = 48
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv   8:                    qwen3moe.context_length u32              = 40960
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv   9:                  qwen3moe.embedding_length u32              = 2048
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  10:               qwen3moe.feed_forward_length u32              = 6144
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  11:              qwen3moe.attention.head_count u32              = 32
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  12:           qwen3moe.attention.head_count_kv u32              = 4
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  13:                    qwen3moe.rope.freq_base f32              = 1000000.000000
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  14:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  15:                 qwen3moe.expert_used_count u32              = 8
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  16:              qwen3moe.attention.key_length u32              = 128
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  17:            qwen3moe.attention.value_length u32              = 128
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  18:                      qwen3moe.expert_count u32              = 128
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  19:        qwen3moe.expert_feed_forward_length u32              = 768
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = qwen2
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  25:                tokenizer.ggml.eos_token_id u32              = 151645
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  26:            tokenizer.ggml.padding_token_id u32              = 151654
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  27:               tokenizer.ggml.add_bos_token bool             = false
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  29:               general.quantization_version u32              = 2
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  30:                          general.file_type u32              = 15
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  31:                      quantize.imatrix.file str              = Qwen3-30B-A3B-GGUF/imatrix_unsloth.dat
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  32:                   quantize.imatrix.dataset str              = unsloth_calibration_Qwen3-30B-A3B.txt
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  33:             quantize.imatrix.entries_count i32              = 384
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  34:              quantize.imatrix.chunks_count i32              = 685
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - type  f32:  241 tensors
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - type q4_K:  290 tensors
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - type q5_K:   37 tensors
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - type q6_K:   11 tensors
5月 26 10:58:23 ai-tm ollama[40463]: load: special tokens cache size = 26
5月 26 10:58:23 ai-tm ollama[40463]: load: token to piece cache size = 0.9311 MB
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_load: vocab only - skipping tensors
5月 26 10:58:23 ai-tm ollama[40463]: time=2025-05-26T10:58:23.515+08:00 level=INFO source=server.go:431 msg="starting llama server" cmd="/usr/local/bin/ollama runner --model /data/ollama/models/blobs/sha256-263346cdf8c9824cc332d2b00a84100a5be231ac600e7a875c6a2b47c9802f57 --ctx-size 40960 --batch-size 512 --n-gpulayers 16 --threads 20 --parallel 1 --tensor-split 8,8 --port 36765"
5月 26 10:58:23 ai-tm ollama[40463]: time=2025-05-26T10:58:23.515+08:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
5月 26 10:58:23 ai-tm ollama[40463]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-skylakex.so
5月 26 10:58:23 ai-tm ollama[40463]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3080) - 9773 MiB free
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3080) - 9773 MiB free
5月 26 10:58:23 ai-tm ollama[40463]: time=2025-05-26T10:58:23.767+08:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: loaded meta data with 35 key-value pairs and 579 tensors from /data/ollama/models/blobs/sha256-263346cdf8c9824cc332d2b00a84100a5be231ac600e7a875c6a2b47c9802f57 (version GGUF V3 (latest))
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv   1:                               general.type str              = model
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv   2:                               general.name str              = Qwen3-30B-A3B
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv   3:                           general.basename str              = Qwen3-30B-A3B
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv   4:                       general.quantized_by str              = Unsloth
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv   5:                         general.size_label str              = 30B-A3B
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv   6:                           general.repo_url str              = https://huggingface.co/unsloth
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv   7:                       qwen3moe.block_count u32              = 48
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv   8:                    qwen3moe.context_length u32              = 40960
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv   9:                  qwen3moe.embedding_length u32              = 2048
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  10:               qwen3moe.feed_forward_length u32              = 6144
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  11:              qwen3moe.attention.head_count u32              = 32
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  12:           qwen3moe.attention.head_count_kv u32              = 4
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  13:                    qwen3moe.rope.freq_base f32              = 1000000.000000
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  14:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  15:                 qwen3moe.expert_used_count u32              = 8
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  16:              qwen3moe.attention.key_length u32              = 128
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  17:            qwen3moe.attention.value_length u32              = 128
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  18:                      qwen3moe.expert_count u32              = 128
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  19:        qwen3moe.expert_feed_forward_length u32              = 768
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = qwen2
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  25:                tokenizer.ggml.eos_token_id u32              = 151645
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  26:            tokenizer.ggml.padding_token_id u32              = 151654
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  27:               tokenizer.ggml.add_bos_token bool             = false
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  29:               general.quantization_version u32              = 2
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  30:                          general.file_type u32              = 15
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  31:                      quantize.imatrix.file str              = Qwen3-30B-A3B-GGUF/imatrix_unsloth.dat
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  32:                   quantize.imatrix.dataset str              = unsloth_calibration_Qwen3-30B-A3B.txt
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  33:             quantize.imatrix.entries_count i32              = 384
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - kv  34:              quantize.imatrix.chunks_count i32              = 685
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - type  f32:  241 tensors
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - type q4_K:  290 tensors
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - type q5_K:   37 tensors
5月 26 10:58:23 ai-tm ollama[40463]: llama_model_loader: - type q6_K:   11 tensors
5月 26 10:58:24 ai-tm ollama[40463]: load: special tokens cache size = 26
5月 26 10:58:24 ai-tm ollama[40463]: load: token to piece cache size = 0.9311 MB
5月 26 10:58:24 ai-tm ollama[40463]: print_info: n_layer          = 48
5月 26 10:58:24 ai-tm ollama[40463]: load_tensors: loading model tensors, this can take a while... (mmap = true)
5月 26 10:58:25 ai-tm ollama[40463]: load_tensors: offloading 16 repeating layers to GPU
5月 26 10:58:25 ai-tm ollama[40463]: load_tensors: offloaded 16/49 layers to GPU
5月 26 10:58:25 ai-tm ollama[40463]: load_tensors:        CUDA0 model buffer size =  2705.13 MiB
5月 26 10:58:25 ai-tm ollama[40463]: load_tensors:        CUDA1 model buffer size =  2866.45 MiB
5月 26 10:58:25 ai-tm ollama[40463]: load_tensors:   CPU_Mapped model buffer size = 11317.70 MiB
5月 26 10:58:26 ai-tm ollama[40463]: llama_kv_cache_unified: kv_size = 40960, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1, padding = 32
root@ai-tm:/data/ollama#

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

ollama version is 0.7.1

GiteaMirror added the bug label 2026-04-22 14:48:42 -05:00

@rick-github commented on GitHub (Feb 9, 2026):

5月 26 10:58:23 ai-tm ollama[40463]: time=2025-05-26T10:58:23.170+08:00 level=INFO source=server.go:168 msg=offload library=cuda layers.requested=-1 layers.model=49 layers.offload=16 layers.split=8,8 memory.available="[9.5 GiB 9.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="31.8 GiB" memory.required.partial="18.4 GiB" memory.required.kv="3.8 GiB" memory.required.allocations="[9.2 GiB 9.2 GiB]" memory.weights.total="16.3 GiB" memory.weights.repeating="16.1 GiB" memory.weights.nonrepeating="243.4 MiB" memory.graph.full="5.0 GiB" memory.graph.partial="5.0 GiB"

5月 26 10:58:26 ai-tm ollama[40463]: llama_kv_cache_unified: kv_size = 40960, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1, padding = 32

There is 9.5 + 9.5 GiB available on the GPUs. Because of the 40k-token context, fully offloading the model would need 31.8 GiB of VRAM. Since only 19 GiB is available, only a portion of the model can be loaded into VRAM. The partial load uses 9.2 + 9.2 = 18.4 GiB of the available 19 GiB.
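
As a sanity check, the 3.8 GiB KV figure is consistent with the model metadata in the log: 48 layers × 4 KV heads × 128 head dim × 2 (K and V) × 2 bytes (f16) × 40960 tokens ≈ 3.75 GiB. The usual way to fit more (or all) layers into ~19 GiB is to request a smaller context. A minimal sketch against the standard /api/generate endpoint; the num_ctx value that actually fits here is untested:

    curl http://localhost:11434/api/generate -d '{
      "model": "hf.co/unsloth/Qwen3-30B-A3B-GGUF:UD-Q4_K_XL",
      "prompt": "Hello",
      "options": { "num_ctx": 8192 }
    }'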

Reference: github-starred/ollama#32892