[GH-ISSUE #6994] Docker container cannot load model #30188

Closed
opened 2026-04-22 09:42:49 -05:00 by GiteaMirror · 1 comment

Originally created by @somnifex on GitHub (Sep 27, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6994

What is the issue?

Whether I use `ollama run` or `curl`, the model cannot be loaded into GPU memory.
The `docker logs ollama` output from starting the server and loading the model is as follows:

```
2024/09/27 05:29:20 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:2562047h47m16.854775807s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:10 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:20 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2024-09-27T05:29:20.148Z level=INFO source=images.go:753 msg="total blobs: 24"
time=2024-09-27T05:29:20.148Z level=INFO source=images.go:760 msg="total unused blobs removed: 0"
time=2024-09-27T05:29:20.148Z level=INFO source=routes.go:1200 msg="Listening on [::]:11434 (version 0.3.11)"
time=2024-09-27T05:29:20.149Z level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cuda_v12 cpu cpu_avx cpu_avx2 cuda_v11]"
time=2024-09-27T05:29:20.149Z level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
time=2024-09-27T05:29:24.814Z level=INFO source=types.go:107 msg="inference compute" id=GPU-e8ee7d42-72a9-d27d-ef76-dfa4df69bf0f library=cuda variant=v12 compute=8.6 driver=12.4 name="NVIDIA RTX A6000" total="47.5 GiB" available="47.3 GiB"
time=2024-09-27T05:29:24.814Z level=INFO source=types.go:107 msg="inference compute" id=GPU-2325557b-dcde-10cd-b219-60ed716aa9ef library=cuda variant=v12 compute=8.6 driver=12.4 name="NVIDIA RTX A6000" total="47.5 GiB" available="47.3 GiB"
[GIN] 2024/09/27 - 05:31:07 | 200 |       60.09µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/09/27 - 05:31:07 | 200 |   58.078071ms |       127.0.0.1 | POST     "/api/show"
time=2024-09-27T05:31:07.477Z level=INFO source=sched.go:730 msg="new model will fit in available VRAM, loading" model=/root/.ollama/models/blobs/sha256-6e7fdda508e91cb0f63de5c15ff79ac63a1584ccafd751c07ca12b7f442101b8 library=cuda parallel=20 required="71.1 GiB"
time=2024-09-27T05:31:07.477Z level=INFO source=server.go:103 msg="system memory" total="503.5 GiB" free="495.5 GiB" free_swap="0 B"
time=2024-09-27T05:31:07.481Z level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=81 layers.offload=81 layers.split=41,40 memory.available="[47.3 GiB 47.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="71.1 GiB" memory.required.partial="71.1 GiB" memory.required.kv="12.5 GiB" memory.required.allocations="[36.1 GiB 35.0 GiB]" memory.weights.total="55.0 GiB" memory.weights.repeating="54.1 GiB" memory.weights.nonrepeating="974.6 MiB" memory.graph.full="6.4 GiB" memory.graph.partial="6.4 GiB"
time=2024-09-27T05:31:07.488Z level=INFO source=server.go:388 msg="starting llama server" cmd="/usr/lib/ollama/runners/cuda_v12/ollama_llama_server --model /root/.ollama/models/blobs/sha256-6e7fdda508e91cb0f63de5c15ff79ac63a1584ccafd751c07ca12b7f442101b8 --ctx-size 40960 --batch-size 512 --embedding --log-disable --n-gpu-layers 81 --parallel 20 --tensor-split 41,40 --port 39693"
time=2024-09-27T05:31:07.488Z level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2024-09-27T05:31:07.488Z level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
time=2024-09-27T05:31:07.489Z level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=10 commit="eaf151c" tid="140015820541952" timestamp=1727415067
INFO [main] system info | n_threads=48 n_threads_batch=48 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140015820541952" timestamp=1727415067 total_threads=96
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="95" port="39693" tid="140015820541952" timestamp=1727415067
llama_model_loader: loaded meta data with 35 key-value pairs and 963 tensors from /root/.ollama/models/blobs/sha256-6e7fdda508e91cb0f63de5c15ff79ac63a1584ccafd751c07ca12b7f442101b8 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2.5 72B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen2.5
llama_model_loader: - kv   5:                         general.size_label str              = 72B
llama_model_loader: - kv   6:                            general.license str              = other
llama_model_loader: - kv   7:                       general.license.name str              = qwen
llama_model_loader: - kv   8:                       general.license.link str              = https://huggingface.co/Qwen/Qwen2.5-7...
llama_model_loader: - kv   9:                   general.base_model.count u32              = 1
llama_model_loader: - kv  10:                  general.base_model.0.name str              = Qwen2.5 72B
llama_model_loader: - kv  11:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  12:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2.5-72B
llama_model_loader: - kv  13:                               general.tags arr[str,2]       = ["chat", "text-generation"]
llama_model_loader: - kv  14:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  15:                          qwen2.block_count u32              = 80
llama_model_loader: - kv  16:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv  17:                     qwen2.embedding_length u32              = 8192
llama_model_loader: - kv  18:                  qwen2.feed_forward_length u32              = 29568
llama_model_loader: - kv  19:                 qwen2.attention.head_count u32              = 64
llama_model_loader: - kv  20:              qwen2.attention.head_count_kv u32              = 8
llama_model_loader: - kv  21:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  22:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  23:                          general.file_type u32              = 15
llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  34:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  401 tensors
llama_model_loader: - type q5_0:   40 tensors
llama_model_loader: - type q8_0:   40 tensors
llama_model_loader: - type q4_K:  401 tensors
llama_model_loader: - type q5_K:   40 tensors
llama_model_loader: - type q6_K:   41 tensors
time=2024-09-27T05:31:07.741Z level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 29568
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 72.71 B
llm_load_print_meta: model size       = 44.15 GiB (5.22 BPW) 
llm_load_print_meta: general.name     = Qwen2.5 72B Instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
  Device 1: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    1.27 MiB
```
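
The load stalls right after `llm_load_tensors` allocates its context, before any CUDA buffers are reported, and the server config above shows `OLLAMA_LOAD_TIMEOUT:5m0s`. As a rough troubleshooting sketch (not part of the original report; the container name, volume name, and image tag below are assumptions), one could restart the container with debug logging and a longer load timeout to see whether the load eventually completes:

```bash
# Hypothetical re-run with verbose logs and a longer load timeout.
# Container/volume names and image tag are assumptions, not from the issue.
docker rm -f ollama
docker run -d --gpus=all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  -e OLLAMA_DEBUG=1 \
  -e OLLAMA_LOAD_TIMEOUT=30m \
  --name ollama ollama/ollama:0.3.12
# Follow the logs while the model loads.
docker logs -f ollama
```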

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.3.12

GiteaMirror added the bug label 2026-04-22 09:42:49 -05:00

@somnifex commented on GitHub (Sep 27, 2024):

I think I have found the cause; it might be that my disk I/O speed is too slow. I will migrate to an SSD and try again.
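
A quick way to test that hypothesis (a sketch, not something run in the issue; it assumes the container is named `ollama` and that `dd` is available in the image) is to measure sequential read throughput on the model blob, using the blob path from the log above:

```bash
# Hypothetical check: read the first 4 GiB of the model blob and report throughput.
# The blob path is taken from the log above; results may be inflated if the
# file is already in the page cache.
BLOB=/root/.ollama/models/blobs/sha256-6e7fdda508e91cb0f63de5c15ff79ac63a1584ccafd751c07ca12b7f442101b8
docker exec ollama dd if="$BLOB" of=/dev/null bs=1M count=4096 status=progress
```

For reference, the log reports a 44.15 GiB model, so a disk sustaining roughly 150 MB/s would need around five minutes just to read the weights, which is close to the default 5-minute load timeout shown in the server config.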
