[GH-ISSUE #9245] did not use gpu #52535

Closed
opened 2026-04-28 23:37:16 -05:00 by GiteaMirror · 8 comments
Owner

Originally created by @Hsq12138 on GitHub (Feb 20, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9245

What is the issue?

I found that when running the deepseek14b model, it only uses the CPU, with the GPU usage at 0 in the task manager.

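One quick way to confirm where inference is actually running, independent of Task Manager, is to ask Ollama directly. A minimal sketch, assuming the model was pulled under the deepseek-r1:14b tag:

```shell
# Load the model, then ask Ollama which processor the runner landed on.
# A healthy GPU setup reports something like "100% GPU" in the PROCESSOR
# column of `ollama ps`; a CPU fallback reports "100% CPU".
ollama run deepseek-r1:14b "hello"
ollama ps
```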
Relevant log output

2025/02/20 15:45:52 routes.go:1186: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:D:\\ollama\\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-02-20T15:45:52.074+08:00 level=INFO source=images.go:432 msg="total blobs: 7"
time=2025-02-20T15:45:52.075+08:00 level=INFO source=images.go:439 msg="total unused blobs removed: 0"
time=2025-02-20T15:45:52.075+08:00 level=INFO source=routes.go:1237 msg="Listening on [::]:11434 (version 0.5.11)"
time=2025-02-20T15:45:52.075+08:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-02-20T15:45:52.075+08:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-02-20T15:45:52.075+08:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=6 efficiency=0 threads=12
time=2025-02-20T15:45:52.167+08:00 level=INFO source=gpu.go:319 msg="detected OS VRAM overhead" id=GPU-ecc0382b-7d7c-7b61-8572-a21b10ac9fcd library=cuda compute=8.9 driver=12.8 name="NVIDIA GeForce RTX 4080 SUPER" overhead="176.2 MiB"
time=2025-02-20T15:45:52.168+08:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-ecc0382b-7d7c-7b61-8572-a21b10ac9fcd library=cuda variant=v12 compute=8.9 driver=12.8 name="NVIDIA GeForce RTX 4080 SUPER" total="16.0 GiB" available="14.7 GiB"
[GIN] 2025/02/20 - 15:49:16 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/02/20 - 15:49:16 | 200 |       525.4µs |       127.0.0.1 | GET      "/api/tags"
[GIN] 2025/02/20 - 15:49:31 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/02/20 - 15:49:31 | 200 |     15.0919ms |       127.0.0.1 | POST     "/api/show"
time=2025-02-20T15:49:31.524+08:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=D:\ollama\models\blobs\sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e gpu=GPU-ecc0382b-7d7c-7b61-8572-a21b10ac9fcd parallel=4 available=15652315136 required="10.8 GiB"
time=2025-02-20T15:49:31.536+08:00 level=INFO source=server.go:100 msg="system memory" total="95.8 GiB" free="86.6 GiB" free_swap="88.1 GiB"
time=2025-02-20T15:49:31.537+08:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=49 layers.offload=49 layers.split="" memory.available="[14.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="10.8 GiB" memory.required.partial="10.8 GiB" memory.required.kv="1.5 GiB" memory.required.allocations="[10.8 GiB]" memory.weights.total="8.9 GiB" memory.weights.repeating="8.3 GiB" memory.weights.nonrepeating="609.1 MiB" memory.graph.full="676.0 MiB" memory.graph.partial="916.1 MiB"
time=2025-02-20T15:49:31.543+08:00 level=INFO source=server.go:380 msg="starting llama server" cmd="C:\\Users\\zrway\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model D:\\ollama\\models\\blobs\\sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e --ctx-size 8192 --batch-size 512 --n-gpu-layers 49 --threads 6 --no-mmap --parallel 4 --port 50269"
time=2025-02-20T15:49:31.594+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-02-20T15:49:31.594+08:00 level=INFO source=server.go:557 msg="waiting for llama runner to start responding"
time=2025-02-20T15:49:31.594+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server error"
time=2025-02-20T15:49:31.610+08:00 level=INFO source=runner.go:936 msg="starting go runner"
time=2025-02-20T15:49:31.616+08:00 level=INFO source=runner.go:937 msg=system info="CPU : LLAMAFILE = 1 | CPU : LLAMAFILE = 1 | cgo(clang)" threads=6
time=2025-02-20T15:49:31.616+08:00 level=INFO source=runner.go:995 msg="Server listening on 127.0.0.1:50269"
ggml_backend_load_best: failed to load C:\Users\zrway\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
ggml_backend_load_best: failed to load C:\Users\zrway\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll
ggml_backend_load_best: failed to load C:\Users\zrway\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
ggml_backend_load_best: failed to load C:\Users\zrway\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-sandybridge.dll
ggml_backend_load_best: failed to load C:\Users\zrway\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-skylakex.dll
llama_model_loader: loaded meta data with 26 key-value pairs and 579 tensors from D:\ollama\models\blobs\sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Qwen 14B
llama_model_loader: - kv   3:                           general.basename str              = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv   4:                         general.size_label str              = 14B
llama_model_loader: - kv   5:                          qwen2.block_count u32              = 48
llama_model_loader: - kv   6:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   7:                     qwen2.embedding_length u32              = 5120
llama_model_loader: - kv   8:                  qwen2.feed_forward_length u32              = 13824
llama_model_loader: - kv   9:                 qwen2.attention.head_count u32              = 40
llama_model_loader: - kv  10:              qwen2.attention.head_count_kv u32              = 8
llama_model_loader: - kv  11:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  12:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  13:                          general.file_type u32              = 15
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  18:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 151646
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  241 tensors
llama_model_loader: - type q4_K:  289 tensors
llama_model_loader: - type q6_K:   49 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_layer          = 48
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 5
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 13824
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 14B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 14.77 B
llm_load_print_meta: model size       = 8.37 GiB (4.87 BPW) 
llm_load_print_meta: general.name     = DeepSeek R1 Distill Qwen 14B
llm_load_print_meta: BOS token        = 151646 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token        = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: EOT token        = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token        = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token    = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token    = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token    = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token    = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token    = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token    = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token        = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: EOG token        = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token        = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token        = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors:          CPU model buffer size =  8566.04 MiB
time=2025-02-20T15:49:31.845+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server loading model"
llama_new_context_with_model: n_seq_max     = 4
llama_new_context_with_model: n_ctx         = 8192
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 1000000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 8192, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1
llama_kv_cache_init:        CPU KV buffer size =  1536.00 MiB
llama_new_context_with_model: KV self size  = 1536.00 MiB, K (f16):  768.00 MiB, V (f16):  768.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     2.40 MiB
llama_new_context_with_model:        CPU compute buffer size =   696.01 MiB
llama_new_context_with_model: graph nodes  = 1686
llama_new_context_with_model: graph splits = 1
[GIN] 2025/02/20 - 15:49:38 | 200 |    7.1239561s |       127.0.0.1 | POST     "/api/generate"
time=2025-02-20T15:49:38.605+08:00 level=INFO source=server.go:596 msg="llama runner started in 7.01 seconds"
[GIN] 2025/02/20 - 15:49:54 | 200 |     7.687977s |       127.0.0.1 | POST     "/api/chat"

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.5.11

GiteaMirror added the bug label 2026-04-28 23:37:16 -05:00
Author
Owner

@Hsq12138 commented on GitHub (Feb 20, 2025):

My GPU is a 4080 Super, CUDA 12.8.

Author
Owner

@rick-github commented on GitHub (Feb 20, 2025):

ggml_backend_load_best: failed to load C:\Users\zrway\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
ggml_backend_load_best: failed to load C:\Users\zrway\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll
ggml_backend_load_best: failed to load C:\Users\zrway\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
ggml_backend_load_best: failed to load C:\Users\zrway\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-sandybridge.dll
ggml_backend_load_best: failed to load C:\Users\zrway\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-skylakex.dll

ollama can't find the backends. Maybe try re-installing.

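Before reinstalling, it may be worth checking whether the backend libraries exist on disk at all. A hedged check, using the install path from the log above:

```shell
# List the ggml backend DLLs the runner tried to load. An empty or
# missing directory points at an incomplete install rather than a
# GPU/driver problem.
dir "C:\Users\zrway\AppData\Local\Programs\Ollama\lib\ollama"
```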
Author
Owner

@Hsq12138 commented on GitHub (Feb 21, 2025):

> ggml_backend_load_best: failed to load C:\Users\zrway\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
> ggml_backend_load_best: failed to load C:\Users\zrway\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll
> ggml_backend_load_best: failed to load C:\Users\zrway\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
> ggml_backend_load_best: failed to load C:\Users\zrway\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-sandybridge.dll
> ggml_backend_load_best: failed to load C:\Users\zrway\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-skylakex.dll
>
> ollama can't find the backends. Maybe try re-installing.

I tried reinstalling many times, but the problem still exists.

Author
Owner

@lry789 commented on GitHub (Feb 21, 2025):

I have the same problem

Author
Owner

@rick-github commented on GitHub (Feb 21, 2025):

Set OLLAMA_DEBUG=1 in the server environment (see https://github.com/ollama/ollama/blob/main/docs/faq.md#how-do-i-configure-ollama-server) and post the resulting logs.

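For a systemd-managed install (as in the reply that follows), a minimal sketch of enabling debug logging per the FAQ linked above; on Windows the equivalent is setting OLLAMA_DEBUG=1 as a user environment variable and restarting the tray app:

```shell
# Add the variable to the service environment, restart, and tail the logs.
sudo systemctl edit ollama.service   # in the drop-in, add under [Service]:
                                     #   Environment="OLLAMA_DEBUG=1"
sudo systemctl restart ollama
journalctl -u ollama -f              # debug-level output appears here
```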
Author
Owner

@lry789 commented on GitHub (Feb 21, 2025):

-- Logs begin at Thu 2023-11-16 11:05:42 CST. --
2月 21 13:31:11 tc ollama[285585]: CUDA driver version: 12.3
2月 21 13:31:11 tc ollama[285585]: calling cuDeviceGetCount
2月 21 13:31:11 tc ollama[285585]: device count 1
2月 21 13:31:11 tc ollama[285585]: time=2025-02-21T13:31:11.754+08:00 level=DEBUG source=gpu.go:125 msg="detected GPUs" count=1 library=/usr/lib/x86_64-linux-gnu/libcuda.so.545.23.08
2月 21 13:31:11 tc ollama[285585]: [GPU-433dad3d-903e-13f1-25aa-8888bfae0c86] CUDA totalMem 16076 mb
2月 21 13:31:11 tc ollama[285585]: [GPU-433dad3d-903e-13f1-25aa-8888bfae0c86] CUDA freeMem 15811 mb
2月 21 13:31:11 tc ollama[285585]: [GPU-433dad3d-903e-13f1-25aa-8888bfae0c86] Compute Capability 8.9
2月 21 13:31:11 tc ollama[285585]: time=2025-02-21T13:31:11.961+08:00 level=DEBUG source=amd_linux.go:419 msg="amdgpu driver not detected /sys/module/amdgpu"
2月 21 13:31:11 tc ollama[285585]: releasing cuda driver library
2月 21 13:31:11 tc ollama[285585]: time=2025-02-21T13:31:11.961+08:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-433dad3d-903e-13f1-25aa-8888bfae0c86 library=cuda variant=v12 compute=8.9 driver=12.3 name="NVIDIA GeForce RTX 4080" total="15.7 GiB" available="15.4 GiB"


2月 21 13:33:07 tc ollama[285585]: [GIN] 2025/02/21 - 13:33:07 | 200 |     794.342µs |  192.168.30.134 | GET      "/api/tags"
2月 21 13:33:07 tc ollama[285585]: [GIN] 2025/02/21 - 13:33:07 | 200 |      608.09µs |  192.168.30.134 | GET      "/api/tags"
2月 21 13:33:07 tc ollama[285585]: [GIN] 2025/02/21 - 13:33:07 | 200 |      73.143µs |  192.168.30.134 | GET      "/"
2月 21 13:33:10 tc ollama[285585]: time=2025-02-21T13:33:10.934+08:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="62.7 GiB" before.free="60.5 GiB" before.free_swap="2.0 GiB" now.total="62.7 GiB" now.free="60.4 GiB" now.free_swap="2.0 GiB"
2月 21 13:33:10 tc ollama[285585]: initializing /usr/lib/x86_64-linux-gnu/libcuda.so.545.23.08
2月 21 13:33:10 tc ollama[285585]: dlsym: cuInit - 0x7fc0ae644c30
2月 21 13:33:10 tc ollama[285585]: dlsym: cuDriverGetVersion - 0x7fc0ae644c50
2月 21 13:33:10 tc ollama[285585]: dlsym: cuDeviceGetCount - 0x7fc0ae644c90
2月 21 13:33:10 tc ollama[285585]: dlsym: cuDeviceGet - 0x7fc0ae644c70
2月 21 13:33:10 tc ollama[285585]: dlsym: cuDeviceGetAttribute - 0x7fc0ae644d70
2月 21 13:33:10 tc ollama[285585]: dlsym: cuDeviceGetUuid - 0x7fc0ae644cd0
2月 21 13:33:10 tc ollama[285585]: dlsym: cuDeviceGetName - 0x7fc0ae644cb0
2月 21 13:33:10 tc ollama[285585]: dlsym: cuCtxCreate_v3 - 0x7fc0ae644f50
2月 21 13:33:10 tc ollama[285585]: dlsym: cuMemGetInfo_v2 - 0x7fc0ae64ec40
2月 21 13:33:10 tc ollama[285585]: dlsym: cuCtxDestroy - 0x7fc0ae69e380
2月 21 13:33:10 tc ollama[285585]: calling cuInit
2月 21 13:33:10 tc ollama[285585]: calling cuDriverGetVersion
2月 21 13:33:10 tc ollama[285585]: raw version 0x2efe
2月 21 13:33:10 tc ollama[285585]: CUDA driver version: 12.3
2月 21 13:33:10 tc ollama[285585]: calling cuDeviceGetCount
2月 21 13:33:10 tc ollama[285585]: device count 1
2月 21 13:33:11 tc ollama[285585]: time=2025-02-21T13:33:11.121+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-433dad3d-903e-13f1-25aa-8888bfae0c86 name="NVIDIA GeForce RTX 4080" overhead="0 B" before.total="15.7 GiB" before.free="15.4 GiB" now.total="15.7 GiB" now.free="15.4 GiB" now.used="265.1 MiB"
2月 21 13:33:11 tc ollama[285585]: releasing cuda driver library
2月 21 13:33:11 tc ollama[285585]: time=2025-02-21T13:33:11.164+08:00 level=DEBUG source=sched.go:224 msg="loading first model" model=/home/tc/.ollama/models/blobs/sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e
2月 21 13:33:11 tc ollama[285585]: time=2025-02-21T13:33:11.164+08:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[15.4 GiB]"
2月 21 13:33:11 tc ollama[285585]: time=2025-02-21T13:33:11.166+08:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/home/tc/.ollama/models/blobs/sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e gpu=GPU-433dad3d-903e-13f1-25aa-8888bfae0c86 parallel=1 available=16579887104 required="9.2 GiB"
2月 21 13:33:11 tc ollama[285585]: time=2025-02-21T13:33:11.166+08:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="62.7 GiB" before.free="60.4 GiB" before.free_swap="2.0 GiB" now.total="62.7 GiB" now.free="60.4 GiB" now.free_swap="2.0 GiB"
2月 21 13:33:11 tc ollama[285585]: initializing /usr/lib/x86_64-linux-gnu/libcuda.so.545.23.08
2月 21 13:33:11 tc ollama[285585]: dlsym: cuInit - 0x7fc0ae644c30
2月 21 13:33:11 tc ollama[285585]: dlsym: cuDriverGetVersion - 0x7fc0ae644c50
2月 21 13:33:11 tc ollama[285585]: dlsym: cuDeviceGetCount - 0x7fc0ae644c90
2月 21 13:33:11 tc ollama[285585]: dlsym: cuDeviceGet - 0x7fc0ae644c70
2月 21 13:33:11 tc ollama[285585]: dlsym: cuDeviceGetAttribute - 0x7fc0ae644d70
2月 21 13:33:11 tc ollama[285585]: dlsym: cuDeviceGetUuid - 0x7fc0ae644cd0
2月 21 13:33:11 tc ollama[285585]: dlsym: cuDeviceGetName - 0x7fc0ae644cb0
2月 21 13:33:11 tc ollama[285585]: dlsym: cuCtxCreate_v3 - 0x7fc0ae644f50
2月 21 13:33:11 tc ollama[285585]: dlsym: cuMemGetInfo_v2 - 0x7fc0ae64ec40
2月 21 13:33:11 tc ollama[285585]: dlsym: cuCtxDestroy - 0x7fc0ae69e380
2月 21 13:33:11 tc ollama[285585]: calling cuInit
2月 21 13:33:11 tc ollama[285585]: calling cuDriverGetVersion
2月 21 13:33:11 tc ollama[285585]: raw version 0x2efe
2月 21 13:33:11 tc ollama[285585]: CUDA driver version: 12.3
2月 21 13:33:11 tc ollama[285585]: calling cuDeviceGetCount
2月 21 13:33:11 tc ollama[285585]: device count 1
2月 21 13:33:11 tc ollama[285585]: time=2025-02-21T13:33:11.342+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-433dad3d-903e-13f1-25aa-8888bfae0c86 name="NVIDIA GeForce RTX 4080" overhead="0 B" before.total="15.7 GiB" before.free="15.4 GiB" now.total="15.7 GiB" now.free="15.4 GiB" now.used="265.1 MiB"
2月 21 13:33:11 tc ollama[285585]: releasing cuda driver library
2月 21 13:33:11 tc ollama[285585]: time=2025-02-21T13:33:11.342+08:00 level=INFO source=server.go:100 msg="system memory" total="62.7 GiB" free="60.4 GiB" free_swap="2.0 GiB"
2月 21 13:33:11 tc ollama[285585]: time=2025-02-21T13:33:11.342+08:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[15.4 GiB]"
2月 21 13:33:11 tc ollama[285585]: time=2025-02-21T13:33:11.343+08:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=49 layers.offload=49 layers.split="" memory.available="[15.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="9.2 GiB" memory.required.partial="9.2 GiB" memory.required.kv="384.0 MiB" memory.required.allocations="[9.2 GiB]" memory.weights.total="7.7 GiB" memory.weights.repeating="7.1 GiB" memory.weights.nonrepeating="609.1 MiB" memory.graph.full="307.0 MiB" memory.graph.partial="916.1 MiB"
2月 21 13:33:11 tc ollama[285585]: time=2025-02-21T13:33:11.343+08:00 level=DEBUG source=server.go:262 msg="compatible gpu libraries" compatible=[]
2月 21 13:33:11 tc ollama[285585]: time=2025-02-21T13:33:11.343+08:00 level=INFO source=server.go:380 msg="starting llama server" cmd="/usr/local/bin/ollama runner --model /home/tc/.ollama/models/blobs/sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e --ctx-size 2048 --batch-size 512 --n-gpu-layers 49 --verbose --threads 28 --parallel 1 --port 45965"
2月 21 13:33:11 tc ollama[285585]: time=2025-02-21T13:33:11.343+08:00 level=DEBUG source=server.go:398 msg=subprocess environment="[PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin CUDA_VISIBLE_DEVICES=GPU-433dad3d-903e-13f1-25aa-8888bfae0c86 LD_LIBRARY_PATH=/usr/local/bin]"
2月 21 13:33:11 tc ollama[285585]: time=2025-02-21T13:33:11.344+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
2月 21 13:33:11 tc ollama[285585]: time=2025-02-21T13:33:11.344+08:00 level=INFO source=server.go:557 msg="waiting for llama runner to start responding"
2月 21 13:33:11 tc ollama[285585]: time=2025-02-21T13:33:11.344+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server error"
2月 21 13:33:11 tc ollama[285585]: time=2025-02-21T13:33:11.387+08:00 level=INFO source=runner.go:936 msg="starting go runner"
2月 21 13:33:11 tc ollama[285585]: time=2025-02-21T13:33:11.387+08:00 level=INFO source=runner.go:937 msg=system info="CPU : LLAMAFILE = 1 | CPU : LLAMAFILE = 1 | cgo(gcc)" threads=28
2月 21 13:33:11 tc ollama[285585]: time=2025-02-21T13:33:11.387+08:00 level=DEBUG source=ggml.go:89 msg="ggml backend load all from path" path=/usr/local/bin
2月 21 13:33:11 tc ollama[285585]: time=2025-02-21T13:33:11.387+08:00 level=INFO source=runner.go:995 msg="Server listening on 127.0.0.1:45965"
2月 21 13:33:11 tc ollama[285585]: llama_model_loader: loaded meta data with 26 key-value pairs and 579 tensors from /home/tc/.ollama/models/blobs/sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e (version GGUF V3 (latest))
2月 21 13:33:11 tc ollama[285585]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
2月 21 13:33:11 tc ollama[285585]: llama_model_loader: - kv   0:                       general.architecture str              = qwen2
2月 21 13:33:11 tc ollama[285585]: llama_model_loader: - kv   1:                               general.type str              = model
2月 21 13:33:11 tc ollama[285585]: llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Qwen 14B
2月 21 13:33:11 tc ollama[285585]: llama_model_loader: - kv   3:                           general.basename str              = DeepSeek-R1-Distill-Qwen
2月 21 13:33:11 tc ollama[285585]: llama_model_loader: - kv   4:                         general.size_label str              = 14B
2月 21 13:33:11 tc ollama[285585]: llama_model_loader: - kv   5:                          qwen2.block_count u32              = 48
2月 21 13:33:11 tc ollama[285585]: llama_model_loader: - kv   6:                       qwen2.context_length u32              = 131072
2月 21 13:33:11 tc ollama[285585]: llama_model_loader: - kv   7:                     qwen2.embedding_length u32              = 5120
2月 21 13:33:11 tc ollama[285585]: llama_model_loader: - kv   8:                  qwen2.feed_forward_length u32              = 13824
2月 21 13:33:11 tc ollama[285585]: llama_model_loader: - kv   9:                 qwen2.attention.head_count u32              = 40
2月 21 13:33:11 tc ollama[285585]: llama_model_loader: - kv  10:              qwen2.attention.head_count_kv u32              = 8
2月 21 13:33:11 tc ollama[285585]: llama_model_loader: - kv  11:                       qwen2.rope.freq_base f32              = 1000000.000000
2月 21 13:33:11 tc ollama[285585]: llama_model_loader: - kv  12:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000010
2月 21 13:33:11 tc ollama[285585]: llama_model_loader: - kv  13:                          general.file_type u32              = 15
2月 21 13:33:11 tc ollama[285585]: llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2
2月 21 13:33:11 tc ollama[285585]: llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = qwen2
2月 21 13:33:11 tc ollama[285585]: llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
2月 21 13:33:11 tc ollama[285585]: llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
2月 21 13:33:11 tc ollama[285585]: llama_model_loader: - kv  18:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
2月 21 13:33:11 tc ollama[285585]: llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 151646
2月 21 13:33:11 tc ollama[285585]: llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151643
2月 21 13:33:11 tc ollama[285585]: llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
2月 21 13:33:11 tc ollama[285585]: llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
2月 21 13:33:11 tc ollama[285585]: llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
2月 21 13:33:11 tc ollama[285585]: llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
2月 21 13:33:11 tc ollama[285585]: llama_model_loader: - kv  25:               general.quantization_version u32              = 2
2月 21 13:33:11 tc ollama[285585]: llama_model_loader: - type  f32:  241 tensors
2月 21 13:33:11 tc ollama[285585]: llama_model_loader: - type q4_K:  289 tensors
2月 21 13:33:11 tc ollama[285585]: llama_model_loader: - type q6_K:   49 tensors
2月 21 13:33:11 tc ollama[285585]: time=2025-02-21T13:33:11.596+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server loading model"
2月 21 13:33:11 tc ollama[285585]: llm_load_vocab: control token: 151660 '<|fim_middle|>' is not marked as EOG
2月 21 13:33:11 tc ollama[285585]: llm_load_vocab: control token: 151659 '<|fim_prefix|>' is not marked as EOG
2月 21 13:33:11 tc ollama[285585]: llm_load_vocab: control token: 151653 '<|vision_end|>' is not marked as EOG
2月 21 13:33:11 tc ollama[285585]: llm_load_vocab: control token: 151645 '<|Assistant|>' is not marked as EOG
2月 21 13:33:11 tc ollama[285585]: llm_load_vocab: control token: 151644 '<|User|>' is not marked as EOG
2月 21 13:33:11 tc ollama[285585]: llm_load_vocab: control token: 151655 '<|image_pad|>' is not marked as EOG
2月 21 13:33:11 tc ollama[285585]: llm_load_vocab: control token: 151651 '<|quad_end|>' is not marked as EOG
2月 21 13:33:11 tc ollama[285585]: llm_load_vocab: control token: 151646 '<|begin▁of▁sentence|>' is not marked as EOG
2月 21 13:33:11 tc ollama[285585]: llm_load_vocab: control token: 151643 '<|end▁of▁sentence|>' is not marked as EOG
2月 21 13:33:11 tc ollama[285585]: llm_load_vocab: control token: 151652 '<|vision_start|>' is not marked as EOG
2月 21 13:33:11 tc ollama[285585]: llm_load_vocab: control token: 151647 '<|EOT|>' is not marked as EOG
2月 21 13:33:11 tc ollama[285585]: llm_load_vocab: control token: 151654 '<|vision_pad|>' is not marked as EOG
2月 21 13:33:11 tc ollama[285585]: llm_load_vocab: control token: 151656 '<|video_pad|>' is not marked as EOG
2月 21 13:33:11 tc ollama[285585]: llm_load_vocab: control token: 151661 '<|fim_suffix|>' is not marked as EOG
2月 21 13:33:11 tc ollama[285585]: llm_load_vocab: control token: 151650 '<|quad_start|>' is not marked as EOG
2月 21 13:33:11 tc ollama[285585]: llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
2月 21 13:33:11 tc ollama[285585]: llm_load_vocab: special tokens cache size = 22
2月 21 13:33:12 tc ollama[285585]: llm_load_vocab: token to piece cache size = 0.9310 MB
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: format           = GGUF V3 (latest)
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: arch             = qwen2
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: vocab type       = BPE
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: n_vocab          = 152064
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: n_merges         = 151387
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: vocab_only       = 0
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: n_ctx_train      = 131072
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: n_embd           = 5120
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: n_layer          = 48
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: n_head           = 40
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: n_head_kv        = 8
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: n_rot            = 128
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: n_swa            = 0
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: n_embd_head_k    = 128
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: n_embd_head_v    = 128
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: n_gqa            = 5
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: n_embd_k_gqa     = 1024
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: n_embd_v_gqa     = 1024
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: f_norm_eps       = 0.0e+00
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: f_logit_scale    = 0.0e+00
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: n_ff             = 13824
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: n_expert         = 0
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: n_expert_used    = 0
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: causal attn      = 1
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: pooling type     = 0
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: rope type        = 2
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: rope scaling     = linear
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: freq_base_train  = 1000000.0
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: freq_scale_train = 1
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: n_ctx_orig_yarn  = 131072
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: rope_finetuned   = unknown
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: ssm_d_conv       = 0
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: ssm_d_inner      = 0
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: ssm_d_state      = 0
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: ssm_dt_rank      = 0
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: ssm_dt_b_c_rms   = 0
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: model type       = 14B
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: model ftype      = Q4_K - Medium
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: model params     = 14.77 B
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: model size       = 8.37 GiB (4.87 BPW)
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: general.name     = DeepSeek R1 Distill Qwen 14B
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: BOS token        = 151646 '<|begin▁of▁sentence|>'
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: EOS token        = 151643 '<|end▁of▁sentence|>'
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: EOT token        = 151643 '<|end▁of▁sentence|>'
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: PAD token        = 151643 '<|end▁of▁sentence|>'
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: LF token         = 148848 'ÄĬ'
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: FIM PRE token    = 151659 '<|fim_prefix|>'
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: FIM SUF token    = 151661 '<|fim_suffix|>'
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: FIM MID token    = 151660 '<|fim_middle|>'
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: FIM PAD token    = 151662 '<|fim_pad|>'
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: FIM REP token    = 151663 '<|repo_name|>'
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: FIM SEP token    = 151664 '<|file_sep|>'
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: EOG token        = 151643 '<|end▁of▁sentence|>'
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: EOG token        = 151662 '<|fim_pad|>'
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: EOG token        = 151663 '<|repo_name|>'
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: EOG token        = 151664 '<|file_sep|>'
2月 21 13:33:12 tc ollama[285585]: llm_load_print_meta: max token length = 256
2月 21 13:33:12 tc ollama[285585]: llm_load_tensors:   CPU_Mapped model buffer size =  8566.04 MiB
2月 21 13:33:12 tc ollama[285585]: llama_new_context_with_model: n_seq_max     = 1
2月 21 13:33:12 tc ollama[285585]: llama_new_context_with_model: n_ctx         = 2048
2月 21 13:33:12 tc ollama[285585]: llama_new_context_with_model: n_ctx_per_seq = 2048
2月 21 13:33:12 tc ollama[285585]: llama_new_context_with_model: n_batch       = 512
2月 21 13:33:12 tc ollama[285585]: llama_new_context_with_model: n_ubatch      = 512
2月 21 13:33:12 tc ollama[285585]: llama_new_context_with_model: flash_attn    = 0
2月 21 13:33:12 tc ollama[285585]: llama_new_context_with_model: freq_base     = 1000000.0
2月 21 13:33:12 tc ollama[285585]: llama_new_context_with_model: freq_scale    = 1
2月 21 13:33:12 tc ollama[285585]: llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 0: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 1: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 2: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 3: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 4: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 5: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 6: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 7: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 8: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 9: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 10: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 11: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 12: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 13: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 14: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 15: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 16: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 17: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 18: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 19: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 20: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 21: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 22: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 23: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 24: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 25: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 26: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 27: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 28: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 29: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 30: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 31: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 32: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 33: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 34: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 35: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 36: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 37: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 38: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 39: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 40: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 41: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 42: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 43: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 44: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 45: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 46: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init: layer 47: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
2月 21 13:33:12 tc ollama[285585]: time=2025-02-21T13:33:12.601+08:00 level=DEBUG source=server.go:602 msg="model load progress 1.00"
2月 21 13:33:12 tc ollama[285585]: llama_kv_cache_init:        CPU KV buffer size =   384.00 MiB
2月 21 13:33:12 tc ollama[285585]: llama_new_context_with_model: KV self size  =  384.00 MiB, K (f16):  192.00 MiB, V (f16):  192.00 MiB
2月 21 13:33:12 tc ollama[285585]: llama_new_context_with_model:        CPU  output buffer size =     0.60 MiB
2月 21 13:33:12 tc ollama[285585]: llama_new_context_with_model:        CPU compute buffer size =   307.00 MiB
2月 21 13:33:12 tc ollama[285585]: llama_new_context_with_model: graph nodes  = 1686
2月 21 13:33:12 tc ollama[285585]: llama_new_context_with_model: graph splits = 1
2月 21 13:33:12 tc ollama[285585]: time=2025-02-21T13:33:12.852+08:00 level=INFO source=server.go:596 msg="llama runner started in 1.51 seconds"
2月 21 13:33:12 tc ollama[285585]: time=2025-02-21T13:33:12.852+08:00 level=DEBUG source=sched.go:462 msg="finished setting up runner" model=/home/tc/.ollama/models/blobs/sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e
2月 21 13:33:12 tc ollama[285585]: time=2025-02-21T13:33:12.852+08:00 level=DEBUG source=routes.go:1461 msg="chat request" images=0 prompt="<|User|>\n你好<|Assistant|>"
2月 21 13:33:12 tc ollama[285585]: time=2025-02-21T13:33:12.857+08:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=5 used=0 remaining=5
2月 21 13:33:20 tc ollama[285585]: [GIN] 2025/02/21 - 13:33:20 | 200 |    1.734805ms |  192.168.30.134 | GET      "/api/tags"
2月 21 13:33:20 tc ollama[285585]: [GIN] 2025/02/21 - 13:33:20 | 200 |     508.613µs |  192.168.30.134 | GET      "/api/tags"
2月 21 13:33:22 tc ollama[285585]: [GIN] 2025/02/21 - 13:33:22 | 200 | 11.844414043s |  192.168.30.134 | POST     "/api/chat"
2月 21 13:33:22 tc ollama[285585]: time=2025-02-21T13:33:22.747+08:00 level=DEBUG source=sched.go:466 msg="context for request finished"
2月 21 13:33:22 tc ollama[285585]: time=2025-02-21T13:33:22.747+08:00 level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/home/tc/.ollama/models/blobs/sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e duration=24h0m0s
2月 21 13:33:22 tc ollama[285585]: time=2025-02-21T13:33:22.747+08:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=/home/tc/.ollama/models/blobs/sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e refCount=0
2月 21 13:33:23 tc ollama[285585]: [GIN] 2025/02/21 - 13:33:23 | 200 |     664.578µs |  192.168.30.134 | GET      "/api/tags"
2月 21 13:33:23 tc ollama[285585]: [GIN] 2025/02/21 - 13:33:23 | 200 |     572.873µs |  192.168.30.134 | GET      "/api/tags"
2月 21 13:33:31 tc ollama[285585]: [GIN] 2025/02/21 - 13:33:31 | 200 |     560.001µs |  192.168.30.134 | GET      "/api/tags"
2月 21 13:33:31 tc ollama[285585]: [GIN] 2025/02/21 - 13:33:31 | 200 |     467.113µs |  192.168.30.134 | GET      "/api/tags"
2月 21 13:37:58 tc ollama[285585]: [GIN] 2025/02/21 - 13:37:58 | 200 |     761.399µs |  192.168.30.134 | GET      "/api/tags"

[Unit]
Description=Ollama Service
After=network-online.target

[Service]
Environment="OLLAMA_GPU=1"
Environment="OLLAMA_DEBUG=1"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
Environment="OLLAMA_KEEP_ALIVE=24h"
Environment="OLLAMA_ORIGIN="*""
Environment="OLLAMA_MODELS=/home/hide/.ollama/models"
Environment="OLLAMA_HOST=192.168.30.138:16006"
Environment="CUDA_VISIBLE_DEVICES=0"
Environment="OLLAMA_GPU_LAYERS=100"
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin"

[Install]
WantedBy=default.target
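The unit above sets several variables the server does not recognize. As the next comment notes, `OLLAMA_GPU_LAYERS` is not an ollama environment variable; `OLLAMA_GPU` likewise never appears in the server config map printed at startup, and `OLLAMA_ORIGIN` is a misspelling of `OLLAMA_ORIGINS` (its nested quotes would also confuse systemd's parser). A minimal cleaned-up drop-in, assuming the same host, model path, and keep-alive are wanted, might look like:

```
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
# Only variables that appear in the server's startup config map.
# OLLAMA_GPU and OLLAMA_GPU_LAYERS are not recognized, so they are dropped.
Environment="OLLAMA_DEBUG=1"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
Environment="OLLAMA_KEEP_ALIVE=24h"
# Plural, and without nested quotes:
Environment="OLLAMA_ORIGINS=*"
Environment="OLLAMA_MODELS=/home/hide/.ollama/models"
Environment="OLLAMA_HOST=192.168.30.138:16006"
Environment="CUDA_VISIBLE_DEVICES=0"
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin"
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3

[Install]
WantedBy=default.target
```

Note this only tidies the configuration; the CPU-only behaviour is diagnosed in the next comment as missing backend libraries, not as an effect of these variables.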

@rick-github commented on GitHub (Feb 21, 2025):

OLLAMA_GPU_LAYERS is not an ollama environment variable.

2月 21 13:33:11 tc ollama[285585]: time=2025-02-21T13:33:11.387+08:00 level=DEBUG source=ggml.go:89 msg="ggml backend load all from path" path=/usr/local/bin

ollama couldn't find any backends to load. What's the output of:

ls -l /usr/local/lib /usr/local/lib/ollama

How did you install ollama?
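
For reference, on a stock Linux install (`curl -fsSL https://ollama.com/install.sh | sh`) the backend libraries land under `/usr/local/lib/ollama`, which is why the `ls` above is the thing to check; the DEBUG line quoted here shows this runner searching only `/usr/local/bin`, next to the binary. A rough sketch of what to look for (file names are illustrative and vary by release):

```
# Expected layout from the official install script (names illustrative):
ls -l /usr/local/lib/ollama
#   cuda_v12/           CUDA backend libraries
#   libggml-base.so
#   libggml-cpu-*.so    CPU backends for different instruction sets

# If the directory is missing or empty, every model falls back to CPU.
# Re-running the official installer restores it:
curl -fsSL https://ollama.com/install.sh | sh

# Watch the service logs to confirm a GPU backend gets loaded:
journalctl -u ollama -f
```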

@Hsq12138 commented on GitHub (Feb 21, 2025):

> OLLAMA_GPU_LAYERS is not an ollama environment variable.
>
> 2月 21 13:33:11 tc ollama[285585]: time=2025-02-21T13:33:11.387+08:00 level=DEBUG source=ggml.go:89 msg="ggml backend load all from path" path=/usr/local/bin
>
> ollama couldn't find any backends to load. What's the output of:
>
> ls -l /usr/local/lib /usr/local/lib/ollama
>
> How did you install ollama?

2025/02/21 23:08:26 routes.go:1187: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:D:\ollama\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-02-21T23:08:26.801+08:00 level=INFO source=images.go:432 msg="total blobs: 2"
time=2025-02-21T23:08:26.802+08:00 level=INFO source=images.go:439 msg="total unused blobs removed: 0"
time=2025-02-21T23:08:26.802+08:00 level=INFO source=routes.go:1238 msg="Listening on 127.0.0.1:11434 (version 0.5.12-rc1)"
time=2025-02-21T23:08:26.802+08:00 level=DEBUG source=sched.go:106 msg="starting llm scheduler"
time=2025-02-21T23:08:26.803+08:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-02-21T23:08:26.803+08:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-02-21T23:08:26.803+08:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=6 efficiency=0 threads=12
time=2025-02-21T23:08:26.803+08:00 level=DEBUG source=gpu.go:98 msg="searching for GPU discovery libraries for NVIDIA"
time=2025-02-21T23:08:26.803+08:00 level=DEBUG source=gpu.go:501 msg="Searching for GPU library" name=nvml.dll
time=2025-02-21T23:08:26.803+08:00 level=DEBUG source=gpu.go:525 msg="gpu library search" globs="[C:\Users\zrway\AppData\Local\Programs\Ollama\lib\ollama\nvml.dll C:\Program Files\NVIDIA\CUDNN\v9.7\bin\nvml.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\bin\nvml.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\libnvvp\nvml.dll C:\Users\zrway\AppData\Local\Programs\Ollama\nvml.dll C:\Program Files\Common Files\Oracle\Java\javapath\nvml.dll C:\Windows\system32\nvml.dll C:\Windows\nvml.dll C:\Windows\System32\Wbem\nvml.dll C:\Windows\System32\WindowsPowerShell\v1.0\nvml.dll C:\Windows\System32\OpenSSH\nvml.dll C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\nvml.dll C:\Program Files\Bandizip\nvml.dll C:\Program Files\dotnet\nvml.dll C:\Program Files\Git\cmd\nvml.dll D:\ai\ffmpeg-2024-03-28-git-5d71f97e0e-full_build\bin\nvml.dll C:\Program Files\NVIDIA Corporation\NVIDIA app\NvDLISR\nvml.dll C:\Users\zrway\AppData\Local\Microsoft\WindowsApps\python.exe\nvml.dll C:\Users\zrway\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\Scripts\nvml.dll C:\Program Files\NVIDIA Corporation\Nsight Compute 2025.1.0\nvml.dll C:\Program Files\MySQL\MySQL Shell 8.0\bin\nvml.dll C:\Users\zrway\AppData\Local\Microsoft\WindowsApps\nvml.dll C:\Users\zrway\AppData\Local\Programs\Microsoft VS Code\bin\nvml.dll D:\ai\ffmpeg-2024-03-28-git-5d71f97e0e-essentials_build\bin\nvml.dll C:\Users\zrway\AppData\Local\Programs\Ollama\nvml.dll C:\Users\zrway\AppData\Local\Programs\Ollama\nvml.dll C:\Users\zrway\.lmstudio\bin\nvml.dll c:\Windows\System32\nvml.dll]"
time=2025-02-21T23:08:26.804+08:00 level=DEBUG source=gpu.go:529 msg="skipping PhysX cuda library path" path="C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\nvml.dll"
time=2025-02-21T23:08:26.805+08:00 level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths="[C:\Windows\system32\nvml.dll c:\Windows\System32\nvml.dll]"
time=2025-02-21T23:08:26.839+08:00 level=DEBUG source=gpu.go:111 msg="nvidia-ml loaded" library=C:\Windows\system32\nvml.dll
time=2025-02-21T23:08:26.839+08:00 level=DEBUG source=gpu.go:501 msg="Searching for GPU library" name=nvcuda.dll
time=2025-02-21T23:08:26.839+08:00 level=DEBUG source=gpu.go:525 msg="gpu library search" globs="[C:\Users\zrway\AppData\Local\Programs\Ollama\lib\ollama\nvcuda.dll C:\Program Files\NVIDIA\CUDNN\v9.7\bin\nvcuda.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\bin\nvcuda.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\libnvvp\nvcuda.dll C:\Users\zrway\AppData\Local\Programs\Ollama\nvcuda.dll C:\Program Files\Common Files\Oracle\Java\javapath\nvcuda.dll C:\Windows\system32\nvcuda.dll C:\Windows\nvcuda.dll C:\Windows\System32\Wbem\nvcuda.dll C:\Windows\System32\WindowsPowerShell\v1.0\nvcuda.dll C:\Windows\System32\OpenSSH\nvcuda.dll C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\nvcuda.dll C:\Program Files\Bandizip\nvcuda.dll C:\Program Files\dotnet\nvcuda.dll C:\Program Files\Git\cmd\nvcuda.dll D:\ai\ffmpeg-2024-03-28-git-5d71f97e0e-full_build\bin\nvcuda.dll C:\Program Files\NVIDIA Corporation\NVIDIA app\NvDLISR\nvcuda.dll C:\Users\zrway\AppData\Local\Microsoft\WindowsApps\python.exe\nvcuda.dll C:\Users\zrway\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\Scripts\nvcuda.dll C:\Program Files\NVIDIA Corporation\Nsight Compute 2025.1.0\nvcuda.dll C:\Program Files\MySQL\MySQL Shell 8.0\bin\nvcuda.dll C:\Users\zrway\AppData\Local\Microsoft\WindowsApps\nvcuda.dll C:\Users\zrway\AppData\Local\Programs\Microsoft VS Code\bin\nvcuda.dll D:\ai\ffmpeg-2024-03-28-git-5d71f97e0e-essentials_build\bin\nvcuda.dll C:\Users\zrway\AppData\Local\Programs\Ollama\nvcuda.dll C:\Users\zrway\AppData\Local\Programs\Ollama\nvcuda.dll C:\Users\zrway\.lmstudio\bin\nvcuda.dll c:\windows\system\nvcuda.dll]"
time=2025-02-21T23:08:26.840+08:00 level=DEBUG source=gpu.go:529 msg="skipping PhysX cuda library path" path="C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\nvcuda.dll"
time=2025-02-21T23:08:26.841+08:00 level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths=[C:\Windows\system32\nvcuda.dll]
initializing C:\Windows\system32\nvcuda.dll
dlsym: cuInit - 00007FF97B775F80
dlsym: cuDriverGetVersion - 00007FF97B776020
dlsym: cuDeviceGetCount - 00007FF97B776816
dlsym: cuDeviceGet - 00007FF97B776810
dlsym: cuDeviceGetAttribute - 00007FF97B776170
dlsym: cuDeviceGetUuid - 00007FF97B776822
dlsym: cuDeviceGetName - 00007FF97B77681C
dlsym: cuCtxCreate_v3 - 00007FF97B776894
dlsym: cuMemGetInfo_v2 - 00007FF97B776996
dlsym: cuCtxDestroy - 00007FF97B7768A6
calling cuInit
calling cuDriverGetVersion
raw version 0x2f30
CUDA driver version: 12.8
calling cuDeviceGetCount
device count 1
time=2025-02-21T23:08:26.893+08:00 level=DEBUG source=gpu.go:125 msg="detected GPUs" count=1 library=C:\Windows\system32\nvcuda.dll
[GPU-ecc0382b-7d7c-7b61-8572-a21b10ac9fcd] CUDA totalMem 16375 mb
[GPU-ecc0382b-7d7c-7b61-8572-a21b10ac9fcd] CUDA freeMem 15035 mb
[GPU-ecc0382b-7d7c-7b61-8572-a21b10ac9fcd] Compute Capability 8.9
time=2025-02-21T23:08:26.975+08:00 level=INFO source=gpu.go:319 msg="detected OS VRAM overhead" id=GPU-ecc0382b-7d7c-7b61-8572-a21b10ac9fcd library=cuda compute=8.9 driver=12.8 name="NVIDIA GeForce RTX 4080 SUPER" overhead="265.6 MiB"
time=2025-02-21T23:08:26.976+08:00 level=DEBUG source=amd_windows.go:34 msg="unable to load amdhip64_6.dll, please make sure to upgrade to the latest amd driver: The file cannot be accessed by the system."
releasing cuda driver library
releasing nvml library
time=2025-02-21T23:08:26.977+08:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-ecc0382b-7d7c-7b61-8572-a21b10ac9fcd library=cuda variant=v12 compute=8.9 driver=12.8 name="NVIDIA GeForce RTX 4080 SUPER" total="16.0 GiB" available="14.7 GiB"
[GIN] 2025/02/21 - 23:18:16 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/02/21 - 23:18:16 | 200 | 1.0301ms | 127.0.0.1 | GET "/api/tags"
[GIN] 2025/02/21 - 23:18:31 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/02/21 - 23:18:31 | 200 | 10.8992ms | 127.0.0.1 | POST "/api/show"
time=2025-02-21T23:18:31.919+08:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.8 GiB" before.free="87.3 GiB" before.free_swap="89.6 GiB" now.total="95.8 GiB" now.free="86.2 GiB" now.free_swap="87.3 GiB"
time=2025-02-21T23:18:31.928+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-ecc0382b-7d7c-7b61-8572-a21b10ac9fcd name="NVIDIA GeForce RTX 4080 SUPER" overhead="265.6 MiB" before.total="16.0 GiB" before.free="14.7 GiB" now.total="16.0 GiB" now.free="14.4 GiB" now.used="1.3 GiB"
releasing nvml library
time=2025-02-21T23:18:31.928+08:00 level=DEBUG source=sched.go:182 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=3 gpu_count=1
time=2025-02-21T23:18:31.946+08:00 level=DEBUG source=sched.go:225 msg="loading first model" model=D:\ollama\models\blobs\sha256-553aa261cfb6856c595c9fefdb5453b98fdef331bf2ca918a5e0a23aa254d022
time=2025-02-21T23:18:31.946+08:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[14.4 GiB]"
time=2025-02-21T23:18:31.946+08:00 level=WARN source=ggml.go:132 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-02-21T23:18:31.946+08:00 level=WARN source=ggml.go:132 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-02-21T23:18:31.947+08:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[14.4 GiB]"
time=2025-02-21T23:18:31.947+08:00 level=WARN source=ggml.go:132 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-02-21T23:18:31.947+08:00 level=WARN source=ggml.go:132 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-02-21T23:18:31.948+08:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[14.4 GiB]"
time=2025-02-21T23:18:31.948+08:00 level=WARN source=ggml.go:132 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-02-21T23:18:31.948+08:00 level=WARN source=ggml.go:132 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-02-21T23:18:31.948+08:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[14.4 GiB]"
time=2025-02-21T23:18:31.948+08:00 level=WARN source=ggml.go:132 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-02-21T23:18:31.948+08:00 level=WARN source=ggml.go:132 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-02-21T23:18:31.948+08:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.8 GiB" before.free="86.2 GiB" before.free_swap="87.3 GiB" now.total="95.8 GiB" now.free="86.2 GiB" now.free_swap="87.3 GiB"
time=2025-02-21T23:18:31.956+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-ecc0382b-7d7c-7b61-8572-a21b10ac9fcd name="NVIDIA GeForce RTX 4080 SUPER" overhead="265.6 MiB" before.total="16.0 GiB" before.free="14.4 GiB" now.total="16.0 GiB" now.free="14.4 GiB" now.used="1.3 GiB"
releasing nvml library
time=2025-02-21T23:18:31.956+08:00 level=INFO source=server.go:97 msg="system memory" total="95.8 GiB" free="86.2 GiB" free_swap="87.3 GiB"
time=2025-02-21T23:18:31.956+08:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[14.4 GiB]"
time=2025-02-21T23:18:31.956+08:00 level=WARN source=ggml.go:132 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-02-21T23:18:31.956+08:00 level=WARN source=ggml.go:132 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-02-21T23:18:31.957+08:00 level=INFO source=server.go:130 msg=offload library=cuda layers.requested=-1 layers.model=49 layers.offload=24 layers.split="" memory.available="[14.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="28.3 GiB" memory.required.partial="14.4 GiB" memory.required.kv="384.0 MiB" memory.required.allocations="[14.4 GiB]" memory.weights.total="25.0 GiB" memory.weights.repeating="23.5 GiB" memory.weights.nonrepeating="1.5 GiB" memory.graph.full="307.0 MiB" memory.graph.partial="916.1 MiB"
time=2025-02-21T23:18:31.957+08:00 level=DEBUG source=server.go:259 msg="compatible gpu libraries" compatible="[cuda_v12 cuda_v11]"
time=2025-02-21T23:18:31.963+08:00 level=DEBUG source=server.go:302 msg="adding gpu library" path=C:\Users\zrway\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12
time=2025-02-21T23:18:31.963+08:00 level=DEBUG source=server.go:310 msg="adding gpu dependency paths" paths=[C:\Users\zrway\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12]
time=2025-02-21T23:18:31.963+08:00 level=INFO source=server.go:380 msg="starting llama server" cmd="C:\Users\zrway\AppData\Local\Programs\Ollama\ollama.exe runner --model D:\ollama\models\blobs\sha256-553aa261cfb6856c595c9fefdb5453b98fdef331bf2ca918a5e0a23aa254d022 --ctx-size 2048 --batch-size 512 --n-gpu-layers 24 --verbose --threads 6 --no-mmap --parallel 1 --port 50841"
time=2025-02-21T23:18:31.963+08:00 level=DEBUG source=server.go:398 msg=subprocess environment="[CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8 CUDA_PATH_V12_8=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8 PATH=C:\Users\zrway\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12;C:\Program Files\NVIDIA\CUDNN\v9.7\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\libnvvp;;C:\Program Files\Common Files\Oracle\Java\javapath;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files\Bandizip\;C:\Program Files\dotnet\;C:\Program Files\Git\cmd;D:\ai\ffmpeg-2024-03-28-git-5d71f97e0e-full_build\bin;C:\Program Files\NVIDIA Corporation\NVIDIA app\NvDLISR;C:\Users\zrway\AppData\Local\Microsoft\WindowsApps\python.exe;C:\Users\zrway\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\Scripts;C:\Program Files\NVIDIA Corporation\Nsight Compute 2025.1.0\;C:\Program Files\MySQL\MySQL Shell 8.0\bin\;C:\Users\zrway\AppData\Local\Microsoft\WindowsApps;C:\Users\zrway\AppData\Local\Programs\Microsoft VS Code\bin;D:\ai\ffmpeg-2024-03-28-git-5d71f97e0e-essentials_build\bin;;C:\Users\zrway\AppData\Local\Programs\Ollama;C:\Users\zrway\.lmstudio\bin;C:\Users\zrway\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12;C:\Users\zrway\AppData\Local\Programs\Ollama\lib\ollama CUDA_VISIBLE_DEVICES=GPU-ecc0382b-7d7c-7b61-8572-a21b10ac9fcd]"
time=2025-02-21T23:18:32.020+08:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2025-02-21T23:18:32.020+08:00 level=INFO source=server.go:557 msg="waiting for llama runner to start responding"
time=2025-02-21T23:18:32.020+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server error"
time=2025-02-21T23:18:32.035+08:00 level=INFO source=runner.go:932 msg="starting go runner"
time=2025-02-21T23:18:32.041+08:00 level=DEBUG source=ggml.go:89 msg="ggml backend load all from path" path=C:\Users\zrway\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12
time=2025-02-21T23:18:32.051+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path="C:\Program Files\NVIDIA\CUDNN\v9.7\bin"
time=2025-02-21T23:18:32.051+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path="C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\bin"
time=2025-02-21T23:18:32.051+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path="C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\libnvvp"
time=2025-02-21T23:18:32.051+08:00 level=DEBUG source=ggml.go:89 msg="ggml backend load all from path" path=C:\Users\zrway\AppData\Local\Programs\Ollama
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path="C:\Program Files\Common Files\Oracle\Java\javapath"
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path=C:\Windows\system32
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path=C:\Windows
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path=C:\Windows\System32\Wbem
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path=C:\Windows\System32\WindowsPowerShell\v1.0
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path=C:\Windows\System32\OpenSSH
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path="C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common"
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path="C:\Program Files\Bandizip"
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path="C:\Program Files\dotnet"
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path="C:\Program Files\Git\cmd"
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path=D:\ai\ffmpeg-2024-03-28-git-5d71f97e0e-full_build\bin
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path="C:\Program Files\NVIDIA Corporation\NVIDIA app\NvDLISR"
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path=C:\Users\zrway\AppData\Local\Microsoft\WindowsApps\python.exe
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path=C:\Users\zrway\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\Scripts
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path="C:\Program Files\NVIDIA Corporation\Nsight Compute 2025.1.0"
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path="C:\Program Files\MySQL\MySQL Shell 8.0\bin"
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path=C:\Users\zrway\AppData\Local\Microsoft\WindowsApps
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path="C:\Users\zrway\AppData\Local\Programs\Microsoft VS Code\bin"
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path=D:\ai\ffmpeg-2024-03-28-git-5d71f97e0e-essentials_build\bin
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path=C:\Users\zrway\.lmstudio\bin
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:89 msg="ggml backend load all from path" path=C:\Users\zrway\AppData\Local\Programs\Ollama\lib\ollama
ggml_backend_load_best: failed to load C:\Users\zrway\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
ggml_backend_load_best: failed to load C:\Users\zrway\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll
ggml_backend_load_best: failed to load C:\Users\zrway\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
ggml_backend_load_best: failed to load C:\Users\zrway\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-sandybridge.dll
ggml_backend_load_best: failed to load C:\Users\zrway\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-skylakex.dll
time=2025-02-21T23:18:32.060+08:00 level=INFO source=runner.go:935 msg=system info="CPU : LLAMAFILE = 1 | CPU : LLAMAFILE = 1 | cgo(clang)" threads=6
time=2025-02-21T23:18:32.074+08:00 level=INFO source=runner.go:993 msg="Server listening on 127.0.0.1:50841"
llama_model_loader: loaded meta data with 25 key-value pairs and 579 tensors from D:\ollama\models\blobs\sha256-553aa261cfb6856c595c9fefdb5453b98fdef331bf2ca918a5e0a23aa254d022 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Checkpoint 887 Merged
llama_model_loader: - kv 3: general.size_label str = 15B
llama_model_loader: - kv 4: qwen2.block_count u32 = 48
llama_model_loader: - kv 5: qwen2.context_length u32 = 131072
llama_model_loader: - kv 6: qwen2.embedding_length u32 = 5120
llama_model_loader: - kv 7: qwen2.feed_forward_length u32 = 13824
llama_model_loader: - kv 8: qwen2.attention.head_count u32 = 40
llama_model_loader: - kv 9: qwen2.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 1
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.pre str = deepseek-r1-qwen
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 151646
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 151643
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 21: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 22: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 23: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 24: general.quantization_version u32 = 2
llama_model_loader: - type f32: 241 tensors
llama_model_loader: - type f16: 338 tensors
llm_load_vocab: missing or unrecognized pre-tokenizer type, using: 'default'
llm_load_vocab: control token: 151659 '<|fim_prefix|>' is not marked as EOG
llm_load_vocab: control token: 151656 '<|video_pad|>' is not marked as EOG
llm_load_vocab: control token: 151655 '<|image_pad|>' is not marked as EOG
llm_load_vocab: control token: 151653 '<|vision_end|>' is not marked as EOG
llm_load_vocab: control token: 151652 '<|vision_start|>' is not marked as EOG
llm_load_vocab: control token: 151651 '<|quad_end|>' is not marked as EOG
llm_load_vocab: control token: 151646 '<|begin▁of▁sentence|>' is not marked as EOG
llm_load_vocab: control token: 151644 '<|User|>' is not marked as EOG
llm_load_vocab: control token: 151661 '<|fim_suffix|>' is not marked as EOG
llm_load_vocab: control token: 151660 '<|fim_middle|>' is not marked as EOG
llm_load_vocab: control token: 151654 '<|vision_pad|>' is not marked as EOG
llm_load_vocab: control token: 151650 '<|quad_start|>' is not marked as EOG
llm_load_vocab: control token: 151647 '<|EOT|>' is not marked as EOG
llm_load_vocab: control token: 151643 '<|end▁of▁sentence|>' is not marked as EOG
llm_load_vocab: control token: 151645 '<|Assistant|>' is not marked as EOG
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 152064
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_layer = 48
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 5
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 13824
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 14B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 14.77 B
llm_load_print_meta: model size = 27.51 GiB (16.00 BPW)
llm_load_print_meta: general.name = Checkpoint 887 Merged
llm_load_print_meta: BOS token = 151646 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: EOT token = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: CPU model buffer size = 28173.21 MiB
load_all_data: no device found for buffer type CPU for async uploads
time=2025-02-21T23:18:32.271+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server loading model"
time=2025-02-21T23:18:33.022+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.05"
time=2025-02-21T23:18:33.523+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.11"
time=2025-02-21T23:18:33.773+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.13"
time=2025-02-21T23:18:34.023+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.15"
time=2025-02-21T23:18:34.274+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.17"
time=2025-02-21T23:18:34.524+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.18"
time=2025-02-21T23:18:34.774+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.20"
time=2025-02-21T23:18:35.024+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.22"
time=2025-02-21T23:18:35.275+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.24"
time=2025-02-21T23:18:35.525+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.26"
time=2025-02-21T23:18:35.775+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.28"
time=2025-02-21T23:18:36.025+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.30"
time=2025-02-21T23:18:36.275+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.32"
time=2025-02-21T23:18:36.526+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.34"
time=2025-02-21T23:18:36.777+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.36"
time=2025-02-21T23:18:37.028+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.38"
time=2025-02-21T23:18:37.278+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.40"
time=2025-02-21T23:18:37.528+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.42"
time=2025-02-21T23:18:37.778+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.44"
time=2025-02-21T23:18:38.028+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.45"
time=2025-02-21T23:18:38.279+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.47"
time=2025-02-21T23:18:38.530+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.50"
time=2025-02-21T23:18:38.780+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.52"
time=2025-02-21T23:18:39.030+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.54"
time=2025-02-21T23:18:39.280+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.56"
time=2025-02-21T23:18:39.531+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.58"
time=2025-02-21T23:18:39.781+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.59"
time=2025-02-21T23:18:40.032+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.61"
time=2025-02-21T23:18:40.282+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.63"
time=2025-02-21T23:18:40.532+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.65"
time=2025-02-21T23:18:40.783+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.67"
time=2025-02-21T23:18:41.033+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.69"
time=2025-02-21T23:18:41.283+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.71"
time=2025-02-21T23:18:41.534+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.72"
time=2025-02-21T23:18:41.784+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.75"
time=2025-02-21T23:18:42.035+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.77"
time=2025-02-21T23:18:42.285+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.79"
time=2025-02-21T23:18:42.535+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.80"
time=2025-02-21T23:18:42.786+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.82"
time=2025-02-21T23:18:43.036+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.85"
time=2025-02-21T23:18:43.286+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.86"
time=2025-02-21T23:18:43.536+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.88"
time=2025-02-21T23:18:43.787+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.91"
time=2025-02-21T23:18:44.037+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.93"
time=2025-02-21T23:18:44.288+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.95"
time=2025-02-21T23:18:44.538+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.97"
time=2025-02-21T23:18:44.788+08:00 level=DEBUG source=server.go:602 msg="model load progress 0.99"
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1
llama_kv_cache_init: layer 0: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 1: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 2: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 3: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 4: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 5: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 6: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 7: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 8: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 9: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 10: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 11: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 12: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 13: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 14: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 15: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 16: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 17: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 18: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 19: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 20: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 21: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 22: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 23: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 24: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 25: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 26: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 27: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 28: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 29: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 30: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 31: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 32: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 33: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 34: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 35: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 36: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 37: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 38: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 39: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 40: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 41: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 42: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 43: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 44: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 45: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 46: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 47: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: CPU KV buffer size = 384.00 MiB
llama_new_context_with_model: KV self size = 384.00 MiB, K (f16): 192.00 MiB, V (f16): 192.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.60 MiB
llama_new_context_with_model: CPU compute buffer size = 307.00 MiB
llama_new_context_with_model: graph nodes = 1686
llama_new_context_with_model: graph splits = 1
time=2025-02-21T23:18:45.039+08:00 level=INFO source=server.go:596 msg="llama runner started in 13.02 seconds"
time=2025-02-21T23:18:45.039+08:00 level=DEBUG source=sched.go:463 msg="finished setting up runner" model=D:\ollama\models\blobs\sha256-553aa261cfb6856c595c9fefdb5453b98fdef331bf2ca918a5e0a23aa254d022
[GIN] 2025/02/21 - 23:18:45 | 200 | 13.1287157s | 127.0.0.1 | POST "/api/generate"
time=2025-02-21T23:18:45.039+08:00 level=DEBUG source=sched.go:467 msg="context for request finished"
time=2025-02-21T23:18:45.039+08:00 level=DEBUG source=sched.go:340 msg="runner with non-zero duration has gone idle, adding timer" modelPath=D:\ollama\models\blobs\sha256-553aa261cfb6856c595c9fefdb5453b98fdef331bf2ca918a5e0a23aa254d022 duration=5m0s
time=2025-02-21T23:18:45.039+08:00 level=DEBUG source=sched.go:358 msg="after processing request finished event" modelPath=D:\ollama\models\blobs\sha256-553aa261cfb6856c595c9fefdb5453b98fdef331bf2ca918a5e0a23aa254d022 refCount=0
time=2025-02-21T23:18:54.108+08:00 level=DEBUG source=sched.go:576 msg="evaluating already loaded" model=D:\ollama\models\blobs\sha256-553aa261cfb6856c595c9fefdb5453b98fdef331bf2ca918a5e0a23aa254d022
time=2025-02-21T23:18:54.108+08:00 level=DEBUG source=routes.go:1462 msg="chat request" images=0 prompt=hi
time=2025-02-21T23:18:54.108+08:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=2 used=0 remaining=2

<!-- gh-comment-id:2674844067 --> @Hsq12138 commented on GitHub (Feb 21, 2025):

> `OLLAMA_GPU_LAYERS` is not an ollama environment variable.
>
> ```
> Feb 21 13:33:11 tc ollama[285585]: time=2025-02-21T13:33:11.387+08:00 level=DEBUG source=ggml.go:89 msg="ggml backend load all from path" path=/usr/local/bin
> ```
>
> ollama couldn't find any backends to load. What's the output of:
>
> ```
> ls -l /usr/local/lib /usr/local/lib/ollama
> ```
>
> How did you install ollama?

2025/02/21 23:08:26 routes.go:1187: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:D:\\ollama\\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-02-21T23:08:26.801+08:00 level=INFO source=images.go:432 msg="total blobs: 2"
time=2025-02-21T23:08:26.802+08:00 level=INFO source=images.go:439 msg="total unused blobs removed: 0"
time=2025-02-21T23:08:26.802+08:00 level=INFO source=routes.go:1238 msg="Listening on 127.0.0.1:11434 (version 0.5.12-rc1)"
time=2025-02-21T23:08:26.802+08:00 level=DEBUG source=sched.go:106 msg="starting llm scheduler"
time=2025-02-21T23:08:26.803+08:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-02-21T23:08:26.803+08:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-02-21T23:08:26.803+08:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=6 efficiency=0 threads=12
time=2025-02-21T23:08:26.803+08:00 level=DEBUG source=gpu.go:98 msg="searching for GPU discovery libraries for NVIDIA"
time=2025-02-21T23:08:26.803+08:00 level=DEBUG source=gpu.go:501 msg="Searching for GPU library" name=nvml.dll
time=2025-02-21T23:08:26.803+08:00 level=DEBUG source=gpu.go:525 msg="gpu library search" globs="[C:\\Users\\zrway\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\nvml.dll C:\\Program Files\\NVIDIA\\CUDNN\\v9.7\\bin\\nvml.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.8\\bin\\nvml.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.8\\libnvvp\\nvml.dll C:\\Users\\zrway\\AppData\\Local\\Programs\\Ollama\\nvml.dll C:\\Program Files\\Common Files\\Oracle\\Java\\javapath\\nvml.dll C:\\Windows\\system32\\nvml.dll C:\\Windows\\nvml.dll C:\\Windows\\System32\\Wbem\\nvml.dll C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\nvml.dll C:\\Windows\\System32\\OpenSSH\\nvml.dll C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvml.dll C:\\Program Files\\Bandizip\\nvml.dll C:\\Program Files\\dotnet\\nvml.dll C:\\Program Files\\Git\\cmd\\nvml.dll D:\\ai\\ffmpeg-2024-03-28-git-5d71f97e0e-full_build\\bin\\nvml.dll C:\\Program Files\\NVIDIA Corporation\\NVIDIA app\\NvDLISR\\nvml.dll C:\\Users\\zrway\\AppData\\Local\\Microsoft\\WindowsApps\\python.exe\\nvml.dll C:\\Users\\zrway\\AppData\\Local\\Packages\\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\\LocalCache\\local-packages\\Python311\\Scripts\\nvml.dll C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2025.1.0\\nvml.dll C:\\Program Files\\MySQL\\MySQL Shell 8.0\\bin\\nvml.dll C:\\Users\\zrway\\AppData\\Local\\Microsoft\\WindowsApps\\nvml.dll C:\\Users\\zrway\\AppData\\Local\\Programs\\Microsoft VS Code\\bin\\nvml.dll D:\\ai\\ffmpeg-2024-03-28-git-5d71f97e0e-essentials_build\\bin\\nvml.dll C:\\Users\\zrway\\AppData\\Local\\Programs\\Ollama\\nvml.dll C:\\Users\\zrway\\AppData\\Local\\Programs\\Ollama\\nvml.dll C:\\Users\\zrway\\.lmstudio\\bin\\nvml.dll c:\\Windows\\System32\\nvml.dll]"
time=2025-02-21T23:08:26.804+08:00 level=DEBUG source=gpu.go:529 msg="skipping PhysX cuda library path" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvml.dll"
time=2025-02-21T23:08:26.805+08:00 level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths="[C:\\Windows\\system32\\nvml.dll c:\\Windows\\System32\\nvml.dll]"
time=2025-02-21T23:08:26.839+08:00 level=DEBUG source=gpu.go:111 msg="nvidia-ml loaded" library=C:\Windows\system32\nvml.dll
time=2025-02-21T23:08:26.839+08:00 level=DEBUG source=gpu.go:501 msg="Searching for GPU library" name=nvcuda.dll
time=2025-02-21T23:08:26.839+08:00 level=DEBUG source=gpu.go:525 msg="gpu library search" globs="[C:\\Users\\zrway\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\nvcuda.dll C:\\Program Files\\NVIDIA\\CUDNN\\v9.7\\bin\\nvcuda.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.8\\bin\\nvcuda.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.8\\libnvvp\\nvcuda.dll C:\\Users\\zrway\\AppData\\Local\\Programs\\Ollama\\nvcuda.dll C:\\Program Files\\Common Files\\Oracle\\Java\\javapath\\nvcuda.dll C:\\Windows\\system32\\nvcuda.dll C:\\Windows\\nvcuda.dll C:\\Windows\\System32\\Wbem\\nvcuda.dll C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\nvcuda.dll C:\\Windows\\System32\\OpenSSH\\nvcuda.dll C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvcuda.dll C:\\Program Files\\Bandizip\\nvcuda.dll C:\\Program Files\\dotnet\\nvcuda.dll C:\\Program Files\\Git\\cmd\\nvcuda.dll D:\\ai\\ffmpeg-2024-03-28-git-5d71f97e0e-full_build\\bin\\nvcuda.dll C:\\Program Files\\NVIDIA Corporation\\NVIDIA app\\NvDLISR\\nvcuda.dll C:\\Users\\zrway\\AppData\\Local\\Microsoft\\WindowsApps\\python.exe\\nvcuda.dll C:\\Users\\zrway\\AppData\\Local\\Packages\\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\\LocalCache\\local-packages\\Python311\\Scripts\\nvcuda.dll C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2025.1.0\\nvcuda.dll C:\\Program Files\\MySQL\\MySQL Shell 8.0\\bin\\nvcuda.dll C:\\Users\\zrway\\AppData\\Local\\Microsoft\\WindowsApps\\nvcuda.dll C:\\Users\\zrway\\AppData\\Local\\Programs\\Microsoft VS Code\\bin\\nvcuda.dll D:\\ai\\ffmpeg-2024-03-28-git-5d71f97e0e-essentials_build\\bin\\nvcuda.dll C:\\Users\\zrway\\AppData\\Local\\Programs\\Ollama\\nvcuda.dll C:\\Users\\zrway\\AppData\\Local\\Programs\\Ollama\\nvcuda.dll C:\\Users\\zrway\\.lmstudio\\bin\\nvcuda.dll c:\\windows\\system*\\nvcuda.dll]"
time=2025-02-21T23:08:26.840+08:00 level=DEBUG source=gpu.go:529 msg="skipping PhysX cuda library path" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvcuda.dll"
time=2025-02-21T23:08:26.841+08:00 level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths=[C:\Windows\system32\nvcuda.dll]
initializing C:\Windows\system32\nvcuda.dll
dlsym: cuInit - 00007FF97B775F80
dlsym: cuDriverGetVersion - 00007FF97B776020
dlsym: cuDeviceGetCount - 00007FF97B776816
dlsym: cuDeviceGet - 00007FF97B776810
dlsym: cuDeviceGetAttribute - 00007FF97B776170
dlsym: cuDeviceGetUuid - 00007FF97B776822
dlsym: cuDeviceGetName - 00007FF97B77681C
dlsym: cuCtxCreate_v3 - 00007FF97B776894
dlsym: cuMemGetInfo_v2 - 00007FF97B776996
dlsym: cuCtxDestroy - 00007FF97B7768A6
calling cuInit
calling cuDriverGetVersion
raw version 0x2f30
CUDA driver version: 12.8
calling cuDeviceGetCount
device count 1
time=2025-02-21T23:08:26.893+08:00 level=DEBUG source=gpu.go:125 msg="detected GPUs" count=1 library=C:\Windows\system32\nvcuda.dll
[GPU-ecc0382b-7d7c-7b61-8572-a21b10ac9fcd] CUDA totalMem 16375 mb
[GPU-ecc0382b-7d7c-7b61-8572-a21b10ac9fcd] CUDA freeMem 15035 mb
[GPU-ecc0382b-7d7c-7b61-8572-a21b10ac9fcd] Compute Capability 8.9
time=2025-02-21T23:08:26.975+08:00 level=INFO source=gpu.go:319 msg="detected OS VRAM overhead" id=GPU-ecc0382b-7d7c-7b61-8572-a21b10ac9fcd library=cuda compute=8.9 driver=12.8 name="NVIDIA GeForce RTX 4080 SUPER" overhead="265.6 MiB"
time=2025-02-21T23:08:26.976+08:00 level=DEBUG source=amd_windows.go:34 msg="unable to load amdhip64_6.dll, please make sure to upgrade to the latest amd driver: The file cannot be accessed by the system."
releasing cuda driver library
releasing nvml library
time=2025-02-21T23:08:26.977+08:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-ecc0382b-7d7c-7b61-8572-a21b10ac9fcd library=cuda variant=v12 compute=8.9 driver=12.8 name="NVIDIA GeForce RTX 4080 SUPER" total="16.0 GiB" available="14.7 GiB"
[GIN] 2025/02/21 - 23:18:16 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/02/21 - 23:18:16 | 200 | 1.0301ms | 127.0.0.1 | GET "/api/tags"
[GIN] 2025/02/21 - 23:18:31 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/02/21 - 23:18:31 | 200 | 10.8992ms | 127.0.0.1 | POST "/api/show"
time=2025-02-21T23:18:31.919+08:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.8 GiB" before.free="87.3 GiB" before.free_swap="89.6 GiB" now.total="95.8 GiB" now.free="86.2 GiB" now.free_swap="87.3 GiB"
time=2025-02-21T23:18:31.928+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-ecc0382b-7d7c-7b61-8572-a21b10ac9fcd name="NVIDIA GeForce RTX 4080 SUPER" overhead="265.6 MiB" before.total="16.0 GiB" before.free="14.7 GiB" now.total="16.0 GiB" now.free="14.4 GiB" now.used="1.3 GiB"
releasing nvml library
time=2025-02-21T23:18:31.928+08:00 level=DEBUG source=sched.go:182 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=3 gpu_count=1
time=2025-02-21T23:18:31.946+08:00 level=DEBUG source=sched.go:225 msg="loading first model" model=D:\ollama\models\blobs\sha256-553aa261cfb6856c595c9fefdb5453b98fdef331bf2ca918a5e0a23aa254d022
time=2025-02-21T23:18:31.946+08:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[14.4 GiB]"
time=2025-02-21T23:18:31.946+08:00 level=WARN source=ggml.go:132 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-02-21T23:18:31.946+08:00 level=WARN source=ggml.go:132 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-02-21T23:18:31.947+08:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[14.4 GiB]"
time=2025-02-21T23:18:31.947+08:00 level=WARN source=ggml.go:132 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-02-21T23:18:31.947+08:00 level=WARN source=ggml.go:132 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-02-21T23:18:31.948+08:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[14.4 GiB]"
time=2025-02-21T23:18:31.948+08:00 level=WARN source=ggml.go:132 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-02-21T23:18:31.948+08:00 level=WARN source=ggml.go:132 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-02-21T23:18:31.948+08:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[14.4 GiB]"
time=2025-02-21T23:18:31.948+08:00 level=WARN source=ggml.go:132 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-02-21T23:18:31.948+08:00 level=WARN source=ggml.go:132 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-02-21T23:18:31.948+08:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.8 GiB" before.free="86.2 GiB" before.free_swap="87.3 GiB" now.total="95.8 GiB" now.free="86.2 GiB" now.free_swap="87.3 GiB"
time=2025-02-21T23:18:31.956+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-ecc0382b-7d7c-7b61-8572-a21b10ac9fcd name="NVIDIA GeForce RTX 4080 SUPER" overhead="265.6 MiB" before.total="16.0 GiB" before.free="14.4 GiB" now.total="16.0 GiB" now.free="14.4 GiB" now.used="1.3 GiB"
releasing nvml library
time=2025-02-21T23:18:31.956+08:00 level=INFO source=server.go:97 msg="system memory" total="95.8 GiB" free="86.2 GiB" free_swap="87.3 GiB"
time=2025-02-21T23:18:31.956+08:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[14.4 GiB]"
time=2025-02-21T23:18:31.956+08:00 level=WARN source=ggml.go:132 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-02-21T23:18:31.956+08:00 level=WARN source=ggml.go:132 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-02-21T23:18:31.957+08:00 level=INFO source=server.go:130 msg=offload library=cuda layers.requested=-1 layers.model=49 layers.offload=24 layers.split="" memory.available="[14.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="28.3 GiB" memory.required.partial="14.4 GiB" memory.required.kv="384.0 MiB" memory.required.allocations="[14.4 GiB]" memory.weights.total="25.0 GiB" memory.weights.repeating="23.5 GiB" memory.weights.nonrepeating="1.5 GiB" memory.graph.full="307.0 MiB" memory.graph.partial="916.1 MiB"
time=2025-02-21T23:18:31.957+08:00 level=DEBUG source=server.go:259 msg="compatible gpu libraries" compatible="[cuda_v12 cuda_v11]"
time=2025-02-21T23:18:31.963+08:00 level=DEBUG source=server.go:302 msg="adding gpu library" path=C:\Users\zrway\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12
time=2025-02-21T23:18:31.963+08:00 level=DEBUG source=server.go:310 msg="adding gpu dependency paths" paths=[C:\Users\zrway\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12]
time=2025-02-21T23:18:31.963+08:00 level=INFO source=server.go:380 msg="starting llama server" cmd="C:\\Users\\zrway\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model D:\\ollama\\models\\blobs\\sha256-553aa261cfb6856c595c9fefdb5453b98fdef331bf2ca918a5e0a23aa254d022 --ctx-size 2048 --batch-size 512 --n-gpu-layers 24 --verbose --threads 6 --no-mmap --parallel 1 --port 50841"
time=2025-02-21T23:18:31.963+08:00 level=DEBUG source=server.go:398 msg=subprocess environment="[CUDA_PATH=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.8 CUDA_PATH_V12_8=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.8 PATH=C:\\Users\\zrway\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v12;C:\\Program Files\\NVIDIA\\CUDNN\\v9.7\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.8\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.8\\libnvvp;;C:\\Program Files\\Common Files\\Oracle\\Java\\javapath;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\Bandizip\\;C:\\Program Files\\dotnet\\;C:\\Program Files\\Git\\cmd;D:\\ai\\ffmpeg-2024-03-28-git-5d71f97e0e-full_build\\bin;C:\\Program Files\\NVIDIA Corporation\\NVIDIA app\\NvDLISR;C:\\Users\\zrway\\AppData\\Local\\Microsoft\\WindowsApps\\python.exe;C:\\Users\\zrway\\AppData\\Local\\Packages\\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\\LocalCache\\local-packages\\Python311\\Scripts;C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2025.1.0\\;C:\\Program Files\\MySQL\\MySQL Shell 8.0\\bin\\;C:\\Users\\zrway\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\zrway\\AppData\\Local\\Programs\\Microsoft VS Code\\bin;D:\\ai\\ffmpeg-2024-03-28-git-5d71f97e0e-essentials_build\\bin;;C:\\Users\\zrway\\AppData\\Local\\Programs\\Ollama;C:\\Users\\zrway\\.lmstudio\\bin;C:\\Users\\zrway\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v12;C:\\Users\\zrway\\AppData\\Local\\Programs\\Ollama\\lib\\ollama CUDA_VISIBLE_DEVICES=GPU-ecc0382b-7d7c-7b61-8572-a21b10ac9fcd]"
time=2025-02-21T23:18:32.020+08:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2025-02-21T23:18:32.020+08:00 level=INFO source=server.go:557 msg="waiting for llama runner to start responding"
time=2025-02-21T23:18:32.020+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server error"
time=2025-02-21T23:18:32.035+08:00 level=INFO source=runner.go:932 msg="starting go runner"
time=2025-02-21T23:18:32.041+08:00 level=DEBUG source=ggml.go:89 msg="ggml backend load all from path" path=C:\Users\zrway\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12
time=2025-02-21T23:18:32.051+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path="C:\\Program Files\\NVIDIA\\CUDNN\\v9.7\\bin"
time=2025-02-21T23:18:32.051+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.8\\bin"
time=2025-02-21T23:18:32.051+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.8\\libnvvp"
time=2025-02-21T23:18:32.051+08:00 level=DEBUG source=ggml.go:89 msg="ggml backend load all from path" path=C:\Users\zrway\AppData\Local\Programs\Ollama
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path="C:\\Program Files\\Common Files\\Oracle\\Java\\javapath"
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path=C:\Windows\system32
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path=C:\Windows
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path=C:\Windows\System32\Wbem
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path=C:\Windows\System32\WindowsPowerShell\v1.0
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path=C:\Windows\System32\OpenSSH
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common"
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path="C:\\Program Files\\Bandizip"
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path="C:\\Program Files\\dotnet"
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path="C:\\Program Files\\Git\\cmd"
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path=D:\ai\ffmpeg-2024-03-28-git-5d71f97e0e-full_build\bin
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path="C:\\Program Files\\NVIDIA Corporation\\NVIDIA app\\NvDLISR"
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path=C:\Users\zrway\AppData\Local\Microsoft\WindowsApps\python.exe
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path=C:\Users\zrway\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\Scripts
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path="C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2025.1.0"
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path="C:\\Program Files\\MySQL\\MySQL Shell 8.0\\bin"
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path=C:\Users\zrway\AppData\Local\Microsoft\WindowsApps
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path="C:\\Users\\zrway\\AppData\\Local\\Programs\\Microsoft VS Code\\bin"
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path=D:\ai\ffmpeg-2024-03-28-git-5d71f97e0e-essentials_build\bin
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path=C:\Users\zrway\.lmstudio\bin
time=2025-02-21T23:18:32.052+08:00 level=DEBUG source=ggml.go:89 msg="ggml backend load all from path" path=C:\Users\zrway\AppData\Local\Programs\Ollama\lib\ollama
ggml_backend_load_best: failed to load C:\Users\zrway\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
ggml_backend_load_best: failed to load C:\Users\zrway\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll
ggml_backend_load_best: failed to load C:\Users\zrway\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
ggml_backend_load_best: failed to load C:\Users\zrway\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-sandybridge.dll
ggml_backend_load_best: failed to load C:\Users\zrway\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-skylakex.dll
time=2025-02-21T23:18:32.060+08:00 level=INFO source=runner.go:935 msg=system info="CPU : LLAMAFILE = 1 | CPU : LLAMAFILE = 1 | cgo(clang)" threads=6
time=2025-02-21T23:18:32.074+08:00 level=INFO source=runner.go:993 msg="Server listening on 127.0.0.1:50841"
llama_model_loader: loaded meta data with 25 key-value pairs and 579 tensors from D:\ollama\models\blobs\sha256-553aa261cfb6856c595c9fefdb5453b98fdef331bf2ca918a5e0a23aa254d022 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
[... the rest of this paste repeats the runner log shown above verbatim, from "llama_model_loader: - kv 1" through the final "loading cache slot" entry ...]
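Reading the log: the scheduler planned a partial offload (`layers.offload=24` of `layers.model=49`) and launched the runner with `--n-gpu-layers 24`, yet every buffer (model, KV cache, compute) was allocated on the CPU, and the runner's system info reports only `CPU : LLAMAFILE = 1`, so a CUDA backend appears never to have been loaded. Two quick checks are useful when reproducing this; a minimal sketch, assuming an ollama 0.5.x install and using `deepseek-r1:14b` as a placeholder for whatever model tag is actually installed:

```
# Show how the loaded model is split between CPU and GPU
# ("ollama ps" prints a PROCESSOR column such as "100% CPU" or "100% GPU")
ollama ps

# Request offload explicitly for one call via the num_gpu option,
# the supported per-request knob (there is no OLLAMA_GPU_LAYERS variable).
# 49 mirrors layers.model=49 from the log above.
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:14b",
  "prompt": "hi",
  "options": { "num_gpu": 49 }
}'
```

If `ollama ps` still reports `100% CPU` even with `num_gpu` forced, the failure is in backend loading rather than scheduling, consistent with the `ggml_backend_load_best: failed to load ... ggml-cpu-*.dll` lines above.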
Reference: github-starred/ollama#52535