[GH-ISSUE #10738] How should I run multiple models at the same time? #7052

Closed
opened 2026-04-12 18:58:06 -05:00 by GiteaMirror · 7 comments
Owner

Originally created by @NGC13009 on GitHub (May 16, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10738

What is the issue?

There are two environment variables related to concurrency:

OLLAMA_NUM_PARALLEL=1        # Number of requests each model serves simultaneously; the actual KV cache allocation is multiplied by this value
OLLAMA_MAX_LOADED_MODELS=3   # Maximum number of different models loaded at the same time
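
The KV cache comment can be sanity-checked against the "KV self size" numbers in the log below. A minimal sketch (the helper name is mine), assuming q8_0 packs each block of 32 values into 34 bytes:

```python
# Rough KV cache size, reproducing the "KV self size" lines in the log.
# Assumption: q8_0 stores each block of 32 values in 34 bytes (32 int8 + fp16 scale).

def kv_cache_mib(n_ctx, n_layer, n_embd_kv, bytes_per_value=34 / 32):
    """K plus V cache, in MiB, for one sequence at q8_0."""
    return 2 * n_ctx * n_layer * n_embd_kv * bytes_per_value / 1024**2

# qwen3:14b (from the log): n_layer=40, n_embd_k_gqa=1024, kv_size=32768
print(kv_cache_mib(32768, 40, 1024))   # 2720.0 -> matches "KV self size = 2720.00 MiB"

# qwen2.5-coder:1.5b (from the log): n_layer=28, n_embd_k_gqa=256, kv_size=2048
print(kv_cache_mib(2048, 28, 256))     # 29.75 -> matches "KV self size = 29.75 MiB"

# OLLAMA_NUM_PARALLEL multiplies the effective context, so the cache scales
# linearly: OLLAMA_NUM_PARALLEL=2 would roughly double these numbers.
```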

I want to run `qwen3:14b` and `qwen2.5-coder:1.5b` at the same time. According to my testing, this configuration does allow multiple models to run at once, and other model pairs can coexist. However, qwen3 and qwen2.5 cannot stay loaded simultaneously, even when there is sufficient video memory.

Why is this?
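
A minimal reproduction sketch, using the stock Ollama HTTP endpoints that also appear in the log below (the exact fields in the /api/ps response may vary by version):

```python
# Touch both models, then ask the server which ones are still resident.
import requests

BASE = "http://127.0.0.1:11434"

# Generate once with each model so the scheduler tries to load them side by side.
for model in ("qwen2.5-coder:1.5b", "qwen3:14b"):
    requests.post(
        f"{BASE}/api/generate",
        json={"model": model, "prompt": "hi", "stream": False},
        timeout=300,
    )

# /api/ps lists currently loaded models; if the small model was evicted,
# only qwen3:14b shows up here.
loaded = requests.get(f"{BASE}/api/ps", timeout=30).json()
for m in loaded.get("models", []):
    print(m["name"], m.get("size_vram"))
```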

Relevant log output

2025/05/16 23:38:08 routes.go:1233: INFO server config env="map[CUDA_VISIBLE_DEVICES:0 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:2562047h47m16.854775807s OLLAMA_KV_CACHE_TYPE:q8_0 OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:3 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:E:/LLM/ollama_models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[* http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-05-16T23:38:08.701+08:00 level=INFO source=images.go:463 msg="total blobs: 26"
time=2025-05-16T23:38:08.702+08:00 level=INFO source=images.go:470 msg="total unused blobs removed: 0"
time=2025-05-16T23:38:08.702+08:00 level=INFO source=routes.go:1300 msg="Listening on 127.0.0.1:11434 (version 0.6.8)"
time=2025-05-16T23:38:08.702+08:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-05-16T23:38:08.702+08:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-05-16T23:38:08.702+08:00 level=INFO source=gpu_windows.go:183 msg="efficiency cores detected" maxEfficiencyClass=1
time=2025-05-16T23:38:08.702+08:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=24 efficiency=16 threads=32
time=2025-05-16T23:38:08.963+08:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a library=cuda variant=v12 compute=8.6 driver=12.9 name="NVIDIA GeForce RTX 3080" total="20.0 GiB" available="18.8 GiB"
[GIN] 2025/05/16 - 23:38:12 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/05/16 - 23:38:12 | 200 |     17.4973ms |       127.0.0.1 | POST     "/api/show"
time=2025-05-16T23:38:12.802+08:00 level=INFO source=server.go:106 msg="system memory" total="63.7 GiB" free="44.4 GiB" free_swap="42.0 GiB"
time=2025-05-16T23:38:12.833+08:00 level=INFO source=server.go:139 msg=offload library=cuda layers.requested=999 layers.model=29 layers.offload=29 layers.split="" memory.available="[18.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="1.7 GiB" memory.required.partial="1.7 GiB" memory.required.kv="28.0 MiB" memory.required.allocations="[1.7 GiB]" memory.weights.total="934.7 MiB" memory.weights.repeating="752.1 MiB" memory.weights.nonrepeating="182.6 MiB" memory.graph.full="299.8 MiB" memory.graph.partial="482.3 MiB"
time=2025-05-16T23:38:12.833+08:00 level=INFO source=server.go:186 msg="enabling flash attention"
llama_model_loader: loaded meta data with 34 key-value pairs and 338 tensors from E:\LLM\ollama_models\blobs\sha256-29d8c98fa6b098e200069bfb88b9508dc3e85586d20cba59f8dda9a808165104 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2.5 Coder 1.5B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen2.5-Coder
llama_model_loader: - kv   5:                         general.size_label str              = 1.5B
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                       general.license.link str              = https://huggingface.co/Qwen/Qwen2.5-C...
llama_model_loader: - kv   8:                   general.base_model.count u32              = 1
llama_model_loader: - kv   9:                  general.base_model.0.name str              = Qwen2.5 Coder 1.5B
llama_model_loader: - kv  10:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  11:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2.5-C...
llama_model_loader: - kv  12:                               general.tags arr[str,6]       = ["code", "codeqwen", "chat", "qwen", ...
llama_model_loader: - kv  13:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  14:                          qwen2.block_count u32              = 28
llama_model_loader: - kv  15:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv  16:                     qwen2.embedding_length u32              = 1536
llama_model_loader: - kv  17:                  qwen2.feed_forward_length u32              = 8960
llama_model_loader: - kv  18:                 qwen2.attention.head_count u32              = 12
llama_model_loader: - kv  19:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv  20:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  21:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  22:                          general.file_type u32              = 15
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  26:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  31:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  33:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q4_K:  168 tensors
llama_model_loader: - type q6_K:   29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 934.69 MiB (5.08 BPW) 
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 1.54 B
print_info: general.name     = Qwen2.5 Coder 1.5B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-05-16T23:38:12.966+08:00 level=INFO source=server.go:410 msg="starting llama server" cmd="C:\\application\\ollama\\OLLAMA_FILE\\ollama.exe runner --model E:\\LLM\\ollama_models\\blobs\\sha256-29d8c98fa6b098e200069bfb88b9508dc3e85586d20cba59f8dda9a808165104 --ctx-size 2048 --batch-size 512 --n-gpu-layers 999 --threads 8 --flash-attn --kv-cache-type q8_0 --no-mmap --parallel 1 --port 53318"
time=2025-05-16T23:38:12.969+08:00 level=INFO source=sched.go:452 msg="loaded runners" count=1
time=2025-05-16T23:38:12.969+08:00 level=INFO source=server.go:589 msg="waiting for llama runner to start responding"
time=2025-05-16T23:38:12.969+08:00 level=INFO source=server.go:623 msg="waiting for server to become available" status="llm server error"
time=2025-05-16T23:38:12.989+08:00 level=INFO source=runner.go:853 msg="starting go runner"
load_backend: loaded CPU backend from C:\application\ollama\OLLAMA_FILE\lib\ollama\ggml-cpu-alderlake.dll
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from C:\application\ollama\OLLAMA_FILE\lib\ollama\cuda_v12\ggml-cuda.dll
time=2025-05-16T23:38:13.078+08:00 level=INFO source=ggml.go:103 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-05-16T23:38:13.078+08:00 level=INFO source=runner.go:913 msg="Server listening on 127.0.0.1:53318"
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3080) - 19273 MiB free
llama_model_loader: loaded meta data with 34 key-value pairs and 338 tensors from E:\LLM\ollama_models\blobs\sha256-29d8c98fa6b098e200069bfb88b9508dc3e85586d20cba59f8dda9a808165104 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2.5 Coder 1.5B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen2.5-Coder
llama_model_loader: - kv   5:                         general.size_label str              = 1.5B
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                       general.license.link str              = https://huggingface.co/Qwen/Qwen2.5-C...
llama_model_loader: - kv   8:                   general.base_model.count u32              = 1
llama_model_loader: - kv   9:                  general.base_model.0.name str              = Qwen2.5 Coder 1.5B
llama_model_loader: - kv  10:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  11:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2.5-C...
llama_model_loader: - kv  12:                               general.tags arr[str,6]       = ["code", "codeqwen", "chat", "qwen", ...
llama_model_loader: - kv  13:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  14:                          qwen2.block_count u32              = 28
llama_model_loader: - kv  15:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv  16:                     qwen2.embedding_length u32              = 1536
llama_model_loader: - kv  17:                  qwen2.feed_forward_length u32              = 8960
llama_model_loader: - kv  18:                 qwen2.attention.head_count u32              = 12
llama_model_loader: - kv  19:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv  20:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  21:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  22:                          general.file_type u32              = 15
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  26:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  31:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  33:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q4_K:  168 tensors
llama_model_loader: - type q6_K:   29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 934.69 MiB (5.08 BPW) 
time=2025-05-16T23:38:13.221+08:00 level=INFO source=server.go:623 msg="waiting for server to become available" status="llm server loading model"
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 1536
print_info: n_layer          = 28
print_info: n_head           = 12
print_info: n_head_kv        = 2
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 6
print_info: n_embd_k_gqa     = 256
print_info: n_embd_v_gqa     = 256
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 8960
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 1.5B
print_info: model params     = 1.54 B
print_info: general.name     = Qwen2.5 Coder 1.5B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors:          CPU model buffer size =   182.57 MiB
load_tensors:        CUDA0 model buffer size =   934.70 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 2048
llama_context: n_ctx_per_seq = 2048
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (2048) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.59 MiB
init: kv_size = 2048, offload = 1, type_k = 'q8_0', type_v = 'q8_0', n_layer = 28, can_shift = 1
init:      CUDA0 KV buffer size =    29.75 MiB
llama_context: KV self size  =   29.75 MiB, K (q8_0):   14.88 MiB, V (q8_0):   14.88 MiB
llama_context:      CUDA0 compute buffer size =   299.75 MiB
llama_context:  CUDA_Host compute buffer size =     7.01 MiB
llama_context: graph nodes  = 931
llama_context: graph splits = 2
time=2025-05-16T23:38:13.972+08:00 level=INFO source=server.go:628 msg="llama runner started in 1.00 seconds"
[GIN] 2025/05/16 - 23:38:13 | 200 |     1.323746s |       127.0.0.1 | POST     "/api/generate"
[GIN] 2025/05/16 - 23:38:18 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/05/16 - 23:38:18 | 200 |     31.2582ms |       127.0.0.1 | POST     "/api/show"
time=2025-05-16T23:38:18.923+08:00 level=INFO source=sched.go:517 msg="updated VRAM based on existing loaded models" gpu=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a library=cuda total="20.0 GiB" available="17.1 GiB"
time=2025-05-16T23:38:19.446+08:00 level=INFO source=server.go:106 msg="system memory" total="63.7 GiB" free="44.4 GiB" free_swap="42.0 GiB"
time=2025-05-16T23:38:19.478+08:00 level=INFO source=server.go:139 msg=offload library=cuda layers.requested=999 layers.model=41 layers.offload=41 layers.split="" memory.available="[18.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="13.5 GiB" memory.required.partial="13.5 GiB" memory.required.kv="2.5 GiB" memory.required.allocations="[13.5 GiB]" memory.weights.total="8.2 GiB" memory.weights.repeating="7.6 GiB" memory.weights.nonrepeating="608.6 MiB" memory.graph.full="2.1 GiB" memory.graph.partial="2.1 GiB"
time=2025-05-16T23:38:19.478+08:00 level=INFO source=server.go:186 msg="enabling flash attention"
llama_model_loader: loaded meta data with 27 key-value pairs and 443 tensors from E:\LLM\ollama_models\blobs\sha256-a8cc1361f3145dc01f6d77c6c82c9116b9ffe3c97b34716fe20418455876c40e (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 14B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3
llama_model_loader: - kv   4:                         general.size_label str              = 14B
llama_model_loader: - kv   5:                          qwen3.block_count u32              = 40
llama_model_loader: - kv   6:                       qwen3.context_length u32              = 40960
llama_model_loader: - kv   7:                     qwen3.embedding_length u32              = 5120
llama_model_loader: - kv   8:                  qwen3.feed_forward_length u32              = 17408
llama_model_loader: - kv   9:                 qwen3.attention.head_count u32              = 40
llama_model_loader: - kv  10:              qwen3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  11:                       qwen3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  12:     qwen3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  13:                 qwen3.attention.key_length u32              = 128
llama_model_loader: - kv  14:               qwen3.attention.value_length u32              = 128
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - kv  26:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type  f16:   40 tensors
llama_model_loader: - type q4_K:  221 tensors
llama_model_loader: - type q6_K:   21 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 8.63 GiB (5.02 BPW) 
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 14.77 B
print_info: general.name     = Qwen3 14B
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-05-16T23:38:19.605+08:00 level=INFO source=server.go:410 msg="starting llama server" cmd="C:\\application\\ollama\\OLLAMA_FILE\\ollama.exe runner --model E:\\LLM\\ollama_models\\blobs\\sha256-a8cc1361f3145dc01f6d77c6c82c9116b9ffe3c97b34716fe20418455876c40e --ctx-size 32768 --batch-size 512 --n-gpu-layers 999 --threads 8 --flash-attn --kv-cache-type q8_0 --no-mmap --parallel 1 --port 53341"
time=2025-05-16T23:38:19.608+08:00 level=INFO source=sched.go:452 msg="loaded runners" count=1
time=2025-05-16T23:38:19.608+08:00 level=INFO source=server.go:589 msg="waiting for llama runner to start responding"
time=2025-05-16T23:38:19.610+08:00 level=INFO source=server.go:623 msg="waiting for server to become available" status="llm server error"
time=2025-05-16T23:38:19.631+08:00 level=INFO source=runner.go:853 msg="starting go runner"
load_backend: loaded CPU backend from C:\application\ollama\OLLAMA_FILE\lib\ollama\ggml-cpu-alderlake.dll
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from C:\application\ollama\OLLAMA_FILE\lib\ollama\cuda_v12\ggml-cuda.dll
time=2025-05-16T23:38:19.718+08:00 level=INFO source=ggml.go:103 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-05-16T23:38:19.719+08:00 level=INFO source=runner.go:913 msg="Server listening on 127.0.0.1:53341"
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3080) - 19273 MiB free
llama_model_loader: loaded meta data with 27 key-value pairs and 443 tensors from E:\LLM\ollama_models\blobs\sha256-a8cc1361f3145dc01f6d77c6c82c9116b9ffe3c97b34716fe20418455876c40e (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 14B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3
llama_model_loader: - kv   4:                         general.size_label str              = 14B
llama_model_loader: - kv   5:                          qwen3.block_count u32              = 40
llama_model_loader: - kv   6:                       qwen3.context_length u32              = 40960
llama_model_loader: - kv   7:                     qwen3.embedding_length u32              = 5120
llama_model_loader: - kv   8:                  qwen3.feed_forward_length u32              = 17408
llama_model_loader: - kv   9:                 qwen3.attention.head_count u32              = 40
llama_model_loader: - kv  10:              qwen3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  11:                       qwen3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  12:     qwen3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  13:                 qwen3.attention.key_length u32              = 128
llama_model_loader: - kv  14:               qwen3.attention.value_length u32              = 128
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - kv  26:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type  f16:   40 tensors
llama_model_loader: - type q4_K:  221 tensors
llama_model_loader: - type q6_K:   21 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 8.63 GiB (5.02 BPW) 
time=2025-05-16T23:38:19.860+08:00 level=INFO source=server.go:623 msg="waiting for server to become available" status="llm server loading model"
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 40960
print_info: n_embd           = 5120
print_info: n_layer          = 40
print_info: n_head           = 40
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 5
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 17408
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 40960
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 14B
print_info: model params     = 14.77 B
print_info: general.name     = Qwen3 14B
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 40 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 41/41 layers to GPU
load_tensors:          CPU model buffer size =   417.30 MiB
load_tensors:        CUDA0 model buffer size =  8423.47 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 32768
llama_context: n_ctx_per_seq = 32768
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (32768) < n_ctx_train (40960) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.60 MiB
init: kv_size = 32768, offload = 1, type_k = 'q8_0', type_v = 'q8_0', n_layer = 40, can_shift = 1
init:      CUDA0 KV buffer size =  2720.00 MiB
llama_context: KV self size  = 2720.00 MiB, K (q8_0): 1360.00 MiB, V (q8_0): 1360.00 MiB
llama_context:      CUDA0 compute buffer size =   306.75 MiB
llama_context:  CUDA_Host compute buffer size =    74.01 MiB
llama_context: graph nodes  = 1367
llama_context: graph splits = 2
time=2025-05-16T23:38:24.372+08:00 level=INFO source=server.go:628 msg="llama runner started in 4.76 seconds"
[GIN] 2025/05/16 - 23:38:24 | 200 |    5.5190813s |       127.0.0.1 | POST     "/api/generate"
[GIN] 2025/05/16 - 23:38:27 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/05/16 - 23:38:27 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.6.8

(mmap = false) load_tensors: offloading 40 repeating layers to GPU load_tensors: offloading output layer to GPU load_tensors: offloaded 41/41 layers to GPU load_tensors: CPU model buffer size = 417.30 MiB load_tensors: CUDA0 model buffer size = 8423.47 MiB llama_context: constructing llama_context llama_context: n_seq_max = 1 llama_context: n_ctx = 32768 llama_context: n_ctx_per_seq = 32768 llama_context: n_batch = 512 llama_context: n_ubatch = 512 llama_context: causal_attn = 1 llama_context: flash_attn = 1 llama_context: freq_base = 1000000.0 llama_context: freq_scale = 1 llama_context: n_ctx_per_seq (32768) < n_ctx_train (40960) -- the full capacity of the model will not be utilized llama_context: CUDA_Host output buffer size = 0.60 MiB init: kv_size = 32768, offload = 1, type_k = 'q8_0', type_v = 'q8_0', n_layer = 40, can_shift = 1 init: CUDA0 KV buffer size = 2720.00 MiB llama_context: KV self size = 2720.00 MiB, K (q8_0): 1360.00 MiB, V (q8_0): 1360.00 MiB llama_context: CUDA0 compute buffer size = 306.75 MiB llama_context: CUDA_Host compute buffer size = 74.01 MiB llama_context: graph nodes = 1367 llama_context: graph splits = 2 time=2025-05-16T23:38:24.372+08:00 level=INFO source=server.go:628 msg="llama runner started in 4.76 seconds" [GIN] 2025/05/16 - 23:38:24 | 200 | 5.5190813s | 127.0.0.1 | POST "/api/generate" [GIN] 2025/05/16 - 23:38:27 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2025/05/16 - 23:38:27 | 200 | 0s | 127.0.0.1 | GET "/api/ps" ``` ### OS Windows ### GPU Nvidia ### CPU Intel ### Ollama version 0.6.8
GiteaMirror added the bug label 2026-04-12 18:58:06 -05:00

@rick-github commented on GitHub (May 16, 2025):

Add `OLLAMA_DEBUG=1` to the server environment to show scheduling decisions.
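
One way to do this on Windows (an example only; any method that puts the variable in the server's environment works, such as the system environment variable settings):

```
# PowerShell: enable debug logging for this session, then start the server
$env:OLLAMA_DEBUG = "1"
ollama serve
```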


@NGC13009 commented on GitHub (May 16, 2025):

I just ran several rounds of testing and found that when the context length is configured through environment variables, there is no issue.

But if I use a Modelfile to define a custom model with its own context length, other models get stopped whenever that model runs. (I also set num_gpu manually.)

PARAMETER num_gpu 999
PARAMETER num_ctx 16384

So I think this is a bug with custom models: specifying the context length (or perhaps num_gpu) in the model's configuration file can trigger this issue. A sketch of the setup follows below.
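
For reference, the custom model was built roughly like this (a sketch: the PARAMETER lines are quoted above, and the tag qwen3:14b-16k appears in the scheduler log below, but the exact FROM line is an assumption):

```
# Modelfile -- sketch; FROM is assumed, PARAMETER lines as quoted above
FROM qwen3:14b
PARAMETER num_gpu 999
PARAMETER num_ctx 16384
```

Created with something like `ollama create qwen3:14b-16k -f Modelfile` and then run as usual with `ollama run qwen3:14b-16k`.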

Here is the Ollama log with debug mode enabled:

[app time]	2025 - 05 - 17 	00:03:03	--- 


[app info]	ENV: {'ALLUSERSPROFILE': 'C:\\ProgramData', 'APPDATA': 'C:\\Users\\ngc13\\AppData\\Roaming', 'CHOCOLATEYINSTALL': 'C:\\ProgramData\\chocolatey', 'COMMONPROGRAMFILES': 'C:\\Program Files\\Common Files', 'COMMONPROGRAMFILES(X86)': 'C:\\Program Files (x86)\\Common Files', 'COMMONPROGRAMW6432': 'C:\\Program Files\\Common Files', 'COMPUTERNAME': 'LAPTOP-AYJL9', 'COMSPEC': 'C:\\WINDOWS\\system32\\cmd.exe', 'CUDA_HOME': 'C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1', 'CUDA_PATH': 'C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1', 'CUDA_PATH_V12_1': 'C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1', 'C_INCLUDEDE_PATH': 'C:\\msys2\\mingw64\\include', 'DRIVERDATA': 'C:\\Windows\\System32\\Drivers\\DriverData', 'EFC_28168_2283032206': '1', 'EFC_28168_2775293581': '1', 'EFC_28168_3789132940': '1', 'FPS_BROWSER_APP_PROFILE_STRING': 'Internet Explorer', 'FPS_BROWSER_USER_PROFILE_STRING': 'Default', 'GOPATH': 'C:\\Users\\ngc13\\go', 'HOMEDRIVE': 'C:', 'HOMEPATH': '\\Users\\ngc13', 'INCLUDE': 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2022\\BuildTools\\VC\\Tools\\MSVC\\14.36.32532\\include;C:\\Program Files (x86)\\Windows Kits\\10\\Include\\10.0.22000.0\\shared;C:\\Program Files (x86)\\Windows Kits\\10\\Include\\10.0.22000.0\\ucrt;C:\\Program Files (x86)\\Windows Kits\\10\\Include\\10.0.22000.0\\um;C:\\Program Files (x86)\\Windows Kits\\10\\Include\\10.0.22000.0\\winrt;', 'LIB': 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2022\\BuildTools\\VC\\Tools\\MSVC\\14.36.32532\\lib\\x64;C:\\Program Files (x86)\\Windows Kits\\10\\Include\\10.0.22000.0\\ucrt\\x64;C:\\Program Files (x86)\\Windows Kits\\10\\Include\\10.0.22000.0\\um\\x64;C:\\Program Files (x86)\\Windows Kits\\10\\Lib\\10.0.22621.0\\um\\x64;C:\\Program Files (x86)\\Windows Kits\\10\\Lib\\10.0.22621.0\\ucrt\\x64;', 'LIBRARY_PATH': 'C:\\msys2\\mingw64\\lib', 'LOCALAPPDATA': 'C:\\Users\\ngc13\\AppData\\Local', 'LOGONSERVER': '\\\\LAPTOP-AYJL9', 'NUMBER_OF_PROCESSORS': '32', 'ONEDRIVE': 'C:\\Users\\ngc13\\OneDrive', 'ONLINESERVICES': 'Online Services', 'OS': 'Windows_NT', 'PATH': 'C:\\Program Files (x86)\\VMware\\VMware Workstation\\bin\\;C:\\Qt\\6.8.1\\mingw_64\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\libnvvp;C:\\Program Files\\Common Files\\Oracle\\Java\\javapath;C:\\Program Files (x86)\\Common Files\\Oracle\\Java\\java8path;C:\\Program Files (x86)\\Common Files\\Oracle\\Java\\javapath;C:\\windows\\system32;C:\\windows;C:\\windows\\System32\\Wbem;C:\\windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\windows\\System32\\OpenSSH\\;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Users\\Administrator\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Program Files\\HP\\OMEN-Broadcast\\Common;C:\\Program Files\\Microsoft VS Code\\bin;C:\\Program Files\\MATLAB\\R2024b\\runtime\\win64;C:\\Program Files\\MATLAB\\R2024b\\bin;C:\\ffmpeg\\bin;C:\\Program Files (x86)\\Wolfram Research\\WolframScript\\;C:\\Program Files\\dotnet\\;C:\\msys2;C:\\texlive\\2023\\bin\\windows;C:\\Program Files (x86)\\GnuPG\\bin;C:\\msys2\\mingw64\\bin;C:\\miniconda;C:\\miniconda\\Scripts;C:\\miniconda\\Library\\bin;C:\\Program Files (x86)\\Microsoft Visual Studio\\2022\\BuildTools\\VC\\Tools\\MSVC\\14.36.32532\\bin\\Hostx64\\x64;C:\\Program Files (x86)\\MATLAB\\MATLAB Runtime\\v851\\runtime\\win32;C:\\Program Files\\Git\\cmd;C:\\application\\syspath;C:\\Program Files\\nodejs\\;C:\\ProgramData\\chocolatey\\bin;C:\\Program 
Files\\NVIDIA Corporation\\Nsight Compute 2025.1.0\\;C:\\Program Files\\NVIDIA Corporation\\Nsight Systems 2025.1.1\\target-windows-x64;C:\\WINDOWS\\system32;C:\\WINDOWS;C:\\WINDOWS\\System32\\Wbem;C:\\WINDOWS\\System32\\WindowsPowerShell\\v1.0\\;C:\\WINDOWS\\System32\\OpenSSH\\;C:\\application\\ollama\\OLLAMA_FILE;C:\\Program Files\\Go\\bin;C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR;C:\\Users\\ngc13\\scoop\\shims;C:\\Users\\ngc13\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\ngc13\\AppData\\Local\\Programs\\Ollama;C:\\Users\\ngc13\\AppData\\Roaming\\npm;C:\\Users\\ngc13\\go\\bin', 'PATHEXT': '.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC', 'PLATFORMCODE': 'M7', 'PROCESSOR_ARCHITECTURE': 'AMD64', 'PROCESSOR_IDENTIFIER': 'Intel64 Family 6 Model 183 Stepping 1, GenuineIntel', 'PROCESSOR_LEVEL': '6', 'PROCESSOR_REVISION': 'b701', 'PROGRAMDATA': 'C:\\ProgramData', 'PROGRAMFILES': 'C:\\Program Files', 'PROGRAMFILES(X86)': 'C:\\Program Files (x86)', 'PROGRAMW6432': 'C:\\Program Files', 'PSMODULEPATH': '%ProgramFiles%\\WindowsPowerShell\\Modules;C:\\WINDOWS\\system32\\WindowsPowerShell\\v1.0\\Modules', 'PUBLIC': 'C:\\Users\\Public', 'REGIONCODE': 'APJ', 'SESSIONNAME': 'Console', 'SYSTEMDRIVE': 'C:', 'SYSTEMROOT': 'C:\\WINDOWS', 'TEMP': 'C:\\Users\\ngc13\\AppData\\Local\\Temp', 'TMP': 'C:\\Users\\ngc13\\AppData\\Local\\Temp', 'USERDOMAIN': 'LAPTOP-AYJL9', 'USERDOMAIN_ROAMINGPROFILE': 'LAPTOP-AYJL9', 'USERNAME': 'ngc13', 'USERPROFILE': 'C:\\Users\\ngc13', 'WINDIR': 'C:\\WINDOWS', 'ZES_ENABLE_SYSMAN': '1', '_PYI_ARCHIVE_FILE': 'E:\\project_file\\limitless\\ollama-launcher\\dist\\ollama_launcher\\ollama_launcher.exe', '_PYI_PARENT_PROCESS_LEVEL': '1', '__COMPAT_LAYER': 'DetectorsAppHealth', 'TCL_LIBRARY': 'E:\\project_file\\limitless\\ollama-launcher\\dist\\ollama_launcher\\_internal\\_tcl_data', 'TK_LIBRARY': 'E:\\project_file\\limitless\\ollama-launcher\\dist\\ollama_launcher\\_internal\\_tk_data', 'OLLAMA_MODELS': 'E:/LLM/ollama_models', 'OLLAMA_TMPDIR': 'E:/LLM/ollama_models/temp', 'OLLAMA_HOST': '127.0.0.1:11434', 'OLLAMA_ORIGINS': '*', 'OLLAMA_CONTEXT_LENGTH': '32768', 'OLLAMA_KV_CACHE_TYPE': 'q8_0', 'OLLAMA_KEEP_ALIVE': '-1', 'OLLAMA_MAX_QUEUE': '512', 'OLLAMA_NUM_PARALLEL': '1', 'OLLAMA_MAX_LOADED_MODELS': '3', 'OLLAMA_ENABLE_CUDA': '1', 'CUDA_VISIBLE_DEVICES': '0', 'OLLAMA_FLASH_ATTENTION': '1', 'OLLAMA_USE_MLOCK': '1', 'OLLAMA_MULTIUSER_CACHE': '0', 'OLLAMA_INTEL_GPU': '0', 'OLLAMA_DEBUG': '1'}
[app info]	ollama_dir: C:/application/ollama/OLLAMA_FILE
[app info]	Starting Ollama Server...
[app info]	Status: Ollama server running (PID: 29928)
[app time]	2025 - 05 - 17 	00:03:04	--- ollama server started.
2025/05/17 00:03:04 routes.go:1233: INFO server config env="map[CUDA_VISIBLE_DEVICES:0 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:32768 OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:2562047h47m16.854775807s OLLAMA_KV_CACHE_TYPE:q8_0 OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:3 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:E:/LLM/ollama_models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[* http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-05-17T00:03:04.967+08:00 level=INFO source=images.go:463 msg="total blobs: 27"
time=2025-05-17T00:03:04.968+08:00 level=INFO source=images.go:470 msg="total unused blobs removed: 0"
time=2025-05-17T00:03:04.968+08:00 level=INFO source=routes.go:1300 msg="Listening on 127.0.0.1:11434 (version 0.6.8)"
time=2025-05-17T00:03:04.968+08:00 level=DEBUG source=sched.go:107 msg="starting llm scheduler"
time=2025-05-17T00:03:04.968+08:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-05-17T00:03:04.968+08:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-05-17T00:03:04.968+08:00 level=INFO source=gpu_windows.go:183 msg="efficiency cores detected" maxEfficiencyClass=1
time=2025-05-17T00:03:04.968+08:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=24 efficiency=16 threads=32
time=2025-05-17T00:03:04.968+08:00 level=DEBUG source=gpu.go:98 msg="searching for GPU discovery libraries for NVIDIA"
time=2025-05-17T00:03:04.968+08:00 level=DEBUG source=gpu.go:501 msg="Searching for GPU library" name=nvml.dll
time=2025-05-17T00:03:04.968+08:00 level=DEBUG source=gpu.go:525 msg="gpu library search" globs="[C:\\application\\ollama\\OLLAMA_FILE\\lib\\ollama\\nvml.dll C:\\Program Files (x86)\\VMware\\VMware Workstation\\bin\\nvml.dll C:\\Qt\\6.8.1\\mingw_64\\bin\\nvml.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\bin\\nvml.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\libnvvp\\nvml.dll C:\\Program Files\\Common Files\\Oracle\\Java\\javapath\\nvml.dll C:\\Program Files (x86)\\Common Files\\Oracle\\Java\\java8path\\nvml.dll C:\\Program Files (x86)\\Common Files\\Oracle\\Java\\javapath\\nvml.dll C:\\windows\\system32\\nvml.dll C:\\windows\\nvml.dll C:\\windows\\System32\\Wbem\\nvml.dll C:\\windows\\System32\\WindowsPowerShell\\v1.0\\nvml.dll C:\\windows\\System32\\OpenSSH\\nvml.dll C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvml.dll C:\\Users\\Administrator\\AppData\\Local\\Microsoft\\WindowsApps\\nvml.dll C:\\Program Files\\HP\\OMEN-Broadcast\\Common\\nvml.dll C:\\Program Files\\Microsoft VS Code\\bin\\nvml.dll C:\\Program Files\\MATLAB\\R2024b\\runtime\\win64\\nvml.dll C:\\Program Files\\MATLAB\\R2024b\\bin\\nvml.dll C:\\ffmpeg\\bin\\nvml.dll C:\\Program Files (x86)\\Wolfram Research\\WolframScript\\nvml.dll C:\\Program Files\\dotnet\\nvml.dll C:\\msys2\\nvml.dll C:\\texlive\\2023\\bin\\windows\\nvml.dll C:\\Program Files (x86)\\GnuPG\\bin\\nvml.dll C:\\msys2\\mingw64\\bin\\nvml.dll C:\\miniconda\\nvml.dll C:\\miniconda\\Scripts\\nvml.dll C:\\miniconda\\Library\\bin\\nvml.dll C:\\Program Files (x86)\\Microsoft Visual Studio\\2022\\BuildTools\\VC\\Tools\\MSVC\\14.36.32532\\bin\\Hostx64\\x64\\nvml.dll C:\\Program Files (x86)\\MATLAB\\MATLAB Runtime\\v851\\runtime\\win32\\nvml.dll C:\\Program Files\\Git\\cmd\\nvml.dll C:\\application\\syspath\\nvml.dll C:\\Program Files\\nodejs\\nvml.dll C:\\ProgramData\\chocolatey\\bin\\nvml.dll C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2025.1.0\\nvml.dll C:\\Program Files\\NVIDIA Corporation\\Nsight Systems 2025.1.1\\target-windows-x64\\nvml.dll C:\\WINDOWS\\system32\\nvml.dll C:\\WINDOWS\\nvml.dll C:\\WINDOWS\\System32\\Wbem\\nvml.dll C:\\WINDOWS\\System32\\WindowsPowerShell\\v1.0\\nvml.dll C:\\WINDOWS\\System32\\OpenSSH\\nvml.dll C:\\application\\ollama\\OLLAMA_FILE\\nvml.dll C:\\Program Files\\Go\\bin\\nvml.dll C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR\\nvml.dll C:\\Users\\ngc13\\scoop\\shims\\nvml.dll C:\\Users\\ngc13\\AppData\\Local\\Microsoft\\WindowsApps\\nvml.dll C:\\Users\\ngc13\\AppData\\Local\\Programs\\Ollama\\nvml.dll C:\\Users\\ngc13\\AppData\\Roaming\\npm\\nvml.dll C:\\Users\\ngc13\\go\\bin\\nvml.dll c:\\Windows\\System32\\nvml.dll]"
time=2025-05-17T00:03:04.968+08:00 level=DEBUG source=gpu.go:529 msg="skipping PhysX cuda library path" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvml.dll"
time=2025-05-17T00:03:04.969+08:00 level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths="[C:\\windows\\system32\\nvml.dll C:\\WINDOWS\\system32\\nvml.dll c:\\Windows\\System32\\nvml.dll]"
time=2025-05-17T00:03:04.984+08:00 level=DEBUG source=gpu.go:111 msg="nvidia-ml loaded" library=C:\windows\system32\nvml.dll
time=2025-05-17T00:03:04.984+08:00 level=DEBUG source=gpu.go:501 msg="Searching for GPU library" name=nvcuda.dll
time=2025-05-17T00:03:04.984+08:00 level=DEBUG source=gpu.go:525 msg="gpu library search" globs="[C:\\application\\ollama\\OLLAMA_FILE\\lib\\ollama\\nvcuda.dll C:\\Program Files (x86)\\VMware\\VMware Workstation\\bin\\nvcuda.dll C:\\Qt\\6.8.1\\mingw_64\\bin\\nvcuda.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\bin\\nvcuda.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\libnvvp\\nvcuda.dll C:\\Program Files\\Common Files\\Oracle\\Java\\javapath\\nvcuda.dll C:\\Program Files (x86)\\Common Files\\Oracle\\Java\\java8path\\nvcuda.dll C:\\Program Files (x86)\\Common Files\\Oracle\\Java\\javapath\\nvcuda.dll C:\\windows\\system32\\nvcuda.dll C:\\windows\\nvcuda.dll C:\\windows\\System32\\Wbem\\nvcuda.dll C:\\windows\\System32\\WindowsPowerShell\\v1.0\\nvcuda.dll C:\\windows\\System32\\OpenSSH\\nvcuda.dll C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvcuda.dll C:\\Users\\Administrator\\AppData\\Local\\Microsoft\\WindowsApps\\nvcuda.dll C:\\Program Files\\HP\\OMEN-Broadcast\\Common\\nvcuda.dll C:\\Program Files\\Microsoft VS Code\\bin\\nvcuda.dll C:\\Program Files\\MATLAB\\R2024b\\runtime\\win64\\nvcuda.dll C:\\Program Files\\MATLAB\\R2024b\\bin\\nvcuda.dll C:\\ffmpeg\\bin\\nvcuda.dll C:\\Program Files (x86)\\Wolfram Research\\WolframScript\\nvcuda.dll C:\\Program Files\\dotnet\\nvcuda.dll C:\\msys2\\nvcuda.dll C:\\texlive\\2023\\bin\\windows\\nvcuda.dll C:\\Program Files (x86)\\GnuPG\\bin\\nvcuda.dll C:\\msys2\\mingw64\\bin\\nvcuda.dll C:\\miniconda\\nvcuda.dll C:\\miniconda\\Scripts\\nvcuda.dll C:\\miniconda\\Library\\bin\\nvcuda.dll C:\\Program Files (x86)\\Microsoft Visual Studio\\2022\\BuildTools\\VC\\Tools\\MSVC\\14.36.32532\\bin\\Hostx64\\x64\\nvcuda.dll C:\\Program Files (x86)\\MATLAB\\MATLAB Runtime\\v851\\runtime\\win32\\nvcuda.dll C:\\Program Files\\Git\\cmd\\nvcuda.dll C:\\application\\syspath\\nvcuda.dll C:\\Program Files\\nodejs\\nvcuda.dll C:\\ProgramData\\chocolatey\\bin\\nvcuda.dll C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2025.1.0\\nvcuda.dll C:\\Program Files\\NVIDIA Corporation\\Nsight Systems 2025.1.1\\target-windows-x64\\nvcuda.dll C:\\WINDOWS\\system32\\nvcuda.dll C:\\WINDOWS\\nvcuda.dll C:\\WINDOWS\\System32\\Wbem\\nvcuda.dll C:\\WINDOWS\\System32\\WindowsPowerShell\\v1.0\\nvcuda.dll C:\\WINDOWS\\System32\\OpenSSH\\nvcuda.dll C:\\application\\ollama\\OLLAMA_FILE\\nvcuda.dll C:\\Program Files\\Go\\bin\\nvcuda.dll C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR\\nvcuda.dll C:\\Users\\ngc13\\scoop\\shims\\nvcuda.dll C:\\Users\\ngc13\\AppData\\Local\\Microsoft\\WindowsApps\\nvcuda.dll C:\\Users\\ngc13\\AppData\\Local\\Programs\\Ollama\\nvcuda.dll C:\\Users\\ngc13\\AppData\\Roaming\\npm\\nvcuda.dll C:\\Users\\ngc13\\go\\bin\\nvcuda.dll c:\\windows\\system*\\nvcuda.dll]"
time=2025-05-17T00:03:04.984+08:00 level=DEBUG source=gpu.go:529 msg="skipping PhysX cuda library path" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvcuda.dll"
time=2025-05-17T00:03:04.985+08:00 level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths="[C:\\windows\\system32\\nvcuda.dll C:\\WINDOWS\\system32\\nvcuda.dll]"
initializing C:\windows\system32\nvcuda.dll
dlsym: cuInit - 00007FFFE76B1F80
dlsym: cuDriverGetVersion - 00007FFFE76B2020
dlsym: cuDeviceGetCount - 00007FFFE76B2816
dlsym: cuDeviceGet - 00007FFFE76B2810
dlsym: cuDeviceGetAttribute - 00007FFFE76B2170
dlsym: cuDeviceGetUuid - 00007FFFE76B2822
dlsym: cuDeviceGetName - 00007FFFE76B281C
dlsym: cuCtxCreate_v3 - 00007FFFE76B2894
dlsym: cuMemGetInfo_v2 - 00007FFFE76B2996
dlsym: cuCtxDestroy - 00007FFFE76B28A6
calling cuInit
calling cuDriverGetVersion
raw version 0x2f3a
CUDA driver version: 12.9
calling cuDeviceGetCount
device count 1
time=2025-05-17T00:03:04.998+08:00 level=DEBUG source=gpu.go:125 msg="detected GPUs" count=1 library=C:\windows\system32\nvcuda.dll
[GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a] CUDA totalMem 20479 mb
[GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a] CUDA freeMem 19273 mb
[GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a] Compute Capability 8.6
time=2025-05-17T00:03:05.149+08:00 level=DEBUG source=amd_windows.go:34 msg="unable to load amdhip64_6.dll, please make sure to upgrade to the latest amd driver: The specified module could not be found."
releasing cuda driver library
releasing nvml library
time=2025-05-17T00:03:05.150+08:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a library=cuda variant=v12 compute=8.6 driver=12.9 name="NVIDIA GeForce RTX 3080" total="20.0 GiB" available="18.8 GiB"
[GIN] 2025/05/17 - 00:03:09 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/05/17 - 00:03:09 | 200 |       532.6µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/05/17 - 00:03:18 | 200 |            0s |       127.0.0.1 | HEAD     "/"
time=2025-05-17T00:03:18.473+08:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-17T00:03:18.489+08:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
[GIN] 2025/05/17 - 00:03:18 | 200 |     27.4373ms |       127.0.0.1 | POST     "/api/show"
time=2025-05-17T00:03:18.508+08:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-17T00:03:18.508+08:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="63.7 GiB" before.free="46.2 GiB" before.free_swap="43.3 GiB" now.total="63.7 GiB" now.free="46.2 GiB" now.free_swap="43.4 GiB"
time=2025-05-17T00:03:18.631+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a name="NVIDIA GeForce RTX 3080" overhead="0 B" before.total="20.0 GiB" before.free="18.8 GiB" now.total="20.0 GiB" now.free="18.1 GiB" now.used="1.9 GiB"
releasing nvml library
time=2025-05-17T00:03:18.640+08:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-17T00:03:18.648+08:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-17T00:03:18.648+08:00 level=DEBUG source=sched.go:227 msg="loading first model" model=E:\LLM\ollama_models\blobs\sha256-a8cc1361f3145dc01f6d77c6c82c9116b9ffe3c97b34716fe20418455876c40e
time=2025-05-17T00:03:18.648+08:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[18.1 GiB]"
time=2025-05-17T00:03:18.648+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3.vision.block_count default=0
time=2025-05-17T00:03:18.648+08:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="63.7 GiB" before.free="46.2 GiB" before.free_swap="43.4 GiB" now.total="63.7 GiB" now.free="46.2 GiB" now.free_swap="43.4 GiB"
time=2025-05-17T00:03:18.677+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a name="NVIDIA GeForce RTX 3080" overhead="0 B" before.total="20.0 GiB" before.free="18.1 GiB" now.total="20.0 GiB" now.free="18.1 GiB" now.used="1.9 GiB"
releasing nvml library
time=2025-05-17T00:03:18.680+08:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[18.1 GiB]"
time=2025-05-17T00:03:18.680+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3.vision.block_count default=0
time=2025-05-17T00:03:18.680+08:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="63.7 GiB" before.free="46.2 GiB" before.free_swap="43.4 GiB" now.total="63.7 GiB" now.free="46.2 GiB" now.free_swap="43.4 GiB"
time=2025-05-17T00:03:18.724+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a name="NVIDIA GeForce RTX 3080" overhead="0 B" before.total="20.0 GiB" before.free="18.1 GiB" now.total="20.0 GiB" now.free="18.1 GiB" now.used="1.9 GiB"
releasing nvml library
time=2025-05-17T00:03:18.725+08:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="63.7 GiB" before.free="46.2 GiB" before.free_swap="43.4 GiB" now.total="63.7 GiB" now.free="46.2 GiB" now.free_swap="43.4 GiB"
time=2025-05-17T00:03:18.754+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a name="NVIDIA GeForce RTX 3080" overhead="0 B" before.total="20.0 GiB" before.free="18.1 GiB" now.total="20.0 GiB" now.free="18.1 GiB" now.used="1.9 GiB"
releasing nvml library
time=2025-05-17T00:03:18.755+08:00 level=INFO source=server.go:106 msg="system memory" total="63.7 GiB" free="46.2 GiB" free_swap="43.4 GiB"
time=2025-05-17T00:03:18.755+08:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[18.1 GiB]"
time=2025-05-17T00:03:18.755+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3.vision.block_count default=0
time=2025-05-17T00:03:18.755+08:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="63.7 GiB" before.free="46.2 GiB" before.free_swap="43.4 GiB" now.total="63.7 GiB" now.free="46.2 GiB" now.free_swap="43.4 GiB"
time=2025-05-17T00:03:18.785+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a name="NVIDIA GeForce RTX 3080" overhead="0 B" before.total="20.0 GiB" before.free="18.1 GiB" now.total="20.0 GiB" now.free="18.1 GiB" now.used="1.9 GiB"
releasing nvml library
time=2025-05-17T00:03:18.786+08:00 level=INFO source=server.go:139 msg=offload library=cuda layers.requested=999 layers.model=41 layers.offload=41 layers.split="" memory.available="[18.1 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.2 GiB" memory.required.partial="11.2 GiB" memory.required.kv="1.2 GiB" memory.required.allocations="[11.2 GiB]" memory.weights.total="8.2 GiB" memory.weights.repeating="7.6 GiB" memory.weights.nonrepeating="608.6 MiB" memory.graph.full="1.0 GiB" memory.graph.partial="1.0 GiB"
time=2025-05-17T00:03:18.786+08:00 level=INFO source=server.go:186 msg="enabling flash attention"
time=2025-05-17T00:03:18.787+08:00 level=DEBUG source=server.go:263 msg="compatible gpu libraries" compatible="[cuda_v12 cuda_v11]"
llama_model_loader: loaded meta data with 27 key-value pairs and 443 tensors from E:\LLM\ollama_models\blobs\sha256-a8cc1361f3145dc01f6d77c6c82c9116b9ffe3c97b34716fe20418455876c40e (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 14B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3
llama_model_loader: - kv   4:                         general.size_label str              = 14B
llama_model_loader: - kv   5:                          qwen3.block_count u32              = 40
llama_model_loader: - kv   6:                       qwen3.context_length u32              = 40960
llama_model_loader: - kv   7:                     qwen3.embedding_length u32              = 5120
llama_model_loader: - kv   8:                  qwen3.feed_forward_length u32              = 17408
llama_model_loader: - kv   9:                 qwen3.attention.head_count u32              = 40
llama_model_loader: - kv  10:              qwen3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  11:                       qwen3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  12:     qwen3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  13:                 qwen3.attention.key_length u32              = 128
llama_model_loader: - kv  14:               qwen3.attention.value_length u32              = 128
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - kv  26:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type  f16:   40 tensors
llama_model_loader: - type q4_K:  221 tensors
llama_model_loader: - type q6_K:   21 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 8.63 GiB (5.02 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 14.77 B
print_info: general.name     = Qwen3 14B
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-05-17T00:03:18.924+08:00 level=DEBUG source=server.go:339 msg="adding gpu library" path=C:\application\ollama\OLLAMA_FILE\lib\ollama\cuda_v12
time=2025-05-17T00:03:18.924+08:00 level=DEBUG source=server.go:346 msg="adding gpu dependency paths" paths=[C:\application\ollama\OLLAMA_FILE\lib\ollama\cuda_v12]
time=2025-05-17T00:03:18.924+08:00 level=INFO source=server.go:410 msg="starting llama server" cmd="C:\\application\\ollama\\OLLAMA_FILE\\ollama.exe runner --model E:\\LLM\\ollama_models\\blobs\\sha256-a8cc1361f3145dc01f6d77c6c82c9116b9ffe3c97b34716fe20418455876c40e --ctx-size 16384 --batch-size 512 --n-gpu-layers 999 --verbose --threads 8 --flash-attn --kv-cache-type q8_0 --no-mmap --parallel 1 --port 59210"
time=2025-05-17T00:03:18.924+08:00 level=DEBUG source=server.go:429 msg=subprocess environment="[CUDA_HOME=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1 CUDA_PATH=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1 CUDA_PATH_V12_1=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1 CUDA_VISIBLE_DEVICES=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a OLLAMA_CONTEXT_LENGTH=32768 OLLAMA_DEBUG=1 OLLAMA_ENABLE_CUDA=1 OLLAMA_FLASH_ATTENTION=1 OLLAMA_HOST=127.0.0.1:11434 OLLAMA_INTEL_GPU=0 OLLAMA_KEEP_ALIVE=-1 OLLAMA_KV_CACHE_TYPE=q8_0 OLLAMA_MAX_LOADED_MODELS=3 OLLAMA_MAX_QUEUE=512 OLLAMA_MODELS=E:/LLM/ollama_models OLLAMA_MULTIUSER_CACHE=0 OLLAMA_NUM_PARALLEL=1 OLLAMA_ORIGINS=* OLLAMA_TMPDIR=E:/LLM/ollama_models/temp OLLAMA_USE_MLOCK=1 PATH=C:\\application\\ollama\\OLLAMA_FILE\\lib\\ollama\\cuda_v12;C:\\application\\ollama\\OLLAMA_FILE\\lib\\ollama\\cuda_v12;C:\\application\\ollama\\OLLAMA_FILE\\lib\\ollama;C:\\Program Files (x86)\\VMware\\VMware Workstation\\bin\\;C:\\Qt\\6.8.1\\mingw_64\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\libnvvp;C:\\Program Files\\Common Files\\Oracle\\Java\\javapath;C:\\Program Files (x86)\\Common Files\\Oracle\\Java\\java8path;C:\\Program Files (x86)\\Common Files\\Oracle\\Java\\javapath;C:\\windows\\system32;C:\\windows;C:\\windows\\System32\\Wbem;C:\\windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\windows\\System32\\OpenSSH\\;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Users\\Administrator\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Program Files\\HP\\OMEN-Broadcast\\Common;C:\\Program Files\\Microsoft VS Code\\bin;C:\\Program Files\\MATLAB\\R2024b\\runtime\\win64;C:\\Program Files\\MATLAB\\R2024b\\bin;C:\\ffmpeg\\bin;C:\\Program Files (x86)\\Wolfram Research\\WolframScript\\;C:\\Program Files\\dotnet\\;C:\\msys2;C:\\texlive\\2023\\bin\\windows;C:\\Program Files (x86)\\GnuPG\\bin;C:\\msys2\\mingw64\\bin;C:\\miniconda;C:\\miniconda\\Scripts;C:\\miniconda\\Library\\bin;C:\\Program Files (x86)\\Microsoft Visual Studio\\2022\\BuildTools\\VC\\Tools\\MSVC\\14.36.32532\\bin\\Hostx64\\x64;C:\\Program Files (x86)\\MATLAB\\MATLAB Runtime\\v851\\runtime\\win32;C:\\Program Files\\Git\\cmd;C:\\application\\syspath;C:\\Program Files\\nodejs\\;C:\\ProgramData\\chocolatey\\bin;C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2025.1.0\\;C:\\Program Files\\NVIDIA Corporation\\Nsight Systems 2025.1.1\\target-windows-x64;C:\\WINDOWS\\system32;C:\\WINDOWS;C:\\WINDOWS\\System32\\Wbem;C:\\WINDOWS\\System32\\WindowsPowerShell\\v1.0\\;C:\\WINDOWS\\System32\\OpenSSH\\;C:\\application\\ollama\\OLLAMA_FILE;C:\\Program Files\\Go\\bin;C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR;C:\\Users\\ngc13\\scoop\\shims;C:\\Users\\ngc13\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\ngc13\\AppData\\Local\\Programs\\Ollama;C:\\Users\\ngc13\\AppData\\Roaming\\npm;C:\\Users\\ngc13\\go\\bin;C:\\application\\ollama\\OLLAMA_FILE\\lib\\ollama OLLAMA_LIBRARY_PATH=C:\\application\\ollama\\OLLAMA_FILE\\lib\\ollama;C:\\application\\ollama\\OLLAMA_FILE\\lib\\ollama\\cuda_v12]"
time=2025-05-17T00:03:18.928+08:00 level=INFO source=sched.go:452 msg="loaded runners" count=1
time=2025-05-17T00:03:18.928+08:00 level=INFO source=server.go:589 msg="waiting for llama runner to start responding"
time=2025-05-17T00:03:18.928+08:00 level=INFO source=server.go:623 msg="waiting for server to become available" status="llm server error"
time=2025-05-17T00:03:18.950+08:00 level=INFO source=runner.go:853 msg="starting go runner"
time=2025-05-17T00:03:18.954+08:00 level=DEBUG source=ggml.go:93 msg="ggml backend load all from path" path=C:\application\ollama\OLLAMA_FILE\lib\ollama
load_backend: loaded CPU backend from C:\application\ollama\OLLAMA_FILE\lib\ollama\ggml-cpu-alderlake.dll
time=2025-05-17T00:03:18.968+08:00 level=DEBUG source=ggml.go:93 msg="ggml backend load all from path" path=C:\application\ollama\OLLAMA_FILE\lib\ollama\cuda_v12
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from C:\application\ollama\OLLAMA_FILE\lib\ollama\cuda_v12\ggml-cuda.dll
time=2025-05-17T00:03:19.041+08:00 level=INFO source=ggml.go:103 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-05-17T00:03:19.043+08:00 level=INFO source=runner.go:913 msg="Server listening on 127.0.0.1:59210"
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3080) - 19273 MiB free
llama_model_loader: loaded meta data with 27 key-value pairs and 443 tensors from E:\LLM\ollama_models\blobs\sha256-a8cc1361f3145dc01f6d77c6c82c9116b9ffe3c97b34716fe20418455876c40e (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 14B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3
llama_model_loader: - kv   4:                         general.size_label str              = 14B
llama_model_loader: - kv   5:                          qwen3.block_count u32              = 40
llama_model_loader: - kv   6:                       qwen3.context_length u32              = 40960
llama_model_loader: - kv   7:                     qwen3.embedding_length u32              = 5120
llama_model_loader: - kv   8:                  qwen3.feed_forward_length u32              = 17408
llama_model_loader: - kv   9:                 qwen3.attention.head_count u32              = 40
llama_model_loader: - kv  10:              qwen3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  11:                       qwen3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  12:     qwen3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  13:                 qwen3.attention.key_length u32              = 128
llama_model_loader: - kv  14:               qwen3.attention.value_length u32              = 128
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
time=2025-05-17T00:03:19.180+08:00 level=INFO source=server.go:623 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - kv  26:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type  f16:   40 tensors
llama_model_loader: - type q4_K:  221 tensors
llama_model_loader: - type q6_K:   21 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 8.63 GiB (5.02 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 40960
print_info: n_embd           = 5120
print_info: n_layer          = 40
print_info: n_head           = 40
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 5
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 17408
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 40960
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 14B
print_info: model params     = 14.77 B
print_info: general.name     = Qwen3 14B
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: layer   0 assigned to device CUDA0, is_swa = 0
load_tensors: layer   1 assigned to device CUDA0, is_swa = 0
load_tensors: layer   2 assigned to device CUDA0, is_swa = 0
load_tensors: layer   3 assigned to device CUDA0, is_swa = 0
load_tensors: layer   4 assigned to device CUDA0, is_swa = 0
load_tensors: layer   5 assigned to device CUDA0, is_swa = 0
load_tensors: layer   6 assigned to device CUDA0, is_swa = 0
load_tensors: layer   7 assigned to device CUDA0, is_swa = 0
load_tensors: layer   8 assigned to device CUDA0, is_swa = 0
load_tensors: layer   9 assigned to device CUDA0, is_swa = 0
load_tensors: layer  10 assigned to device CUDA0, is_swa = 0
load_tensors: layer  11 assigned to device CUDA0, is_swa = 0
load_tensors: layer  12 assigned to device CUDA0, is_swa = 0
load_tensors: layer  13 assigned to device CUDA0, is_swa = 0
load_tensors: layer  14 assigned to device CUDA0, is_swa = 0
load_tensors: layer  15 assigned to device CUDA0, is_swa = 0
load_tensors: layer  16 assigned to device CUDA0, is_swa = 0
load_tensors: layer  17 assigned to device CUDA0, is_swa = 0
load_tensors: layer  18 assigned to device CUDA0, is_swa = 0
load_tensors: layer  19 assigned to device CUDA0, is_swa = 0
load_tensors: layer  20 assigned to device CUDA0, is_swa = 0
load_tensors: layer  21 assigned to device CUDA0, is_swa = 0
load_tensors: layer  22 assigned to device CUDA0, is_swa = 0
load_tensors: layer  23 assigned to device CUDA0, is_swa = 0
load_tensors: layer  24 assigned to device CUDA0, is_swa = 0
load_tensors: layer  25 assigned to device CUDA0, is_swa = 0
load_tensors: layer  26 assigned to device CUDA0, is_swa = 0
load_tensors: layer  27 assigned to device CUDA0, is_swa = 0
load_tensors: layer  28 assigned to device CUDA0, is_swa = 0
load_tensors: layer  29 assigned to device CUDA0, is_swa = 0
load_tensors: layer  30 assigned to device CUDA0, is_swa = 0
load_tensors: layer  31 assigned to device CUDA0, is_swa = 0
load_tensors: layer  32 assigned to device CUDA0, is_swa = 0
load_tensors: layer  33 assigned to device CUDA0, is_swa = 0
load_tensors: layer  34 assigned to device CUDA0, is_swa = 0
load_tensors: layer  35 assigned to device CUDA0, is_swa = 0
load_tensors: layer  36 assigned to device CUDA0, is_swa = 0
load_tensors: layer  37 assigned to device CUDA0, is_swa = 0
load_tensors: layer  38 assigned to device CUDA0, is_swa = 0
load_tensors: layer  39 assigned to device CUDA0, is_swa = 0
load_tensors: layer  40 assigned to device CUDA0, is_swa = 0
load_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
load_tensors: offloading 40 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 41/41 layers to GPU
load_tensors:          CPU model buffer size =   417.30 MiB
load_tensors:        CUDA0 model buffer size =  8423.47 MiB
load_all_data: no device found for buffer type CPU for async uploads
load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0
time=2025-05-17T00:03:19.431+08:00 level=DEBUG source=server.go:634 msg="model load progress 0.05"
time=2025-05-17T00:03:19.932+08:00 level=DEBUG source=server.go:634 msg="model load progress 0.14"
time=2025-05-17T00:03:20.183+08:00 level=DEBUG source=server.go:634 msg="model load progress 0.20"
time=2025-05-17T00:03:20.433+08:00 level=DEBUG source=server.go:634 msg="model load progress 0.26"
time=2025-05-17T00:03:20.685+08:00 level=DEBUG source=server.go:634 msg="model load progress 0.32"
time=2025-05-17T00:03:20.935+08:00 level=DEBUG source=server.go:634 msg="model load progress 0.37"
time=2025-05-17T00:03:21.186+08:00 level=DEBUG source=server.go:634 msg="model load progress 0.43"
time=2025-05-17T00:03:21.436+08:00 level=DEBUG source=server.go:634 msg="model load progress 0.49"
time=2025-05-17T00:03:21.687+08:00 level=DEBUG source=server.go:634 msg="model load progress 0.54"
time=2025-05-17T00:03:21.938+08:00 level=DEBUG source=server.go:634 msg="model load progress 0.61"
time=2025-05-17T00:03:22.188+08:00 level=DEBUG source=server.go:634 msg="model load progress 0.67"
time=2025-05-17T00:03:22.438+08:00 level=DEBUG source=server.go:634 msg="model load progress 0.72"
time=2025-05-17T00:03:22.689+08:00 level=DEBUG source=server.go:634 msg="model load progress 0.78"
time=2025-05-17T00:03:22.940+08:00 level=DEBUG source=server.go:634 msg="model load progress 0.84"
time=2025-05-17T00:03:23.190+08:00 level=DEBUG source=server.go:634 msg="model load progress 0.89"
time=2025-05-17T00:03:23.441+08:00 level=DEBUG source=server.go:634 msg="model load progress 0.96"
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 16384
llama_context: n_ctx_per_seq = 16384
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (16384) < n_ctx_train (40960) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context:  CUDA_Host  output buffer size =     0.60 MiB
llama_context: n_ctx = 16384
llama_context: n_ctx = 16384 (padded)
init: kv_size = 16384, offload = 1, type_k = 'q8_0', type_v = 'q8_0', n_layer = 40, can_shift = 1
init: layer   0: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer   1: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer   2: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer   3: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer   4: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer   5: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer   6: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer   7: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer   8: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer   9: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  10: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  11: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  12: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  13: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  14: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  15: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  16: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  17: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  18: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  19: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  20: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  21: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  22: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  23: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  24: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  25: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  26: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  27: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  28: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  29: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  30: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  31: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  32: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  33: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  34: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  35: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  36: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  37: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  38: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  39: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init:      CUDA0 KV buffer size =  1360.00 MiB
llama_context: KV self size  = 1360.00 MiB, K (q8_0):  680.00 MiB, V (q8_0):  680.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 65536
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context: reserving graph for n_tokens = 1, n_seqs = 1
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context:      CUDA0 compute buffer size =   306.75 MiB
llama_context:  CUDA_Host compute buffer size =    42.01 MiB
llama_context: graph nodes  = 1367
llama_context: graph splits = 2
time=2025-05-17T00:03:23.692+08:00 level=INFO source=server.go:628 msg="llama runner started in 4.76 seconds"
time=2025-05-17T00:03:23.692+08:00 level=DEBUG source=sched.go:464 msg="finished setting up runner" model=E:\LLM\ollama_models\blobs\sha256-a8cc1361f3145dc01f6d77c6c82c9116b9ffe3c97b34716fe20418455876c40e
[GIN] 2025/05/17 - 00:03:23 | 200 |    5.1957925s |       127.0.0.1 | POST     "/api/generate"
time=2025-05-17T00:03:23.692+08:00 level=DEBUG source=sched.go:472 msg="context for request finished"
time=2025-05-17T00:03:23.692+08:00 level=DEBUG source=sched.go:342 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/qwen3:14b-16k runner.inference=cuda runner.devices=1 runner.size="11.2 GiB" runner.vram="11.2 GiB" runner.num_ctx=16384 runner.parallel=1 runner.pid=28284 runner.model=E:\LLM\ollama_models\blobs\sha256-a8cc1361f3145dc01f6d77c6c82c9116b9ffe3c97b34716fe20418455876c40e duration=2562047h47m16.854775807s
time=2025-05-17T00:03:23.692+08:00 level=DEBUG source=sched.go:360 msg="after processing request finished event" runner.name=registry.ollama.ai/library/qwen3:14b-16k runner.inference=cuda runner.devices=1 runner.size="11.2 GiB" runner.vram="11.2 GiB" runner.num_ctx=16384 runner.parallel=1 runner.pid=28284 runner.model=E:\LLM\ollama_models\blobs\sha256-a8cc1361f3145dc01f6d77c6c82c9116b9ffe3c97b34716fe20418455876c40e refCount=0
[GIN] 2025/05/17 - 00:03:30 | 200 |            0s |       127.0.0.1 | HEAD     "/"
time=2025-05-17T00:03:30.819+08:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-17T00:03:30.830+08:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
[GIN] 2025/05/17 - 00:03:30 | 200 |     18.9695ms |       127.0.0.1 | POST     "/api/show"
time=2025-05-17T00:03:30.845+08:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-17T00:03:30.846+08:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="63.7 GiB" before.free="46.2 GiB" before.free_swap="43.4 GiB" now.total="63.7 GiB" now.free="45.4 GiB" now.free_swap="32.0 GiB"
time=2025-05-17T00:03:30.964+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a name="NVIDIA GeForce RTX 3080" overhead="0 B" before.total="20.0 GiB" before.free="18.1 GiB" now.total="20.0 GiB" now.free="8.1 GiB" now.used="11.9 GiB"
releasing nvml library
time=2025-05-17T00:03:30.978+08:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-17T00:03:30.985+08:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-17T00:03:30.986+08:00 level=DEBUG source=sched.go:506 msg="gpu reported" gpu=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a library=cuda available="8.1 GiB"
time=2025-05-17T00:03:30.986+08:00 level=INFO source=sched.go:517 msg="updated VRAM based on existing loaded models" gpu=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a library=cuda total="20.0 GiB" available="8.1 GiB"
time=2025-05-17T00:03:30.986+08:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[8.1 GiB]"
time=2025-05-17T00:03:30.986+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.vision.block_count default=0
time=2025-05-17T00:03:30.986+08:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="63.7 GiB" before.free="45.4 GiB" before.free_swap="32.0 GiB" now.total="63.7 GiB" now.free="45.4 GiB" now.free_swap="32.0 GiB"
time=2025-05-17T00:03:31.026+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a name="NVIDIA GeForce RTX 3080" overhead="0 B" before.total="20.0 GiB" before.free="8.1 GiB" now.total="20.0 GiB" now.free="8.1 GiB" now.used="11.9 GiB"
releasing nvml library
time=2025-05-17T00:03:31.028+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-05-17T00:03:31.029+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-05-17T00:03:31.029+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-05-17T00:03:31.029+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-05-17T00:03:31.029+08:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[8.1 GiB]"
time=2025-05-17T00:03:31.029+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.vision.block_count default=0
time=2025-05-17T00:03:31.030+08:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="63.7 GiB" before.free="45.4 GiB" before.free_swap="32.0 GiB" now.total="63.7 GiB" now.free="45.4 GiB" now.free_swap="32.0 GiB"
time=2025-05-17T00:03:31.073+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a name="NVIDIA GeForce RTX 3080" overhead="0 B" before.total="20.0 GiB" before.free="8.1 GiB" now.total="20.0 GiB" now.free="8.1 GiB" now.used="11.9 GiB"
releasing nvml library
time=2025-05-17T00:03:31.074+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-05-17T00:03:31.074+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-05-17T00:03:31.074+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-05-17T00:03:31.074+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-05-17T00:03:31.074+08:00 level=DEBUG source=sched.go:824 msg="found an idle runner to unload" runner.name=registry.ollama.ai/library/qwen3:14b-16k runner.inference=cuda runner.devices=1 runner.size="11.2 GiB" runner.vram="11.2 GiB" runner.num_ctx=16384 runner.parallel=1 runner.pid=28284 runner.model=E:\LLM\ollama_models\blobs\sha256-a8cc1361f3145dc01f6d77c6c82c9116b9ffe3c97b34716fe20418455876c40e
time=2025-05-17T00:03:31.074+08:00 level=DEBUG source=sched.go:286 msg="resetting model to expire immediately to make room" runner.name=registry.ollama.ai/library/qwen3:14b-16k runner.inference=cuda runner.devices=1 runner.size="11.2 GiB" runner.vram="11.2 GiB" runner.num_ctx=16384 runner.parallel=1 runner.pid=28284 runner.model=E:\LLM\ollama_models\blobs\sha256-a8cc1361f3145dc01f6d77c6c82c9116b9ffe3c97b34716fe20418455876c40e refCount=0
time=2025-05-17T00:03:31.074+08:00 level=DEBUG source=sched.go:299 msg="waiting for pending requests to complete and unload to occur" modelPath=E:\LLM\ollama_models\blobs\sha256-a8cc1361f3145dc01f6d77c6c82c9116b9ffe3c97b34716fe20418455876c40e
time=2025-05-17T00:03:31.074+08:00 level=DEBUG source=sched.go:363 msg="runner expired event received" runner.name=registry.ollama.ai/library/qwen3:14b-16k runner.inference=cuda runner.devices=1 runner.size="11.2 GiB" runner.vram="11.2 GiB" runner.num_ctx=16384 runner.parallel=1 runner.pid=28284 runner.model=E:\LLM\ollama_models\blobs\sha256-a8cc1361f3145dc01f6d77c6c82c9116b9ffe3c97b34716fe20418455876c40e
time=2025-05-17T00:03:31.074+08:00 level=DEBUG source=sched.go:378 msg="got lock to unload" runner.name=registry.ollama.ai/library/qwen3:14b-16k runner.inference=cuda runner.devices=1 runner.size="11.2 GiB" runner.vram="11.2 GiB" runner.num_ctx=16384 runner.parallel=1 runner.pid=28284 runner.model=E:\LLM\ollama_models\blobs\sha256-a8cc1361f3145dc01f6d77c6c82c9116b9ffe3c97b34716fe20418455876c40e
time=2025-05-17T00:03:31.074+08:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="63.7 GiB" before.free="45.4 GiB" before.free_swap="32.0 GiB" now.total="63.7 GiB" now.free="45.4 GiB" now.free_swap="32.0 GiB"
time=2025-05-17T00:03:31.104+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a name="NVIDIA GeForce RTX 3080" overhead="0 B" before.total="20.0 GiB" before.free="8.1 GiB" now.total="20.0 GiB" now.free="8.1 GiB" now.used="11.9 GiB"
releasing nvml library
time=2025-05-17T00:03:31.121+08:00 level=DEBUG source=server.go:1017 msg="stopping llama server"
time=2025-05-17T00:03:31.121+08:00 level=DEBUG source=server.go:1023 msg="waiting for llama server to exit"
time=2025-05-17T00:03:31.248+08:00 level=DEBUG source=server.go:1027 msg="llama server stopped"
time=2025-05-17T00:03:31.248+08:00 level=DEBUG source=sched.go:383 msg="runner released" runner="LogValue panicked\ncalled from runtime.panicmem (C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/panic.go:262)\ncalled from runtime.sigpanic (C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/signal_windows.go:401)\ncalled from github.com/ollama/ollama/server.(*runnerRef).LogValue (C:/a/ollama/ollama/server/sched.go:694)\ncalled from log/slog.Value.Resolve (C:/hostedtoolcache/windows/go/1.24.0/x64/src/log/slog/value.go:512)\ncalled from log/slog.(*handleState).appendAttr (C:/hostedtoolcache/windows/go/1.24.0/x64/src/log/slog/handler.go:468)\n(rest of stack elided)\n"
time=2025-05-17T00:03:31.355+08:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="63.7 GiB" before.free="45.4 GiB" before.free_swap="32.0 GiB" now.total="63.7 GiB" now.free="46.2 GiB" now.free_swap="43.4 GiB"
time=2025-05-17T00:03:31.397+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a name="NVIDIA GeForce RTX 3080" overhead="0 B" before.total="20.0 GiB" before.free="8.1 GiB" now.total="20.0 GiB" now.free="18.1 GiB" now.used="1.9 GiB"
releasing nvml library
time=2025-05-17T00:03:31.398+08:00 level=DEBUG source=sched.go:668 msg="gpu VRAM free memory converged after 0.32 seconds" model=E:\LLM\ollama_models\blobs\sha256-a8cc1361f3145dc01f6d77c6c82c9116b9ffe3c97b34716fe20418455876c40e
time=2025-05-17T00:03:31.398+08:00 level=DEBUG source=sched.go:387 msg="sending an unloaded event" runner="LogValue panicked\ncalled from runtime.panicmem (C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/panic.go:262)\ncalled from runtime.sigpanic (C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/signal_windows.go:401)\ncalled from github.com/ollama/ollama/server.(*runnerRef).LogValue (C:/a/ollama/ollama/server/sched.go:694)\ncalled from log/slog.Value.Resolve (C:/hostedtoolcache/windows/go/1.24.0/x64/src/log/slog/value.go:512)\ncalled from log/slog.(*handleState).appendAttr (C:/hostedtoolcache/windows/go/1.24.0/x64/src/log/slog/handler.go:468)\n(rest of stack elided)\n"
time=2025-05-17T00:03:31.399+08:00 level=DEBUG source=sched.go:305 msg="unload completed" modelPath=E:\LLM\ollama_models\blobs\sha256-a8cc1361f3145dc01f6d77c6c82c9116b9ffe3c97b34716fe20418455876c40e
time=2025-05-17T00:03:31.399+08:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="63.7 GiB" before.free="46.2 GiB" before.free_swap="43.4 GiB" now.total="63.7 GiB" now.free="46.2 GiB" now.free_swap="43.4 GiB"
time=2025-05-17T00:03:31.428+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a name="NVIDIA GeForce RTX 3080" overhead="0 B" before.total="20.0 GiB" before.free="18.1 GiB" now.total="20.0 GiB" now.free="18.1 GiB" now.used="1.9 GiB"
releasing nvml library
time=2025-05-17T00:03:31.439+08:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-17T00:03:31.451+08:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-17T00:03:31.451+08:00 level=DEBUG source=sched.go:227 msg="loading first model" model=E:\LLM\ollama_models\blobs\sha256-29d8c98fa6b098e200069bfb88b9508dc3e85586d20cba59f8dda9a808165104
time=2025-05-17T00:03:31.451+08:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[18.1 GiB]"
time=2025-05-17T00:03:31.451+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.vision.block_count default=0
time=2025-05-17T00:03:31.451+08:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="63.7 GiB" before.free="46.2 GiB" before.free_swap="43.4 GiB" now.total="63.7 GiB" now.free="46.2 GiB" now.free_swap="43.4 GiB"
time=2025-05-17T00:03:31.490+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a name="NVIDIA GeForce RTX 3080" overhead="0 B" before.total="20.0 GiB" before.free="18.1 GiB" now.total="20.0 GiB" now.free="18.1 GiB" now.used="1.9 GiB"
releasing nvml library
time=2025-05-17T00:03:31.491+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-05-17T00:03:31.491+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-05-17T00:03:31.491+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-05-17T00:03:31.491+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-05-17T00:03:31.491+08:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[18.1 GiB]"
time=2025-05-17T00:03:31.491+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.vision.block_count default=0
time=2025-05-17T00:03:31.491+08:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="63.7 GiB" before.free="46.2 GiB" before.free_swap="43.4 GiB" now.total="63.7 GiB" now.free="46.2 GiB" now.free_swap="43.4 GiB"
time=2025-05-17T00:03:31.521+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a name="NVIDIA GeForce RTX 3080" overhead="0 B" before.total="20.0 GiB" before.free="18.1 GiB" now.total="20.0 GiB" now.free="18.1 GiB" now.used="1.9 GiB"
releasing nvml library
time=2025-05-17T00:03:31.523+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-05-17T00:03:31.523+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-05-17T00:03:31.523+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-05-17T00:03:31.523+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-05-17T00:03:31.523+08:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="63.7 GiB" before.free="46.2 GiB" before.free_swap="43.4 GiB" now.total="63.7 GiB" now.free="46.2 GiB" now.free_swap="43.4 GiB"
time=2025-05-17T00:03:31.552+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a name="NVIDIA GeForce RTX 3080" overhead="0 B" before.total="20.0 GiB" before.free="18.1 GiB" now.total="20.0 GiB" now.free="18.1 GiB" now.used="1.9 GiB"
releasing nvml library
time=2025-05-17T00:03:31.554+08:00 level=INFO source=server.go:106 msg="system memory" total="63.7 GiB" free="46.2 GiB" free_swap="43.4 GiB"
time=2025-05-17T00:03:31.554+08:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[18.1 GiB]"
time=2025-05-17T00:03:31.555+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.vision.block_count default=0
time=2025-05-17T00:03:31.555+08:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="63.7 GiB" before.free="46.2 GiB" before.free_swap="43.4 GiB" now.total="63.7 GiB" now.free="46.2 GiB" now.free_swap="43.4 GiB"
time=2025-05-17T00:03:31.584+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a name="NVIDIA GeForce RTX 3080" overhead="0 B" before.total="20.0 GiB" before.free="18.1 GiB" now.total="20.0 GiB" now.free="18.1 GiB" now.used="1.9 GiB"
releasing nvml library
time=2025-05-17T00:03:31.584+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-05-17T00:03:31.584+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-05-17T00:03:31.584+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-05-17T00:03:31.584+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-05-17T00:03:31.584+08:00 level=INFO source=server.go:139 msg=offload library=cuda layers.requested=999 layers.model=29 layers.offload=29 layers.split="" memory.available="[18.1 GiB]" memory.gpu_overhead="0 B" memory.required.full="2.7 GiB" memory.required.partial="2.7 GiB" memory.required.kv="448.0 MiB" memory.required.allocations="[2.7 GiB]" memory.weights.total="934.7 MiB" memory.weights.repeating="752.1 MiB" memory.weights.nonrepeating="182.6 MiB" memory.graph.full="838.0 MiB" memory.graph.partial="1.0 GiB"
time=2025-05-17T00:03:31.584+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-05-17T00:03:31.584+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-05-17T00:03:31.585+08:00 level=INFO source=server.go:186 msg="enabling flash attention"
time=2025-05-17T00:03:31.585+08:00 level=DEBUG source=server.go:263 msg="compatible gpu libraries" compatible="[cuda_v12 cuda_v11]"
llama_model_loader: loaded meta data with 34 key-value pairs and 338 tensors from E:\LLM\ollama_models\blobs\sha256-29d8c98fa6b098e200069bfb88b9508dc3e85586d20cba59f8dda9a808165104 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2.5 Coder 1.5B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen2.5-Coder
llama_model_loader: - kv   5:                         general.size_label str              = 1.5B
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                       general.license.link str              = https://huggingface.co/Qwen/Qwen2.5-C...
llama_model_loader: - kv   8:                   general.base_model.count u32              = 1
llama_model_loader: - kv   9:                  general.base_model.0.name str              = Qwen2.5 Coder 1.5B
llama_model_loader: - kv  10:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  11:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2.5-C...
llama_model_loader: - kv  12:                               general.tags arr[str,6]       = ["code", "codeqwen", "chat", "qwen", ...
llama_model_loader: - kv  13:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  14:                          qwen2.block_count u32              = 28
llama_model_loader: - kv  15:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv  16:                     qwen2.embedding_length u32              = 1536
llama_model_loader: - kv  17:                  qwen2.feed_forward_length u32              = 8960
llama_model_loader: - kv  18:                 qwen2.attention.head_count u32              = 12
llama_model_loader: - kv  19:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv  20:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  21:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  22:                          general.file_type u32              = 15
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  26:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  31:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  33:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q4_K:  168 tensors
llama_model_loader: - type q6_K:   29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 934.69 MiB (5.08 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 1.54 B
print_info: general.name     = Qwen2.5 Coder 1.5B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-05-17T00:03:31.737+08:00 level=DEBUG source=server.go:339 msg="adding gpu library" path=C:\application\ollama\OLLAMA_FILE\lib\ollama\cuda_v12
time=2025-05-17T00:03:31.737+08:00 level=DEBUG source=server.go:346 msg="adding gpu dependency paths" paths=[C:\application\ollama\OLLAMA_FILE\lib\ollama\cuda_v12]
time=2025-05-17T00:03:31.737+08:00 level=INFO source=server.go:410 msg="starting llama server" cmd="C:\\application\\ollama\\OLLAMA_FILE\\ollama.exe runner --model E:\\LLM\\ollama_models\\blobs\\sha256-29d8c98fa6b098e200069bfb88b9508dc3e85586d20cba59f8dda9a808165104 --ctx-size 32768 --batch-size 512 --n-gpu-layers 999 --verbose --threads 8 --flash-attn --kv-cache-type q8_0 --no-mmap --parallel 1 --port 59218"
time=2025-05-17T00:03:31.737+08:00 level=DEBUG source=server.go:429 msg=subprocess environment="[CUDA_HOME=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1 CUDA_PATH=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1 CUDA_PATH_V12_1=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1 CUDA_VISIBLE_DEVICES=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a OLLAMA_CONTEXT_LENGTH=32768 OLLAMA_DEBUG=1 OLLAMA_ENABLE_CUDA=1 OLLAMA_FLASH_ATTENTION=1 OLLAMA_HOST=127.0.0.1:11434 OLLAMA_INTEL_GPU=0 OLLAMA_KEEP_ALIVE=-1 OLLAMA_KV_CACHE_TYPE=q8_0 OLLAMA_MAX_LOADED_MODELS=3 OLLAMA_MAX_QUEUE=512 OLLAMA_MODELS=E:/LLM/ollama_models OLLAMA_MULTIUSER_CACHE=0 OLLAMA_NUM_PARALLEL=1 OLLAMA_ORIGINS=* OLLAMA_TMPDIR=E:/LLM/ollama_models/temp OLLAMA_USE_MLOCK=1 PATH=C:\\application\\ollama\\OLLAMA_FILE\\lib\\ollama\\cuda_v12;C:\\application\\ollama\\OLLAMA_FILE\\lib\\ollama\\cuda_v12;C:\\application\\ollama\\OLLAMA_FILE\\lib\\ollama;C:\\Program Files (x86)\\VMware\\VMware Workstation\\bin\\;C:\\Qt\\6.8.1\\mingw_64\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\libnvvp;C:\\Program Files\\Common Files\\Oracle\\Java\\javapath;C:\\Program Files (x86)\\Common Files\\Oracle\\Java\\java8path;C:\\Program Files (x86)\\Common Files\\Oracle\\Java\\javapath;C:\\windows\\system32;C:\\windows;C:\\windows\\System32\\Wbem;C:\\windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\windows\\System32\\OpenSSH\\;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Users\\Administrator\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Program Files\\HP\\OMEN-Broadcast\\Common;C:\\Program Files\\Microsoft VS Code\\bin;C:\\Program Files\\MATLAB\\R2024b\\runtime\\win64;C:\\Program Files\\MATLAB\\R2024b\\bin;C:\\ffmpeg\\bin;C:\\Program Files (x86)\\Wolfram Research\\WolframScript\\;C:\\Program Files\\dotnet\\;C:\\msys2;C:\\texlive\\2023\\bin\\windows;C:\\Program Files (x86)\\GnuPG\\bin;C:\\msys2\\mingw64\\bin;C:\\miniconda;C:\\miniconda\\Scripts;C:\\miniconda\\Library\\bin;C:\\Program Files (x86)\\Microsoft Visual Studio\\2022\\BuildTools\\VC\\Tools\\MSVC\\14.36.32532\\bin\\Hostx64\\x64;C:\\Program Files (x86)\\MATLAB\\MATLAB Runtime\\v851\\runtime\\win32;C:\\Program Files\\Git\\cmd;C:\\application\\syspath;C:\\Program Files\\nodejs\\;C:\\ProgramData\\chocolatey\\bin;C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2025.1.0\\;C:\\Program Files\\NVIDIA Corporation\\Nsight Systems 2025.1.1\\target-windows-x64;C:\\WINDOWS\\system32;C:\\WINDOWS;C:\\WINDOWS\\System32\\Wbem;C:\\WINDOWS\\System32\\WindowsPowerShell\\v1.0\\;C:\\WINDOWS\\System32\\OpenSSH\\;C:\\application\\ollama\\OLLAMA_FILE;C:\\Program Files\\Go\\bin;C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR;C:\\Users\\ngc13\\scoop\\shims;C:\\Users\\ngc13\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\ngc13\\AppData\\Local\\Programs\\Ollama;C:\\Users\\ngc13\\AppData\\Roaming\\npm;C:\\Users\\ngc13\\go\\bin;C:\\application\\ollama\\OLLAMA_FILE\\lib\\ollama OLLAMA_LIBRARY_PATH=C:\\application\\ollama\\OLLAMA_FILE\\lib\\ollama;C:\\application\\ollama\\OLLAMA_FILE\\lib\\ollama\\cuda_v12]"
time=2025-05-17T00:03:31.742+08:00 level=INFO source=sched.go:452 msg="loaded runners" count=1
time=2025-05-17T00:03:31.742+08:00 level=INFO source=server.go:589 msg="waiting for llama runner to start responding"
time=2025-05-17T00:03:31.742+08:00 level=INFO source=server.go:623 msg="waiting for server to become available" status="llm server error"
time=2025-05-17T00:03:31.765+08:00 level=INFO source=runner.go:853 msg="starting go runner"
time=2025-05-17T00:03:31.769+08:00 level=DEBUG source=ggml.go:93 msg="ggml backend load all from path" path=C:\application\ollama\OLLAMA_FILE\lib\ollama
load_backend: loaded CPU backend from C:\application\ollama\OLLAMA_FILE\lib\ollama\ggml-cpu-alderlake.dll
time=2025-05-17T00:03:31.780+08:00 level=DEBUG source=ggml.go:93 msg="ggml backend load all from path" path=C:\application\ollama\OLLAMA_FILE\lib\ollama\cuda_v12
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from C:\application\ollama\OLLAMA_FILE\lib\ollama\cuda_v12\ggml-cuda.dll
time=2025-05-17T00:03:31.857+08:00 level=INFO source=ggml.go:103 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-05-17T00:03:31.858+08:00 level=INFO source=runner.go:913 msg="Server listening on 127.0.0.1:59218"
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3080) - 19273 MiB free
llama_model_loader: loaded meta data with 34 key-value pairs and 338 tensors from E:\LLM\ollama_models\blobs\sha256-29d8c98fa6b098e200069bfb88b9508dc3e85586d20cba59f8dda9a808165104 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2.5 Coder 1.5B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen2.5-Coder
llama_model_loader: - kv   5:                         general.size_label str              = 1.5B
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                       general.license.link str              = https://huggingface.co/Qwen/Qwen2.5-C...
llama_model_loader: - kv   8:                   general.base_model.count u32              = 1
llama_model_loader: - kv   9:                  general.base_model.0.name str              = Qwen2.5 Coder 1.5B
llama_model_loader: - kv  10:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  11:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2.5-C...
llama_model_loader: - kv  12:                               general.tags arr[str,6]       = ["code", "codeqwen", "chat", "qwen", ...
llama_model_loader: - kv  13:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  14:                          qwen2.block_count u32              = 28
llama_model_loader: - kv  15:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv  16:                     qwen2.embedding_length u32              = 1536
llama_model_loader: - kv  17:                  qwen2.feed_forward_length u32              = 8960
llama_model_loader: - kv  18:                 qwen2.attention.head_count u32              = 12
llama_model_loader: - kv  19:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv  20:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  21:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  22:                          general.file_type u32              = 15
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwen2
time=2025-05-17T00:03:31.994+08:00 level=INFO source=server.go:623 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  26:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  31:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  33:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q4_K:  168 tensors
llama_model_loader: - type q6_K:   29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 934.69 MiB (5.08 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 1536
print_info: n_layer          = 28
print_info: n_head           = 12
print_info: n_head_kv        = 2
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 6
print_info: n_embd_k_gqa     = 256
print_info: n_embd_v_gqa     = 256
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 8960
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 1.5B
print_info: model params     = 1.54 B
print_info: general.name     = Qwen2.5 Coder 1.5B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: layer   0 assigned to device CUDA0, is_swa = 0
load_tensors: layer   1 assigned to device CUDA0, is_swa = 0
load_tensors: layer   2 assigned to device CUDA0, is_swa = 0
load_tensors: layer   3 assigned to device CUDA0, is_swa = 0
load_tensors: layer   4 assigned to device CUDA0, is_swa = 0
load_tensors: layer   5 assigned to device CUDA0, is_swa = 0
load_tensors: layer   6 assigned to device CUDA0, is_swa = 0
load_tensors: layer   7 assigned to device CUDA0, is_swa = 0
load_tensors: layer   8 assigned to device CUDA0, is_swa = 0
load_tensors: layer   9 assigned to device CUDA0, is_swa = 0
load_tensors: layer  10 assigned to device CUDA0, is_swa = 0
load_tensors: layer  11 assigned to device CUDA0, is_swa = 0
load_tensors: layer  12 assigned to device CUDA0, is_swa = 0
load_tensors: layer  13 assigned to device CUDA0, is_swa = 0
load_tensors: layer  14 assigned to device CUDA0, is_swa = 0
load_tensors: layer  15 assigned to device CUDA0, is_swa = 0
load_tensors: layer  16 assigned to device CUDA0, is_swa = 0
load_tensors: layer  17 assigned to device CUDA0, is_swa = 0
load_tensors: layer  18 assigned to device CUDA0, is_swa = 0
load_tensors: layer  19 assigned to device CUDA0, is_swa = 0
load_tensors: layer  20 assigned to device CUDA0, is_swa = 0
load_tensors: layer  21 assigned to device CUDA0, is_swa = 0
load_tensors: layer  22 assigned to device CUDA0, is_swa = 0
load_tensors: layer  23 assigned to device CUDA0, is_swa = 0
load_tensors: layer  24 assigned to device CUDA0, is_swa = 0
load_tensors: layer  25 assigned to device CUDA0, is_swa = 0
load_tensors: layer  26 assigned to device CUDA0, is_swa = 0
load_tensors: layer  27 assigned to device CUDA0, is_swa = 0
load_tensors: layer  28 assigned to device CUDA0, is_swa = 0
load_tensors: tensor 'token_embd.weight' (q6_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors:          CPU model buffer size =   182.57 MiB
load_tensors:        CUDA0 model buffer size =   934.70 MiB
load_all_data: no device found for buffer type CPU for async uploads
load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0
time=2025-05-17T00:03:32.244+08:00 level=DEBUG source=server.go:634 msg="model load progress 0.36"
time=2025-05-17T00:03:32.495+08:00 level=DEBUG source=server.go:634 msg="model load progress 0.81"
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 32768
llama_context: n_ctx_per_seq = 32768
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
set_abort_callback: call
llama_context:  CUDA_Host  output buffer size =     0.59 MiB
llama_context: n_ctx = 32768
llama_context: n_ctx = 32768 (padded)
init: kv_size = 32768, offload = 1, type_k = 'q8_0', type_v = 'q8_0', n_layer = 28, can_shift = 1
init: layer   0: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0
init: layer   1: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0
init: layer   2: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0
init: layer   3: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0
init: layer   4: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0
init: layer   5: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0
init: layer   6: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0
init: layer   7: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0
init: layer   8: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0
init: layer   9: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0
init: layer  10: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0
init: layer  11: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0
init: layer  12: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0
init: layer  13: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0
init: layer  14: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0
init: layer  15: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0
init: layer  16: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0
init: layer  17: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0
init: layer  18: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0
init: layer  19: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0
init: layer  20: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0
init: layer  21: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0
init: layer  22: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0
init: layer  23: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0
init: layer  24: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0
init: layer  25: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0
init: layer  26: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0
init: layer  27: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0
init:      CUDA0 KV buffer size =   476.00 MiB
llama_context: KV self size  =  476.00 MiB, K (q8_0):  238.00 MiB, V (q8_0):  238.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 65536
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context: reserving graph for n_tokens = 1, n_seqs = 1
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context:      CUDA0 compute buffer size =   299.75 MiB
llama_context:  CUDA_Host compute buffer size =    67.01 MiB
llama_context: graph nodes  = 931
llama_context: graph splits = 2
time=2025-05-17T00:03:32.745+08:00 level=INFO source=server.go:628 msg="llama runner started in 1.00 seconds"
time=2025-05-17T00:03:32.745+08:00 level=DEBUG source=sched.go:464 msg="finished setting up runner" model=E:\LLM\ollama_models\blobs\sha256-29d8c98fa6b098e200069bfb88b9508dc3e85586d20cba59f8dda9a808165104
time=2025-05-17T00:03:32.745+08:00 level=DEBUG source=sched.go:472 msg="context for request finished"
[GIN] 2025/05/17 - 00:03:32 | 200 |    1.9096022s |       127.0.0.1 | POST     "/api/generate"
time=2025-05-17T00:03:32.745+08:00 level=DEBUG source=sched.go:342 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/qwen2.5-coder:1.5b-gpu runner.inference=cuda runner.devices=1 runner.size="2.7 GiB" runner.vram="2.7 GiB" runner.num_ctx=32768 runner.parallel=1 runner.pid=29412 runner.model=E:\LLM\ollama_models\blobs\sha256-29d8c98fa6b098e200069bfb88b9508dc3e85586d20cba59f8dda9a808165104 duration=2562047h47m16.854775807s
time=2025-05-17T00:03:32.745+08:00 level=DEBUG source=sched.go:360 msg="after processing request finished event" runner.name=registry.ollama.ai/library/qwen2.5-coder:1.5b-gpu runner.inference=cuda runner.devices=1 runner.size="2.7 GiB" runner.vram="2.7 GiB" runner.num_ctx=32768 runner.parallel=1 runner.pid=29412 runner.model=E:\LLM\ollama_models\blobs\sha256-29d8c98fa6b098e200069bfb88b9508dc3e85586d20cba59f8dda9a808165104 refCount=0
[GIN] 2025/05/17 - 00:03:35 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/05/17 - 00:03:35 | 200 |       544.5µs |       127.0.0.1 | GET      "/api/ps"
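This tail of the log shows the reported behavior directly: with qwen3:14b-16k idle and 8.1 GiB of VRAM still reported free, the scheduler chose to expire it ("resetting model to expire immediately to make room") before loading qwen2.5-coder:1.5b-gpu, whose own memory estimate was only 2.7 GiB. Note also the two "LogValue panicked" stack traces emitted during the unload, which look like a separate logging bug in server/sched.go. The GET /api/ps requests above are the standard way to confirm which models remain resident after such a sequence; either of the following works against the default host:

```text
ollama ps
curl http://127.0.0.1:11434/api/ps
```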

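For what it's worth, the KV-cache sizes in both load sequences are consistent with the usual formula for a q8_0-quantized cache (34-byte blocks of 32 elements, i.e. 1.0625 bytes per element): n_layer × n_ctx × (n_embd_k_gqa + n_embd_v_gqa) × 1.0625 B.

```text
qwen3:14b-16k      : 40 × 16384 × (1024 + 1024) × 1.0625 B = 1360.00 MiB
qwen2.5-coder:1.5b : 28 × 32768 × (256 + 256)   × 1.0625 B =  476.00 MiB
```

Both match the "CUDA0 KV buffer size" lines above exactly, so the unload does not appear to be caused by a mis-sized cache.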

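For reproduction: the custom model named qwen3:14b-16k in the log above would have been created from a Modelfile along these lines (the FROM tag here is an assumption, not the reporter's exact file), then built with `ollama create qwen3:14b-16k -f Modelfile`:

```text
# hypothetical Modelfile for qwen3:14b-16k (base tag assumed)
FROM qwen3:14b
PARAMETER num_ctx 16384
PARAMETER num_gpu 999
```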
<!-- gh-comment-id:2887131873 --> @NGC13009 commented on GitHub (May 16, 2025):

I just ran several more rounds of testing and found that when the context length is configured through environment variables, there is no issue. But if I use a Modelfile to define a custom model with its own context length, then whenever that model runs, the other models get stopped. (I also set num_gpu manually.)

```text
PARAMETER num_gpu 999
PARAMETER num_ctx 16384
```

So I think this is a bug involving custom models: if the context length (or possibly num_gpu) is specified in the model's configuration file, it can trigger this issue.

Here is the Ollama log with debug mode enabled:

```log
[app time] 2025 - 05 - 17 00:03:03 ---
[app info] ENV: {'ALLUSERSPROFILE': 'C:\\ProgramData', 'APPDATA': 'C:\\Users\\ngc13\\AppData\\Roaming', 'CHOCOLATEYINSTALL': 'C:\\ProgramData\\chocolatey', 'COMMONPROGRAMFILES': 'C:\\Program Files\\Common Files', 'COMMONPROGRAMFILES(X86)': 'C:\\Program Files (x86)\\Common Files', 'COMMONPROGRAMW6432': 'C:\\Program Files\\Common Files', 'COMPUTERNAME': 'LAPTOP-AYJL9', 'COMSPEC': 'C:\\WINDOWS\\system32\\cmd.exe', 'CUDA_HOME': 'C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1', 'CUDA_PATH': 'C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1', 'CUDA_PATH_V12_1': 'C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1', 'C_INCLUDEDE_PATH': 'C:\\msys2\\mingw64\\include', 'DRIVERDATA': 'C:\\Windows\\System32\\Drivers\\DriverData', 'EFC_28168_2283032206': '1', 'EFC_28168_2775293581': '1', 'EFC_28168_3789132940': '1', 'FPS_BROWSER_APP_PROFILE_STRING': 'Internet Explorer', 'FPS_BROWSER_USER_PROFILE_STRING': 'Default', 'GOPATH': 'C:\\Users\\ngc13\\go', 'HOMEDRIVE': 'C:', 'HOMEPATH': '\\Users\\ngc13', 'INCLUDE': 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2022\\BuildTools\\VC\\Tools\\MSVC\\14.36.32532\\include;C:\\Program Files (x86)\\Windows Kits\\10\\Include\\10.0.22000.0\\shared;C:\\Program Files (x86)\\Windows Kits\\10\\Include\\10.0.22000.0\\ucrt;C:\\Program Files (x86)\\Windows Kits\\10\\Include\\10.0.22000.0\\um;C:\\Program Files (x86)\\Windows Kits\\10\\Include\\10.0.22000.0\\winrt;', 'LIB': 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2022\\BuildTools\\VC\\Tools\\MSVC\\14.36.32532\\lib\\x64;C:\\Program Files (x86)\\Windows Kits\\10\\Include\\10.0.22000.0\\ucrt\\x64;C:\\Program Files (x86)\\Windows Kits\\10\\Include\\10.0.22000.0\\um\\x64;C:\\Program Files (x86)\\Windows Kits\\10\\Lib\\10.0.22621.0\\um\\x64;C:\\Program Files (x86)\\Windows Kits\\10\\Lib\\10.0.22621.0\\ucrt\\x64;', 'LIBRARY_PATH': 'C:\\msys2\\mingw64\\lib', 'LOCALAPPDATA': 'C:\\Users\\ngc13\\AppData\\Local', 'LOGONSERVER': '\\\\LAPTOP-AYJL9', 'NUMBER_OF_PROCESSORS': '32', 'ONEDRIVE': 'C:\\Users\\ngc13\\OneDrive', 'ONLINESERVICES': 'Online Services', 'OS': 'Windows_NT', 'PATH': 'C:\\Program Files (x86)\\VMware\\VMware Workstation\\bin\\;C:\\Qt\\6.8.1\\mingw_64\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\libnvvp;C:\\Program Files\\Common Files\\Oracle\\Java\\javapath;C:\\Program Files (x86)\\Common Files\\Oracle\\Java\\java8path;C:\\Program Files (x86)\\Common Files\\Oracle\\Java\\javapath;C:\\windows\\system32;C:\\windows;C:\\windows\\System32\\Wbem;C:\\windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\windows\\System32\\OpenSSH\\;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Users\\Administrator\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Program Files\\HP\\OMEN-Broadcast\\Common;C:\\Program Files\\Microsoft VS Code\\bin;C:\\Program Files\\MATLAB\\R2024b\\runtime\\win64;C:\\Program Files\\MATLAB\\R2024b\\bin;C:\\ffmpeg\\bin;C:\\Program Files (x86)\\Wolfram Research\\WolframScript\\;C:\\Program Files\\dotnet\\;C:\\msys2;C:\\texlive\\2023\\bin\\windows;C:\\Program Files (x86)\\GnuPG\\bin;C:\\msys2\\mingw64\\bin;C:\\miniconda;C:\\miniconda\\Scripts;C:\\miniconda\\Library\\bin;C:\\Program Files (x86)\\Microsoft Visual Studio\\2022\\BuildTools\\VC\\Tools\\MSVC\\14.36.32532\\bin\\Hostx64\\x64;C:\\Program Files (x86)\\MATLAB\\MATLAB Runtime\\v851\\runtime\\win32;C:\\Program Files\\Git\\cmd;C:\\application\\syspath;C:\\Program Files\\nodejs\\;C:\\ProgramData\\chocolatey\\bin;C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2025.1.0\\;C:\\Program Files\\NVIDIA Corporation\\Nsight Systems 2025.1.1\\target-windows-x64;C:\\WINDOWS\\system32;C:\\WINDOWS;C:\\WINDOWS\\System32\\Wbem;C:\\WINDOWS\\System32\\WindowsPowerShell\\v1.0\\;C:\\WINDOWS\\System32\\OpenSSH\\;C:\\application\\ollama\\OLLAMA_FILE;C:\\Program Files\\Go\\bin;C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR;C:\\Users\\ngc13\\scoop\\shims;C:\\Users\\ngc13\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\ngc13\\AppData\\Local\\Programs\\Ollama;C:\\Users\\ngc13\\AppData\\Roaming\\npm;C:\\Users\\ngc13\\go\\bin', 'PATHEXT': '.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC', 'PLATFORMCODE': 'M7', 'PROCESSOR_ARCHITECTURE': 'AMD64', 'PROCESSOR_IDENTIFIER': 'Intel64 Family 6 Model 183 Stepping 1, GenuineIntel', 'PROCESSOR_LEVEL': '6', 'PROCESSOR_REVISION': 'b701', 'PROGRAMDATA': 'C:\\ProgramData', 'PROGRAMFILES': 'C:\\Program Files', 'PROGRAMFILES(X86)': 'C:\\Program Files (x86)', 'PROGRAMW6432': 'C:\\Program Files', 'PSMODULEPATH': '%ProgramFiles%\\WindowsPowerShell\\Modules;C:\\WINDOWS\\system32\\WindowsPowerShell\\v1.0\\Modules', 'PUBLIC': 'C:\\Users\\Public', 'REGIONCODE': 'APJ', 'SESSIONNAME': 'Console', 'SYSTEMDRIVE': 'C:', 'SYSTEMROOT': 'C:\\WINDOWS', 'TEMP': 'C:\\Users\\ngc13\\AppData\\Local\\Temp', 'TMP': 'C:\\Users\\ngc13\\AppData\\Local\\Temp', 'USERDOMAIN': 'LAPTOP-AYJL9', 'USERDOMAIN_ROAMINGPROFILE': 'LAPTOP-AYJL9', 'USERNAME': 'ngc13', 'USERPROFILE': 'C:\\Users\\ngc13', 'WINDIR': 'C:\\WINDOWS', 'ZES_ENABLE_SYSMAN': '1', '_PYI_ARCHIVE_FILE': 'E:\\project_file\\limitless\\ollama-launcher\\dist\\ollama_launcher\\ollama_launcher.exe', '_PYI_PARENT_PROCESS_LEVEL': '1', '__COMPAT_LAYER': 'DetectorsAppHealth', 'TCL_LIBRARY': 'E:\\project_file\\limitless\\ollama-launcher\\dist\\ollama_launcher\\_internal\\_tcl_data', 'TK_LIBRARY': 'E:\\project_file\\limitless\\ollama-launcher\\dist\\ollama_launcher\\_internal\\_tk_data', 'OLLAMA_MODELS': 'E:/LLM/ollama_models', 'OLLAMA_TMPDIR': 'E:/LLM/ollama_models/temp', 'OLLAMA_HOST': '127.0.0.1:11434', 'OLLAMA_ORIGINS': '*', 'OLLAMA_CONTEXT_LENGTH': '32768', 'OLLAMA_KV_CACHE_TYPE': 'q8_0', 'OLLAMA_KEEP_ALIVE': '-1', 'OLLAMA_MAX_QUEUE': '512', 'OLLAMA_NUM_PARALLEL': '1', 'OLLAMA_MAX_LOADED_MODELS': '3', 'OLLAMA_ENABLE_CUDA': '1', 'CUDA_VISIBLE_DEVICES': '0', 'OLLAMA_FLASH_ATTENTION': '1', 'OLLAMA_USE_MLOCK': '1', 'OLLAMA_MULTIUSER_CACHE': '0', 'OLLAMA_INTEL_GPU': '0', 'OLLAMA_DEBUG': '1'}
[app info] ollama_dir: C:/application/ollama/OLLAMA_FILE
[app info] Starting Ollama Server...
[app info] Status: Ollama server running (PID: 29928)
[app time] 2025 - 05 - 17 00:03:04 --- ollama server started.
2025/05/17 00:03:04 routes.go:1233: INFO server config env="map[CUDA_VISIBLE_DEVICES:0 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:32768 OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:2562047h47m16.854775807s OLLAMA_KV_CACHE_TYPE:q8_0 OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:3 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:E:/LLM/ollama_models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[* http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-05-17T00:03:04.967+08:00 level=INFO source=images.go:463 msg="total blobs: 27"
time=2025-05-17T00:03:04.968+08:00 level=INFO source=images.go:470 msg="total unused blobs removed: 0"
time=2025-05-17T00:03:04.968+08:00 level=INFO source=routes.go:1300 msg="Listening on 127.0.0.1:11434 (version 0.6.8)"
time=2025-05-17T00:03:04.968+08:00 level=DEBUG source=sched.go:107 msg="starting llm scheduler"
time=2025-05-17T00:03:04.968+08:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-05-17T00:03:04.968+08:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-05-17T00:03:04.968+08:00 level=INFO source=gpu_windows.go:183 msg="efficiency cores detected" maxEfficiencyClass=1
time=2025-05-17T00:03:04.968+08:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=24 efficiency=16 threads=32
time=2025-05-17T00:03:04.968+08:00 level=DEBUG source=gpu.go:98 msg="searching for GPU discovery libraries for NVIDIA"
time=2025-05-17T00:03:04.968+08:00 level=DEBUG source=gpu.go:501 msg="Searching for GPU library" name=nvml.dll
time=2025-05-17T00:03:04.968+08:00 level=DEBUG source=gpu.go:525 msg="gpu library search" globs="[C:\\application\\ollama\\OLLAMA_FILE\\lib\\ollama\\nvml.dll C:\\Program Files (x86)\\VMware\\VMware Workstation\\bin\\nvml.dll C:\\Qt\\6.8.1\\mingw_64\\bin\\nvml.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\bin\\nvml.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\libnvvp\\nvml.dll C:\\Program Files\\Common Files\\Oracle\\Java\\javapath\\nvml.dll C:\\Program Files (x86)\\Common Files\\Oracle\\Java\\java8path\\nvml.dll C:\\Program Files (x86)\\Common Files\\Oracle\\Java\\javapath\\nvml.dll C:\\windows\\system32\\nvml.dll C:\\windows\\nvml.dll C:\\windows\\System32\\Wbem\\nvml.dll C:\\windows\\System32\\WindowsPowerShell\\v1.0\\nvml.dll C:\\windows\\System32\\OpenSSH\\nvml.dll C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvml.dll C:\\Users\\Administrator\\AppData\\Local\\Microsoft\\WindowsApps\\nvml.dll C:\\Program Files\\HP\\OMEN-Broadcast\\Common\\nvml.dll C:\\Program Files\\Microsoft VS Code\\bin\\nvml.dll C:\\Program Files\\MATLAB\\R2024b\\runtime\\win64\\nvml.dll C:\\Program Files\\MATLAB\\R2024b\\bin\\nvml.dll C:\\ffmpeg\\bin\\nvml.dll C:\\Program Files (x86)\\Wolfram Research\\WolframScript\\nvml.dll C:\\Program Files\\dotnet\\nvml.dll C:\\msys2\\nvml.dll C:\\texlive\\2023\\bin\\windows\\nvml.dll C:\\Program Files (x86)\\GnuPG\\bin\\nvml.dll C:\\msys2\\mingw64\\bin\\nvml.dll C:\\miniconda\\nvml.dll C:\\miniconda\\Scripts\\nvml.dll C:\\miniconda\\Library\\bin\\nvml.dll C:\\Program Files (x86)\\Microsoft Visual Studio\\2022\\BuildTools\\VC\\Tools\\MSVC\\14.36.32532\\bin\\Hostx64\\x64\\nvml.dll C:\\Program Files (x86)\\MATLAB\\MATLAB Runtime\\v851\\runtime\\win32\\nvml.dll C:\\Program Files\\Git\\cmd\\nvml.dll C:\\application\\syspath\\nvml.dll C:\\Program Files\\nodejs\\nvml.dll C:\\ProgramData\\chocolatey\\bin\\nvml.dll C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2025.1.0\\nvml.dll C:\\Program Files\\NVIDIA Corporation\\Nsight Systems 2025.1.1\\target-windows-x64\\nvml.dll C:\\WINDOWS\\system32\\nvml.dll C:\\WINDOWS\\nvml.dll C:\\WINDOWS\\System32\\Wbem\\nvml.dll C:\\WINDOWS\\System32\\WindowsPowerShell\\v1.0\\nvml.dll C:\\WINDOWS\\System32\\OpenSSH\\nvml.dll C:\\application\\ollama\\OLLAMA_FILE\\nvml.dll C:\\Program Files\\Go\\bin\\nvml.dll C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR\\nvml.dll C:\\Users\\ngc13\\scoop\\shims\\nvml.dll C:\\Users\\ngc13\\AppData\\Local\\Microsoft\\WindowsApps\\nvml.dll C:\\Users\\ngc13\\AppData\\Local\\Programs\\Ollama\\nvml.dll C:\\Users\\ngc13\\AppData\\Roaming\\npm\\nvml.dll C:\\Users\\ngc13\\go\\bin\\nvml.dll c:\\Windows\\System32\\nvml.dll]"
time=2025-05-17T00:03:04.968+08:00 level=DEBUG source=gpu.go:529 msg="skipping PhysX cuda library path" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvml.dll"
time=2025-05-17T00:03:04.969+08:00 level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths="[C:\\windows\\system32\\nvml.dll C:\\WINDOWS\\system32\\nvml.dll c:\\Windows\\System32\\nvml.dll]"
time=2025-05-17T00:03:04.984+08:00 level=DEBUG source=gpu.go:111 msg="nvidia-ml loaded" library=C:\windows\system32\nvml.dll
time=2025-05-17T00:03:04.984+08:00 level=DEBUG source=gpu.go:501 msg="Searching for GPU library" name=nvcuda.dll
time=2025-05-17T00:03:04.984+08:00 level=DEBUG source=gpu.go:525 msg="gpu library search" globs="[C:\\application\\ollama\\OLLAMA_FILE\\lib\\ollama\\nvcuda.dll C:\\Program Files (x86)\\VMware\\VMware Workstation\\bin\\nvcuda.dll C:\\Qt\\6.8.1\\mingw_64\\bin\\nvcuda.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\bin\\nvcuda.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\libnvvp\\nvcuda.dll C:\\Program Files\\Common Files\\Oracle\\Java\\javapath\\nvcuda.dll C:\\Program Files (x86)\\Common Files\\Oracle\\Java\\java8path\\nvcuda.dll C:\\Program Files (x86)\\Common Files\\Oracle\\Java\\javapath\\nvcuda.dll C:\\windows\\system32\\nvcuda.dll C:\\windows\\nvcuda.dll C:\\windows\\System32\\Wbem\\nvcuda.dll C:\\windows\\System32\\WindowsPowerShell\\v1.0\\nvcuda.dll C:\\windows\\System32\\OpenSSH\\nvcuda.dll C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvcuda.dll C:\\Users\\Administrator\\AppData\\Local\\Microsoft\\WindowsApps\\nvcuda.dll C:\\Program Files\\HP\\OMEN-Broadcast\\Common\\nvcuda.dll C:\\Program Files\\Microsoft VS Code\\bin\\nvcuda.dll C:\\Program Files\\MATLAB\\R2024b\\runtime\\win64\\nvcuda.dll C:\\Program Files\\MATLAB\\R2024b\\bin\\nvcuda.dll C:\\ffmpeg\\bin\\nvcuda.dll C:\\Program Files (x86)\\Wolfram Research\\WolframScript\\nvcuda.dll C:\\Program Files\\dotnet\\nvcuda.dll C:\\msys2\\nvcuda.dll C:\\texlive\\2023\\bin\\windows\\nvcuda.dll C:\\Program Files (x86)\\GnuPG\\bin\\nvcuda.dll C:\\msys2\\mingw64\\bin\\nvcuda.dll C:\\miniconda\\nvcuda.dll C:\\miniconda\\Scripts\\nvcuda.dll C:\\miniconda\\Library\\bin\\nvcuda.dll C:\\Program Files (x86)\\Microsoft Visual Studio\\2022\\BuildTools\\VC\\Tools\\MSVC\\14.36.32532\\bin\\Hostx64\\x64\\nvcuda.dll C:\\Program Files (x86)\\MATLAB\\MATLAB Runtime\\v851\\runtime\\win32\\nvcuda.dll C:\\Program Files\\Git\\cmd\\nvcuda.dll C:\\application\\syspath\\nvcuda.dll C:\\Program Files\\nodejs\\nvcuda.dll C:\\ProgramData\\chocolatey\\bin\\nvcuda.dll C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2025.1.0\\nvcuda.dll C:\\Program Files\\NVIDIA Corporation\\Nsight Systems 2025.1.1\\target-windows-x64\\nvcuda.dll C:\\WINDOWS\\system32\\nvcuda.dll C:\\WINDOWS\\nvcuda.dll C:\\WINDOWS\\System32\\Wbem\\nvcuda.dll C:\\WINDOWS\\System32\\WindowsPowerShell\\v1.0\\nvcuda.dll C:\\WINDOWS\\System32\\OpenSSH\\nvcuda.dll C:\\application\\ollama\\OLLAMA_FILE\\nvcuda.dll C:\\Program Files\\Go\\bin\\nvcuda.dll C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR\\nvcuda.dll C:\\Users\\ngc13\\scoop\\shims\\nvcuda.dll C:\\Users\\ngc13\\AppData\\Local\\Microsoft\\WindowsApps\\nvcuda.dll C:\\Users\\ngc13\\AppData\\Local\\Programs\\Ollama\\nvcuda.dll C:\\Users\\ngc13\\AppData\\Roaming\\npm\\nvcuda.dll C:\\Users\\ngc13\\go\\bin\\nvcuda.dll c:\\windows\\system*\\nvcuda.dll]"
time=2025-05-17T00:03:04.984+08:00 level=DEBUG source=gpu.go:529 msg="skipping PhysX cuda library path" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvcuda.dll"
time=2025-05-17T00:03:04.985+08:00 level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths="[C:\\windows\\system32\\nvcuda.dll C:\\WINDOWS\\system32\\nvcuda.dll]"
initializing C:\windows\system32\nvcuda.dll
dlsym: cuInit - 00007FFFE76B1F80
dlsym: cuDriverGetVersion - 00007FFFE76B2020
dlsym: cuDeviceGetCount - 00007FFFE76B2816
dlsym: cuDeviceGet - 00007FFFE76B2810
dlsym: cuDeviceGetAttribute - 00007FFFE76B2170
dlsym: cuDeviceGetUuid - 00007FFFE76B2822
dlsym: cuDeviceGetName - 00007FFFE76B281C
dlsym: cuCtxCreate_v3 - 00007FFFE76B2894
dlsym: cuMemGetInfo_v2 - 00007FFFE76B2996
dlsym: cuCtxDestroy - 00007FFFE76B28A6
calling cuInit
calling cuDriverGetVersion
raw version 0x2f3a
CUDA driver version: 12.9
calling cuDeviceGetCount
device count 1
time=2025-05-17T00:03:04.998+08:00 level=DEBUG source=gpu.go:125 msg="detected GPUs" count=1 library=C:\windows\system32\nvcuda.dll
[GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a] CUDA totalMem 20479 mb
[GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a] CUDA freeMem 19273 mb
[GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a] Compute Capability 8.6
time=2025-05-17T00:03:05.149+08:00 level=DEBUG source=amd_windows.go:34 msg="unable to load amdhip64_6.dll, please make sure to upgrade to the latest amd driver: The specified module could not be found."
releasing cuda driver library
releasing nvml library
time=2025-05-17T00:03:05.150+08:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a library=cuda variant=v12 compute=8.6 driver=12.9 name="NVIDIA GeForce RTX 3080" total="20.0 GiB" available="18.8 GiB"
[GIN] 2025/05/17 - 00:03:09 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/05/17 - 00:03:09 | 200 | 532.6µs | 127.0.0.1 | GET "/api/ps"
[GIN] 2025/05/17 - 00:03:18 | 200 | 0s | 127.0.0.1 | HEAD "/"
time=2025-05-17T00:03:18.473+08:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-17T00:03:18.489+08:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
[GIN] 2025/05/17 - 00:03:18 | 200 | 27.4373ms | 127.0.0.1 | POST "/api/show"
time=2025-05-17T00:03:18.508+08:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-17T00:03:18.508+08:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="63.7 GiB" before.free="46.2 GiB" before.free_swap="43.3 GiB" now.total="63.7 GiB" now.free="46.2 GiB" now.free_swap="43.4 GiB"
time=2025-05-17T00:03:18.631+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a name="NVIDIA GeForce RTX 3080" overhead="0 B" before.total="20.0 GiB" before.free="18.8 GiB" now.total="20.0 GiB" now.free="18.1 GiB" now.used="1.9 GiB"
releasing nvml library
time=2025-05-17T00:03:18.640+08:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-17T00:03:18.648+08:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-17T00:03:18.648+08:00 level=DEBUG source=sched.go:227 msg="loading first model" model=E:\LLM\ollama_models\blobs\sha256-a8cc1361f3145dc01f6d77c6c82c9116b9ffe3c97b34716fe20418455876c40e
time=2025-05-17T00:03:18.648+08:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[18.1 GiB]"
time=2025-05-17T00:03:18.648+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3.vision.block_count default=0
time=2025-05-17T00:03:18.648+08:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="63.7 GiB" before.free="46.2 GiB" before.free_swap="43.4 GiB" now.total="63.7 GiB" now.free="46.2 GiB" now.free_swap="43.4 GiB"
time=2025-05-17T00:03:18.677+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a name="NVIDIA GeForce RTX 3080" overhead="0 B" before.total="20.0 GiB" before.free="18.1 GiB" now.total="20.0 GiB" now.free="18.1 GiB" now.used="1.9 GiB"
releasing nvml library
time=2025-05-17T00:03:18.680+08:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[18.1 GiB]"
time=2025-05-17T00:03:18.680+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3.vision.block_count default=0
time=2025-05-17T00:03:18.680+08:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="63.7 GiB" before.free="46.2 GiB" before.free_swap="43.4 GiB" now.total="63.7 GiB" now.free="46.2 GiB" now.free_swap="43.4 GiB"
time=2025-05-17T00:03:18.724+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a name="NVIDIA GeForce RTX 3080" overhead="0 B" before.total="20.0 GiB" before.free="18.1 GiB" now.total="20.0 GiB" now.free="18.1 GiB" now.used="1.9 GiB"
releasing nvml library
time=2025-05-17T00:03:18.725+08:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="63.7 GiB" before.free="46.2 GiB" before.free_swap="43.4 GiB" now.total="63.7 GiB" now.free="46.2 GiB" now.free_swap="43.4 GiB"
time=2025-05-17T00:03:18.754+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a name="NVIDIA GeForce RTX 3080" overhead="0 B" before.total="20.0 GiB" before.free="18.1 GiB" now.total="20.0 GiB" now.free="18.1 GiB" now.used="1.9 GiB"
releasing nvml library
time=2025-05-17T00:03:18.755+08:00 level=INFO source=server.go:106 msg="system memory" total="63.7 GiB" free="46.2 GiB" free_swap="43.4 GiB"
time=2025-05-17T00:03:18.755+08:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[18.1 GiB]"
time=2025-05-17T00:03:18.755+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3.vision.block_count default=0
time=2025-05-17T00:03:18.755+08:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="63.7 GiB" before.free="46.2 GiB" before.free_swap="43.4 GiB" now.total="63.7 GiB" now.free="46.2 GiB" now.free_swap="43.4 GiB"
time=2025-05-17T00:03:18.785+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a name="NVIDIA GeForce RTX 3080" overhead="0 B" before.total="20.0 GiB" before.free="18.1 GiB" now.total="20.0 GiB" now.free="18.1 GiB" now.used="1.9 GiB"
releasing nvml library
time=2025-05-17T00:03:18.786+08:00 level=INFO source=server.go:139 msg=offload library=cuda layers.requested=999 layers.model=41 layers.offload=41 layers.split="" memory.available="[18.1 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.2 GiB" memory.required.partial="11.2 GiB" memory.required.kv="1.2 GiB" memory.required.allocations="[11.2 GiB]" memory.weights.total="8.2 GiB" memory.weights.repeating="7.6 GiB" memory.weights.nonrepeating="608.6 MiB" memory.graph.full="1.0 GiB" memory.graph.partial="1.0 GiB"
time=2025-05-17T00:03:18.786+08:00 level=INFO source=server.go:186 msg="enabling flash attention"
time=2025-05-17T00:03:18.787+08:00 level=DEBUG source=server.go:263 msg="compatible gpu libraries" compatible="[cuda_v12 cuda_v11]"
llama_model_loader: loaded meta data with 27 key-value pairs and 443 tensors from E:\LLM\ollama_models\blobs\sha256-a8cc1361f3145dc01f6d77c6c82c9116b9ffe3c97b34716fe20418455876c40e (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3 14B
llama_model_loader: - kv 3: general.basename str = Qwen3
llama_model_loader: - kv 4: general.size_label str = 14B
llama_model_loader: - kv 5: qwen3.block_count u32 = 40
llama_model_loader: - kv 6: qwen3.context_length u32 = 40960
llama_model_loader: - kv 7: qwen3.embedding_length u32 = 5120
llama_model_loader: - kv 8: qwen3.feed_forward_length u32 = 17408
llama_model_loader: - kv 9: qwen3.attention.head_count u32 = 40
llama_model_loader: - kv 10: qwen3.attention.head_count_kv u32 = 8
llama_model_loader: - kv 11: qwen3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 12: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 13: qwen3.attention.key_length u32 = 128
llama_model_loader: - kv 14: qwen3.attention.value_length u32 = 128
llama_model_loader: - kv 15: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 16: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 24: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - kv 26: general.file_type u32 = 15
llama_model_loader: - type f32: 161 tensors
llama_model_loader: - type f16: 40 tensors
llama_model_loader: - type q4_K: 221 tensors
llama_model_loader: - type q6_K: 21 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 8.63 GiB (5.02 BPW)
init_tokenizer: initializing tokenizer for type 2
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3
print_info: vocab_only = 1
print_info: model type = ?B
print_info: model params = 14.77 B
print_info: general.name = Qwen3 14B
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-05-17T00:03:18.924+08:00 level=DEBUG source=server.go:339 msg="adding gpu library" path=C:\application\ollama\OLLAMA_FILE\lib\ollama\cuda_v12
time=2025-05-17T00:03:18.924+08:00 level=DEBUG source=server.go:346 msg="adding gpu dependency paths" paths=[C:\application\ollama\OLLAMA_FILE\lib\ollama\cuda_v12]
time=2025-05-17T00:03:18.924+08:00 level=INFO source=server.go:410 msg="starting llama server" cmd="C:\\application\\ollama\\OLLAMA_FILE\\ollama.exe runner --model E:\\LLM\\ollama_models\\blobs\\sha256-a8cc1361f3145dc01f6d77c6c82c9116b9ffe3c97b34716fe20418455876c40e --ctx-size 16384 --batch-size 512 --n-gpu-layers 999 --verbose --threads 8 --flash-attn --kv-cache-type q8_0 --no-mmap --parallel 1 --port 59210"
time=2025-05-17T00:03:18.924+08:00 level=DEBUG source=server.go:429 msg=subprocess environment="[CUDA_HOME=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1 CUDA_PATH=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1 CUDA_PATH_V12_1=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1 CUDA_VISIBLE_DEVICES=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a OLLAMA_CONTEXT_LENGTH=32768 OLLAMA_DEBUG=1 OLLAMA_ENABLE_CUDA=1 OLLAMA_FLASH_ATTENTION=1 OLLAMA_HOST=127.0.0.1:11434 OLLAMA_INTEL_GPU=0 OLLAMA_KEEP_ALIVE=-1 OLLAMA_KV_CACHE_TYPE=q8_0 OLLAMA_MAX_LOADED_MODELS=3 OLLAMA_MAX_QUEUE=512 OLLAMA_MODELS=E:/LLM/ollama_models OLLAMA_MULTIUSER_CACHE=0 OLLAMA_NUM_PARALLEL=1 OLLAMA_ORIGINS=* OLLAMA_TMPDIR=E:/LLM/ollama_models/temp OLLAMA_USE_MLOCK=1 PATH=C:\\application\\ollama\\OLLAMA_FILE\\lib\\ollama\\cuda_v12;C:\\application\\ollama\\OLLAMA_FILE\\lib\\ollama\\cuda_v12;C:\\application\\ollama\\OLLAMA_FILE\\lib\\ollama;C:\\Program Files (x86)\\VMware\\VMware Workstation\\bin\\;C:\\Qt\\6.8.1\\mingw_64\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\libnvvp;C:\\Program Files\\Common Files\\Oracle\\Java\\javapath;C:\\Program Files (x86)\\Common Files\\Oracle\\Java\\java8path;C:\\Program Files (x86)\\Common Files\\Oracle\\Java\\javapath;C:\\windows\\system32;C:\\windows;C:\\windows\\System32\\Wbem;C:\\windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\windows\\System32\\OpenSSH\\;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Users\\Administrator\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Program Files\\HP\\OMEN-Broadcast\\Common;C:\\Program Files\\Microsoft VS Code\\bin;C:\\Program Files\\MATLAB\\R2024b\\runtime\\win64;C:\\Program Files\\MATLAB\\R2024b\\bin;C:\\ffmpeg\\bin;C:\\Program Files (x86)\\Wolfram Research\\WolframScript\\;C:\\Program Files\\dotnet\\;C:\\msys2;C:\\texlive\\2023\\bin\\windows;C:\\Program Files (x86)\\GnuPG\\bin;C:\\msys2\\mingw64\\bin;C:\\miniconda;C:\\miniconda\\Scripts;C:\\miniconda\\Library\\bin;C:\\Program Files (x86)\\Microsoft Visual Studio\\2022\\BuildTools\\VC\\Tools\\MSVC\\14.36.32532\\bin\\Hostx64\\x64;C:\\Program Files (x86)\\MATLAB\\MATLAB Runtime\\v851\\runtime\\win32;C:\\Program Files\\Git\\cmd;C:\\application\\syspath;C:\\Program Files\\nodejs\\;C:\\ProgramData\\chocolatey\\bin;C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2025.1.0\\;C:\\Program Files\\NVIDIA Corporation\\Nsight Systems 2025.1.1\\target-windows-x64;C:\\WINDOWS\\system32;C:\\WINDOWS;C:\\WINDOWS\\System32\\Wbem;C:\\WINDOWS\\System32\\WindowsPowerShell\\v1.0\\;C:\\WINDOWS\\System32\\OpenSSH\\;C:\\application\\ollama\\OLLAMA_FILE;C:\\Program Files\\Go\\bin;C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR;C:\\Users\\ngc13\\scoop\\shims;C:\\Users\\ngc13\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\ngc13\\AppData\\Local\\Programs\\Ollama;C:\\Users\\ngc13\\AppData\\Roaming\\npm;C:\\Users\\ngc13\\go\\bin;C:\\application\\ollama\\OLLAMA_FILE\\lib\\ollama OLLAMA_LIBRARY_PATH=C:\\application\\ollama\\OLLAMA_FILE\\lib\\ollama;C:\\application\\ollama\\OLLAMA_FILE\\lib\\ollama\\cuda_v12]"
time=2025-05-17T00:03:18.928+08:00 level=INFO source=sched.go:452 msg="loaded runners" count=1
time=2025-05-17T00:03:18.928+08:00 level=INFO source=server.go:589 msg="waiting for llama runner to start responding"
time=2025-05-17T00:03:18.928+08:00 level=INFO source=server.go:623 msg="waiting for server to become available" status="llm server error"
time=2025-05-17T00:03:18.950+08:00 level=INFO source=runner.go:853 msg="starting go runner"
time=2025-05-17T00:03:18.954+08:00 level=DEBUG source=ggml.go:93 msg="ggml backend load all from path" path=C:\application\ollama\OLLAMA_FILE\lib\ollama
load_backend: loaded CPU backend from C:\application\ollama\OLLAMA_FILE\lib\ollama\ggml-cpu-alderlake.dll
time=2025-05-17T00:03:18.968+08:00 level=DEBUG source=ggml.go:93 msg="ggml backend load all from path" path=C:\application\ollama\OLLAMA_FILE\lib\ollama\cuda_v12
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from C:\application\ollama\OLLAMA_FILE\lib\ollama\cuda_v12\ggml-cuda.dll
time=2025-05-17T00:03:19.041+08:00 level=INFO source=ggml.go:103 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-05-17T00:03:19.043+08:00 level=INFO source=runner.go:913 msg="Server listening on 127.0.0.1:59210"
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3080) - 19273 MiB free
llama_model_loader: loaded meta data with 27 key-value pairs and 443 tensors from E:\LLM\ollama_models\blobs\sha256-a8cc1361f3145dc01f6d77c6c82c9116b9ffe3c97b34716fe20418455876c40e (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3 14B
llama_model_loader: - kv 3: general.basename str = Qwen3
llama_model_loader: - kv 4: general.size_label str = 14B
llama_model_loader: - kv 5: qwen3.block_count u32 = 40
llama_model_loader: - kv 6: qwen3.context_length u32 = 40960
llama_model_loader: - kv 7: qwen3.embedding_length u32 = 5120
llama_model_loader: - kv 8: qwen3.feed_forward_length u32 = 17408
llama_model_loader: - kv 9: qwen3.attention.head_count u32 = 40
llama_model_loader: - kv 10: qwen3.attention.head_count_kv u32 = 8
llama_model_loader: - kv 11: qwen3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 12: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 13: qwen3.attention.key_length u32 = 128
llama_model_loader: - kv 14: qwen3.attention.value_length u32 = 128
llama_model_loader: - kv 15: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 16: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
time=2025-05-17T00:03:19.180+08:00 level=INFO source=server.go:623 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv 19: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 24: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - kv 26: general.file_type u32 = 15
llama_model_loader: - type f32: 161 tensors
llama_model_loader: - type f16: 40 tensors
llama_model_loader: - type q4_K: 221 tensors
llama_model_loader: - type q6_K: 21 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 8.63 GiB (5.02 BPW)
init_tokenizer: initializing tokenizer for type 2
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3
print_info: vocab_only = 0
print_info: n_ctx_train = 40960
print_info: n_embd = 5120
print_info: n_layer = 40
print_info: n_head = 40
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_swa_pattern = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 5
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 17408
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 40960
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 14B
print_info: model params = 14.77 B
print_info: general.name = Qwen3 14B
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: layer 0 assigned to device CUDA0, is_swa = 0
load_tensors: layer 1 assigned to device CUDA0, is_swa = 0
load_tensors: layer 2 assigned to device CUDA0, is_swa = 0
load_tensors: layer 3 assigned to device CUDA0, is_swa = 0
load_tensors: layer 4 assigned to device CUDA0, is_swa = 0
load_tensors: layer 5 assigned to device CUDA0, is_swa = 0
load_tensors: layer 6 assigned to device CUDA0, is_swa = 0
load_tensors: layer 7 assigned to device CUDA0, is_swa = 0
load_tensors: layer 8 assigned to device CUDA0, is_swa = 0
load_tensors: layer 9 assigned to device CUDA0, is_swa = 0
load_tensors: layer 10 assigned to device CUDA0, is_swa = 0
load_tensors: layer 11 assigned to device CUDA0, is_swa = 0
load_tensors: layer 12 assigned to device CUDA0, is_swa = 0
load_tensors: layer 13 assigned to device CUDA0, is_swa = 0
load_tensors: layer 14 assigned to device CUDA0, is_swa = 0
load_tensors: layer 15 assigned to device CUDA0, is_swa = 0
load_tensors: layer 16 assigned to device CUDA0, is_swa = 0
load_tensors: layer 17 assigned to device CUDA0, is_swa = 0
load_tensors: layer 18 assigned to device CUDA0, is_swa = 0
load_tensors: layer 19 assigned to device CUDA0, is_swa = 0
load_tensors: layer 20 assigned to device CUDA0, is_swa = 0
load_tensors: layer 21 assigned to device CUDA0, is_swa = 0
load_tensors: layer 22 assigned to device CUDA0, is_swa = 0
load_tensors: layer 23 assigned to device CUDA0, is_swa = 0
load_tensors: layer 24 assigned to device CUDA0, is_swa = 0
load_tensors: layer 25 assigned to device CUDA0, is_swa = 0
load_tensors: layer 26 assigned to device CUDA0, is_swa = 0
load_tensors: layer 27 assigned to device CUDA0, is_swa = 0
load_tensors: layer 28 assigned to device CUDA0, is_swa = 0
load_tensors: layer 29 assigned to device CUDA0, is_swa = 0
load_tensors: layer 30 assigned to device CUDA0, is_swa = 0
load_tensors: layer 31 assigned to device CUDA0, is_swa = 0
load_tensors: layer 32 assigned to device CUDA0, is_swa = 0
load_tensors: layer 33 assigned to device CUDA0, is_swa = 0
load_tensors: layer 34 assigned to device CUDA0, is_swa = 0
load_tensors: layer 35 assigned to device CUDA0, is_swa = 0
load_tensors: layer 36 assigned to device CUDA0, is_swa = 0
load_tensors: layer 37 assigned to device CUDA0, is_swa = 0
load_tensors: layer 38 assigned to device CUDA0, is_swa = 0
load_tensors: layer 39 assigned to device CUDA0, is_swa = 0
load_tensors: layer 40 assigned to device CUDA0, is_swa = 0
load_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
load_tensors: offloading 40 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 41/41 layers to GPU
load_tensors: CPU model buffer size = 417.30 MiB
load_tensors: CUDA0 model buffer size = 8423.47 MiB
load_all_data: no device found for buffer type CPU for async uploads
load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0
time=2025-05-17T00:03:19.431+08:00 level=DEBUG source=server.go:634 msg="model load progress 0.05"
time=2025-05-17T00:03:19.932+08:00 level=DEBUG source=server.go:634 msg="model load progress 0.14"
time=2025-05-17T00:03:20.183+08:00 level=DEBUG source=server.go:634 msg="model load progress 0.20"
time=2025-05-17T00:03:20.433+08:00 level=DEBUG source=server.go:634 msg="model load progress 0.26"
time=2025-05-17T00:03:20.685+08:00 level=DEBUG source=server.go:634 msg="model load progress 0.32"
time=2025-05-17T00:03:20.935+08:00 level=DEBUG source=server.go:634 msg="model load progress 0.37"
time=2025-05-17T00:03:21.186+08:00 level=DEBUG source=server.go:634 msg="model load progress 0.43"
time=2025-05-17T00:03:21.436+08:00 level=DEBUG source=server.go:634 msg="model load progress 0.49"
time=2025-05-17T00:03:21.687+08:00 level=DEBUG source=server.go:634 msg="model load progress 0.54"
time=2025-05-17T00:03:21.938+08:00 level=DEBUG source=server.go:634 msg="model load progress 0.61"
time=2025-05-17T00:03:22.188+08:00 level=DEBUG source=server.go:634 msg="model load progress 0.67"
time=2025-05-17T00:03:22.438+08:00 level=DEBUG source=server.go:634 msg="model load progress 0.72"
time=2025-05-17T00:03:22.689+08:00 level=DEBUG source=server.go:634 msg="model load progress 0.78"
time=2025-05-17T00:03:22.940+08:00 level=DEBUG source=server.go:634 msg="model load progress 0.84"
time=2025-05-17T00:03:23.190+08:00 level=DEBUG source=server.go:634 msg="model load progress 0.89"
time=2025-05-17T00:03:23.441+08:00 level=DEBUG source=server.go:634 msg="model load progress 0.96"
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 16384
llama_context: n_ctx_per_seq = 16384
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (16384) < n_ctx_train (40960) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context: CUDA_Host output buffer size = 0.60 MiB
llama_context: n_ctx = 16384
llama_context: n_ctx = 16384 (padded)
init: kv_size = 16384, offload = 1, type_k = 'q8_0', type_v = 'q8_0', n_layer = 40, can_shift = 1
init: layer 0: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 1: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 2: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 3: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 4: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 5: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 6: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 7: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 8: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 9: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 10: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 11: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 12: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 13: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 14: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 15: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 16: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 17: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 18: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 19: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 20: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 21: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 22: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 23: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 24: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 25: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 26: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 27: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 28: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 29: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 30: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 31: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 32: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 33: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 34: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 35: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 36: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 37: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 38: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 39: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: CUDA0 KV buffer size = 1360.00 MiB
llama_context: KV self size = 1360.00 MiB, K (q8_0): 680.00 MiB, V (q8_0): 680.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 65536
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context: reserving graph for n_tokens = 1, n_seqs = 1
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context: CUDA0 compute buffer size = 306.75 MiB
llama_context: CUDA_Host compute buffer size = 42.01 MiB
llama_context: graph nodes = 1367
llama_context: graph splits = 2
time=2025-05-17T00:03:23.692+08:00 level=INFO source=server.go:628 msg="llama runner started in 4.76 seconds"
time=2025-05-17T00:03:23.692+08:00 level=DEBUG source=sched.go:464 msg="finished setting up runner" model=E:\LLM\ollama_models\blobs\sha256-a8cc1361f3145dc01f6d77c6c82c9116b9ffe3c97b34716fe20418455876c40e
[GIN] 2025/05/17 - 00:03:23 | 200 | 5.1957925s | 127.0.0.1 | POST "/api/generate"
time=2025-05-17T00:03:23.692+08:00 level=DEBUG source=sched.go:472 msg="context for request finished"
time=2025-05-17T00:03:23.692+08:00 level=DEBUG source=sched.go:342 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/qwen3:14b-16k runner.inference=cuda runner.devices=1 runner.size="11.2 GiB" runner.vram="11.2 GiB" runner.num_ctx=16384 runner.parallel=1 runner.pid=28284 runner.model=E:\LLM\ollama_models\blobs\sha256-a8cc1361f3145dc01f6d77c6c82c9116b9ffe3c97b34716fe20418455876c40e duration=2562047h47m16.854775807s
time=2025-05-17T00:03:23.692+08:00 level=DEBUG source=sched.go:360 msg="after processing request finished event" runner.name=registry.ollama.ai/library/qwen3:14b-16k runner.inference=cuda runner.devices=1 runner.size="11.2 GiB" runner.vram="11.2 GiB" runner.num_ctx=16384 runner.parallel=1 runner.pid=28284 runner.model=E:\LLM\ollama_models\blobs\sha256-a8cc1361f3145dc01f6d77c6c82c9116b9ffe3c97b34716fe20418455876c40e refCount=0
[GIN] 2025/05/17 - 00:03:30 | 200 | 0s | 127.0.0.1 | HEAD "/"
time=2025-05-17T00:03:30.819+08:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-17T00:03:30.830+08:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
[GIN] 2025/05/17 - 00:03:30 | 200 | 18.9695ms | 127.0.0.1 | POST "/api/show"
time=2025-05-17T00:03:30.845+08:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-17T00:03:30.846+08:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="63.7 GiB" before.free="46.2 GiB" before.free_swap="43.4 GiB" now.total="63.7 GiB" now.free="45.4 GiB" now.free_swap="32.0 GiB"
time=2025-05-17T00:03:30.964+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a name="NVIDIA GeForce RTX 3080" overhead="0 B" before.total="20.0 GiB" before.free="18.1 GiB" now.total="20.0 GiB" now.free="8.1 GiB" now.used="11.9 GiB"
releasing nvml library
time=2025-05-17T00:03:30.978+08:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-17T00:03:30.985+08:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-17T00:03:30.986+08:00 level=DEBUG source=sched.go:506 msg="gpu reported" gpu=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a library=cuda available="8.1 GiB"
time=2025-05-17T00:03:30.986+08:00 level=INFO source=sched.go:517 msg="updated VRAM based on existing loaded models" gpu=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a library=cuda total="20.0 GiB" available="8.1 GiB"
time=2025-05-17T00:03:30.986+08:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[8.1 GiB]"
time=2025-05-17T00:03:30.986+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.vision.block_count default=0
time=2025-05-17T00:03:30.986+08:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="63.7 GiB" before.free="45.4 GiB" before.free_swap="32.0 GiB" now.total="63.7 GiB" now.free="45.4 GiB" now.free_swap="32.0 GiB"
time=2025-05-17T00:03:31.026+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a name="NVIDIA GeForce RTX 3080" overhead="0 B" before.total="20.0 GiB" before.free="8.1 GiB" now.total="20.0 GiB" now.free="8.1 GiB" now.used="11.9 GiB"
releasing nvml library
time=2025-05-17T00:03:31.028+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-05-17T00:03:31.029+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-05-17T00:03:31.029+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-05-17T00:03:31.029+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-05-17T00:03:31.029+08:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[8.1 GiB]"
time=2025-05-17T00:03:31.029+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.vision.block_count default=0
time=2025-05-17T00:03:31.030+08:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="63.7 GiB" before.free="45.4 GiB" before.free_swap="32.0 GiB" now.total="63.7 GiB" now.free="45.4 GiB" now.free_swap="32.0 GiB"
time=2025-05-17T00:03:31.073+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a name="NVIDIA GeForce RTX 3080" overhead="0 B" before.total="20.0 GiB" before.free="8.1 GiB" now.total="20.0 GiB" now.free="8.1 GiB" now.used="11.9 GiB"
releasing nvml library
time=2025-05-17T00:03:31.074+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-05-17T00:03:31.074+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-05-17T00:03:31.074+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-05-17T00:03:31.074+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-05-17T00:03:31.074+08:00 level=DEBUG source=sched.go:824 msg="found an idle runner to unload" runner.name=registry.ollama.ai/library/qwen3:14b-16k runner.inference=cuda runner.devices=1 runner.size="11.2 GiB" runner.vram="11.2 GiB" runner.num_ctx=16384 runner.parallel=1 runner.pid=28284 runner.model=E:\LLM\ollama_models\blobs\sha256-a8cc1361f3145dc01f6d77c6c82c9116b9ffe3c97b34716fe20418455876c40e
time=2025-05-17T00:03:31.074+08:00 level=DEBUG source=sched.go:286 msg="resetting model to expire immediately to make room" runner.name=registry.ollama.ai/library/qwen3:14b-16k runner.inference=cuda runner.devices=1 runner.size="11.2 GiB" runner.vram="11.2 GiB" runner.num_ctx=16384 runner.parallel=1 runner.pid=28284 runner.model=E:\LLM\ollama_models\blobs\sha256-a8cc1361f3145dc01f6d77c6c82c9116b9ffe3c97b34716fe20418455876c40e refCount=0
time=2025-05-17T00:03:31.074+08:00 level=DEBUG source=sched.go:299 msg="waiting for pending requests to complete and unload to occur" modelPath=E:\LLM\ollama_models\blobs\sha256-a8cc1361f3145dc01f6d77c6c82c9116b9ffe3c97b34716fe20418455876c40e
time=2025-05-17T00:03:31.074+08:00 level=DEBUG source=sched.go:363 msg="runner expired event received" runner.name=registry.ollama.ai/library/qwen3:14b-16k runner.inference=cuda runner.devices=1 runner.size="11.2 GiB" runner.vram="11.2 GiB" runner.num_ctx=16384 runner.parallel=1 runner.pid=28284 runner.model=E:\LLM\ollama_models\blobs\sha256-a8cc1361f3145dc01f6d77c6c82c9116b9ffe3c97b34716fe20418455876c40e
time=2025-05-17T00:03:31.074+08:00 level=DEBUG source=sched.go:378 msg="got lock to unload" runner.name=registry.ollama.ai/library/qwen3:14b-16k runner.inference=cuda runner.devices=1 runner.size="11.2 GiB" runner.vram="11.2 GiB" runner.num_ctx=16384 runner.parallel=1 runner.pid=28284 runner.model=E:\LLM\ollama_models\blobs\sha256-a8cc1361f3145dc01f6d77c6c82c9116b9ffe3c97b34716fe20418455876c40e
time=2025-05-17T00:03:31.074+08:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="63.7 GiB" before.free="45.4 GiB" before.free_swap="32.0 GiB" now.total="63.7 GiB" now.free="45.4 GiB" now.free_swap="32.0 GiB"
time=2025-05-17T00:03:31.104+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a name="NVIDIA GeForce RTX 3080" overhead="0 B" before.total="20.0 GiB" before.free="8.1 GiB" now.total="20.0 GiB" now.free="8.1 GiB" now.used="11.9 GiB"
releasing nvml library
time=2025-05-17T00:03:31.121+08:00 level=DEBUG source=server.go:1017 msg="stopping llama server"
time=2025-05-17T00:03:31.121+08:00 level=DEBUG source=server.go:1023 msg="waiting for llama server to exit"
time=2025-05-17T00:03:31.248+08:00 level=DEBUG source=server.go:1027 msg="llama server stopped"
time=2025-05-17T00:03:31.248+08:00 level=DEBUG source=sched.go:383 msg="runner released" runner="LogValue panicked\ncalled from runtime.panicmem (C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/panic.go:262)\ncalled from runtime.sigpanic (C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/signal_windows.go:401)\ncalled from github.com/ollama/ollama/server.(*runnerRef).LogValue (C:/a/ollama/ollama/server/sched.go:694)\ncalled from log/slog.Value.Resolve (C:/hostedtoolcache/windows/go/1.24.0/x64/src/log/slog/value.go:512)\ncalled from log/slog.(*handleState).appendAttr (C:/hostedtoolcache/windows/go/1.24.0/x64/src/log/slog/handler.go:468)\n(rest of stack elided)\n"
time=2025-05-17T00:03:31.355+08:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="63.7 GiB" before.free="45.4 GiB" before.free_swap="32.0 GiB" now.total="63.7 GiB" now.free="46.2 GiB" now.free_swap="43.4 GiB"
time=2025-05-17T00:03:31.397+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a name="NVIDIA GeForce RTX 3080" overhead="0 B" before.total="20.0 GiB" before.free="8.1 GiB" now.total="20.0 GiB" now.free="18.1 GiB" now.used="1.9 GiB"
releasing nvml library
time=2025-05-17T00:03:31.398+08:00 level=DEBUG source=sched.go:668 msg="gpu VRAM free memory converged after 0.32 seconds" model=E:\LLM\ollama_models\blobs\sha256-a8cc1361f3145dc01f6d77c6c82c9116b9ffe3c97b34716fe20418455876c40e
time=2025-05-17T00:03:31.398+08:00 level=DEBUG source=sched.go:387 msg="sending an unloaded event" runner="LogValue panicked\ncalled from runtime.panicmem (C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/panic.go:262)\ncalled from runtime.sigpanic (C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/signal_windows.go:401)\ncalled from github.com/ollama/ollama/server.(*runnerRef).LogValue (C:/a/ollama/ollama/server/sched.go:694)\ncalled from log/slog.Value.Resolve (C:/hostedtoolcache/windows/go/1.24.0/x64/src/log/slog/value.go:512)\ncalled from log/slog.(*handleState).appendAttr (C:/hostedtoolcache/windows/go/1.24.0/x64/src/log/slog/handler.go:468)\n(rest of stack elided)\n"
time=2025-05-17T00:03:31.399+08:00 level=DEBUG source=sched.go:305 msg="unload completed" modelPath=E:\LLM\ollama_models\blobs\sha256-a8cc1361f3145dc01f6d77c6c82c9116b9ffe3c97b34716fe20418455876c40e
time=2025-05-17T00:03:31.399+08:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="63.7 GiB" before.free="46.2 GiB" before.free_swap="43.4 GiB" now.total="63.7 GiB" now.free="46.2 GiB" now.free_swap="43.4 GiB"
time=2025-05-17T00:03:31.428+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a name="NVIDIA GeForce RTX 3080" overhead="0 B" before.total="20.0 GiB" before.free="18.1 GiB" now.total="20.0 GiB" now.free="18.1 GiB" now.used="1.9 GiB"
releasing nvml library
time=2025-05-17T00:03:31.439+08:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-17T00:03:31.451+08:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-17T00:03:31.451+08:00 level=DEBUG source=sched.go:227 msg="loading first model" model=E:\LLM\ollama_models\blobs\sha256-29d8c98fa6b098e200069bfb88b9508dc3e85586d20cba59f8dda9a808165104
time=2025-05-17T00:03:31.451+08:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[18.1 GiB]"
time=2025-05-17T00:03:31.451+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.vision.block_count default=0
time=2025-05-17T00:03:31.451+08:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="63.7 GiB" before.free="46.2 GiB" before.free_swap="43.4 GiB" now.total="63.7 GiB" now.free="46.2 GiB" now.free_swap="43.4 GiB"
time=2025-05-17T00:03:31.490+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a name="NVIDIA GeForce RTX 3080" overhead="0 B" before.total="20.0 GiB" before.free="18.1 GiB" now.total="20.0 GiB" now.free="18.1 GiB" now.used="1.9 GiB"
releasing nvml library
time=2025-05-17T00:03:31.491+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-05-17T00:03:31.491+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-05-17T00:03:31.491+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-05-17T00:03:31.491+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-05-17T00:03:31.491+08:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[18.1 GiB]"
time=2025-05-17T00:03:31.491+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.vision.block_count default=0
time=2025-05-17T00:03:31.491+08:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="63.7 GiB" before.free="46.2 GiB" before.free_swap="43.4 GiB" now.total="63.7 GiB" now.free="46.2 GiB" now.free_swap="43.4 GiB"
time=2025-05-17T00:03:31.521+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a name="NVIDIA GeForce RTX 3080" overhead="0 B" before.total="20.0 GiB" before.free="18.1 GiB" now.total="20.0 GiB" now.free="18.1 GiB" now.used="1.9 GiB"
releasing nvml library
time=2025-05-17T00:03:31.523+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-05-17T00:03:31.523+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-05-17T00:03:31.523+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-05-17T00:03:31.523+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-05-17T00:03:31.523+08:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="63.7 GiB" before.free="46.2 GiB" before.free_swap="43.4 GiB" now.total="63.7 GiB" now.free="46.2 GiB" now.free_swap="43.4 GiB"
time=2025-05-17T00:03:31.552+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a name="NVIDIA GeForce RTX 3080" overhead="0 B" before.total="20.0 GiB" before.free="18.1 GiB" now.total="20.0 GiB" now.free="18.1 GiB" now.used="1.9 GiB"
releasing nvml library
time=2025-05-17T00:03:31.554+08:00 level=INFO source=server.go:106 msg="system memory" total="63.7 GiB" free="46.2 GiB" free_swap="43.4 GiB"
time=2025-05-17T00:03:31.554+08:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[18.1 GiB]"
time=2025-05-17T00:03:31.555+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.vision.block_count default=0
time=2025-05-17T00:03:31.555+08:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="63.7 GiB" before.free="46.2 GiB" before.free_swap="43.4 GiB" now.total="63.7 GiB" now.free="46.2 GiB" now.free_swap="43.4 GiB"
time=2025-05-17T00:03:31.584+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a name="NVIDIA GeForce RTX 3080" overhead="0 B" before.total="20.0 GiB" before.free="18.1 GiB" now.total="20.0 GiB" now.free="18.1 GiB" now.used="1.9 GiB"
releasing nvml library
time=2025-05-17T00:03:31.584+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-05-17T00:03:31.584+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-05-17T00:03:31.584+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-05-17T00:03:31.584+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-05-17T00:03:31.584+08:00 level=INFO source=server.go:139 msg=offload library=cuda layers.requested=999 layers.model=29 layers.offload=29 layers.split="" memory.available="[18.1 GiB]" memory.gpu_overhead="0 B" memory.required.full="2.7 GiB" memory.required.partial="2.7 GiB" memory.required.kv="448.0 MiB" memory.required.allocations="[2.7 GiB]" memory.weights.total="934.7 MiB" memory.weights.repeating="752.1 MiB" memory.weights.nonrepeating="182.6 MiB" memory.graph.full="838.0 MiB" memory.graph.partial="1.0 GiB"
time=2025-05-17T00:03:31.584+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-05-17T00:03:31.584+08:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-05-17T00:03:31.585+08:00 level=INFO source=server.go:186 msg="enabling flash attention"
time=2025-05-17T00:03:31.585+08:00 level=DEBUG source=server.go:263 msg="compatible gpu libraries" compatible="[cuda_v12 cuda_v11]"
llama_model_loader: loaded meta data with 34 key-value pairs and 338 tensors from E:\LLM\ollama_models\blobs\sha256-29d8c98fa6b098e200069bfb88b9508dc3e85586d20cba59f8dda9a808165104 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen2.5 Coder 1.5B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Qwen2.5-Coder
llama_model_loader: - kv 5: general.size_label str = 1.5B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.license.link str = https://huggingface.co/Qwen/Qwen2.5-C...
llama_model_loader: - kv 8: general.base_model.count u32 = 1
llama_model_loader: - kv 9: general.base_model.0.name str = Qwen2.5 Coder 1.5B
llama_model_loader: - kv 10: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen2.5-C...
llama_model_loader: - kv 12: general.tags arr[str,6] = ["code", "codeqwen", "chat", "qwen", ...
llama_model_loader: - kv 13: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 14: qwen2.block_count u32 = 28
llama_model_loader: - kv 15: qwen2.context_length u32 = 32768
llama_model_loader: - kv 16: qwen2.embedding_length u32 = 1536
llama_model_loader: - kv 17: qwen2.feed_forward_length u32 = 8960
llama_model_loader: - kv 18: qwen2.attention.head_count u32 = 12
llama_model_loader: - kv 19: qwen2.attention.head_count_kv u32 = 2
llama_model_loader: - kv 20: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 21: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 22: general.file_type u32 = 15
llama_model_loader: - kv 23: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 24: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 26: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 27: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 30: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 31: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 32: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 33: general.quantization_version u32 = 2
llama_model_loader: - type f32: 141 tensors
llama_model_loader: - type q4_K: 168 tensors
llama_model_loader: - type q6_K: 29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 934.69 MiB (5.08 BPW)
init_tokenizer: initializing tokenizer for type 2
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch = qwen2
print_info: vocab_only = 1
print_info: model type = ?B
print_info: model params = 1.54 B
print_info: general.name = Qwen2.5 Coder 1.5B Instruct
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-05-17T00:03:31.737+08:00 level=DEBUG source=server.go:339 msg="adding gpu library" path=C:\application\ollama\OLLAMA_FILE\lib\ollama\cuda_v12
time=2025-05-17T00:03:31.737+08:00 level=DEBUG source=server.go:346 msg="adding gpu dependency paths" paths=[C:\application\ollama\OLLAMA_FILE\lib\ollama\cuda_v12]
time=2025-05-17T00:03:31.737+08:00 level=INFO source=server.go:410 msg="starting llama server" cmd="C:\\application\\ollama\\OLLAMA_FILE\\ollama.exe runner --model E:\\LLM\\ollama_models\\blobs\\sha256-29d8c98fa6b098e200069bfb88b9508dc3e85586d20cba59f8dda9a808165104 --ctx-size 32768 --batch-size 512 --n-gpu-layers 999 --verbose --threads 8 --flash-attn --kv-cache-type q8_0 --no-mmap --parallel 1 --port 59218"
time=2025-05-17T00:03:31.737+08:00 level=DEBUG source=server.go:429 msg=subprocess environment="[CUDA_HOME=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1 CUDA_PATH=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1 CUDA_PATH_V12_1=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1 CUDA_VISIBLE_DEVICES=GPU-4e553830-2b28-e8c0-03c9-6e8fc829048a OLLAMA_CONTEXT_LENGTH=32768 OLLAMA_DEBUG=1 OLLAMA_ENABLE_CUDA=1 OLLAMA_FLASH_ATTENTION=1 OLLAMA_HOST=127.0.0.1:11434 OLLAMA_INTEL_GPU=0 OLLAMA_KEEP_ALIVE=-1 OLLAMA_KV_CACHE_TYPE=q8_0 OLLAMA_MAX_LOADED_MODELS=3 OLLAMA_MAX_QUEUE=512 OLLAMA_MODELS=E:/LLM/ollama_models OLLAMA_MULTIUSER_CACHE=0 OLLAMA_NUM_PARALLEL=1 OLLAMA_ORIGINS=* OLLAMA_TMPDIR=E:/LLM/ollama_models/temp OLLAMA_USE_MLOCK=1 PATH=C:\\application\\ollama\\OLLAMA_FILE\\lib\\ollama\\cuda_v12;C:\\application\\ollama\\OLLAMA_FILE\\lib\\ollama\\cuda_v12;C:\\application\\ollama\\OLLAMA_FILE\\lib\\ollama;C:\\Program Files (x86)\\VMware\\VMware Workstation\\bin\\;C:\\Qt\\6.8.1\\mingw_64\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\libnvvp;C:\\Program Files\\Common Files\\Oracle\\Java\\javapath;C:\\Program Files (x86)\\Common Files\\Oracle\\Java\\java8path;C:\\Program Files (x86)\\Common Files\\Oracle\\Java\\javapath;C:\\windows\\system32;C:\\windows;C:\\windows\\System32\\Wbem;C:\\windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\windows\\System32\\OpenSSH\\;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Users\\Administrator\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Program Files\\HP\\OMEN-Broadcast\\Common;C:\\Program Files\\Microsoft VS Code\\bin;C:\\Program Files\\MATLAB\\R2024b\\runtime\\win64;C:\\Program Files\\MATLAB\\R2024b\\bin;C:\\ffmpeg\\bin;C:\\Program Files (x86)\\Wolfram Research\\WolframScript\\;C:\\Program Files\\dotnet\\;C:\\msys2;C:\\texlive\\2023\\bin\\windows;C:\\Program Files (x86)\\GnuPG\\bin;C:\\msys2\\mingw64\\bin;C:\\miniconda;C:\\miniconda\\Scripts;C:\\miniconda\\Library\\bin;C:\\Program Files (x86)\\Microsoft Visual Studio\\2022\\BuildTools\\VC\\Tools\\MSVC\\14.36.32532\\bin\\Hostx64\\x64;C:\\Program Files (x86)\\MATLAB\\MATLAB Runtime\\v851\\runtime\\win32;C:\\Program Files\\Git\\cmd;C:\\application\\syspath;C:\\Program Files\\nodejs\\;C:\\ProgramData\\chocolatey\\bin;C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2025.1.0\\;C:\\Program Files\\NVIDIA Corporation\\Nsight Systems 2025.1.1\\target-windows-x64;C:\\WINDOWS\\system32;C:\\WINDOWS;C:\\WINDOWS\\System32\\Wbem;C:\\WINDOWS\\System32\\WindowsPowerShell\\v1.0\\;C:\\WINDOWS\\System32\\OpenSSH\\;C:\\application\\ollama\\OLLAMA_FILE;C:\\Program Files\\Go\\bin;C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR;C:\\Users\\ngc13\\scoop\\shims;C:\\Users\\ngc13\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\ngc13\\AppData\\Local\\Programs\\Ollama;C:\\Users\\ngc13\\AppData\\Roaming\\npm;C:\\Users\\ngc13\\go\\bin;C:\\application\\ollama\\OLLAMA_FILE\\lib\\ollama OLLAMA_LIBRARY_PATH=C:\\application\\ollama\\OLLAMA_FILE\\lib\\ollama;C:\\application\\ollama\\OLLAMA_FILE\\lib\\ollama\\cuda_v12]"
time=2025-05-17T00:03:31.742+08:00 level=INFO source=sched.go:452 msg="loaded runners" count=1
time=2025-05-17T00:03:31.742+08:00 level=INFO source=server.go:589 msg="waiting for llama runner to start responding"
time=2025-05-17T00:03:31.742+08:00 level=INFO source=server.go:623 msg="waiting for server to become available" status="llm server error"
time=2025-05-17T00:03:31.765+08:00 level=INFO source=runner.go:853 msg="starting go runner"
time=2025-05-17T00:03:31.769+08:00 level=DEBUG source=ggml.go:93 msg="ggml backend load all from path" path=C:\application\ollama\OLLAMA_FILE\lib\ollama
load_backend: loaded CPU backend from C:\application\ollama\OLLAMA_FILE\lib\ollama\ggml-cpu-alderlake.dll
time=2025-05-17T00:03:31.780+08:00 level=DEBUG source=ggml.go:93 msg="ggml backend load all from path" path=C:\application\ollama\OLLAMA_FILE\lib\ollama\cuda_v12
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from C:\application\ollama\OLLAMA_FILE\lib\ollama\cuda_v12\ggml-cuda.dll
time=2025-05-17T00:03:31.857+08:00 level=INFO source=ggml.go:103 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-05-17T00:03:31.858+08:00 level=INFO source=runner.go:913 msg="Server listening on 127.0.0.1:59218"
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3080) - 19273 MiB free
llama_model_loader: loaded meta data with 34 key-value pairs and 338 tensors from E:\LLM\ollama_models\blobs\sha256-29d8c98fa6b098e200069bfb88b9508dc3e85586d20cba59f8dda9a808165104 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen2.5 Coder 1.5B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Qwen2.5-Coder
llama_model_loader: - kv 5: general.size_label str = 1.5B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.license.link str = https://huggingface.co/Qwen/Qwen2.5-C...
llama_model_loader: - kv 8: general.base_model.count u32 = 1 llama_model_loader: - kv 9: general.base_model.0.name str = Qwen2.5 Coder 1.5B llama_model_loader: - kv 10: general.base_model.0.organization str = Qwen llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen2.5-C... llama_model_loader: - kv 12: general.tags arr[str,6] = ["code", "codeqwen", "chat", "qwen", ... llama_model_loader: - kv 13: general.languages arr[str,1] = ["en"] llama_model_loader: - kv 14: qwen2.block_count u32 = 28 llama_model_loader: - kv 15: qwen2.context_length u32 = 32768 llama_model_loader: - kv 16: qwen2.embedding_length u32 = 1536 llama_model_loader: - kv 17: qwen2.feed_forward_length u32 = 8960 llama_model_loader: - kv 18: qwen2.attention.head_count u32 = 12 llama_model_loader: - kv 19: qwen2.attention.head_count_kv u32 = 2 llama_model_loader: - kv 20: qwen2.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 21: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 22: general.file_type u32 = 15 llama_model_loader: - kv 23: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 24: tokenizer.ggml.pre str = qwen2 time=2025-05-17T00:03:31.994+08:00 level=INFO source=server.go:623 msg="waiting for server to become available" status="llm server loading model" llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 26: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 27: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 151645 llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 151643 llama_model_loader: - kv 30: tokenizer.ggml.bos_token_id u32 = 151643 llama_model_loader: - kv 31: tokenizer.ggml.add_bos_token bool = false llama_model_loader: - kv 32: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... 
llama_model_loader: - kv 33: general.quantization_version u32 = 2 llama_model_loader: - type f32: 141 tensors llama_model_loader: - type q4_K: 168 tensors llama_model_loader: - type q6_K: 29 tensors print_info: file format = GGUF V3 (latest) print_info: file type = Q4_K - Medium print_info: file size = 934.69 MiB (5.08 BPW) init_tokenizer: initializing tokenizer for type 2 load: control token: 151659 '<|fim_prefix|>' is not marked as EOG load: control token: 151656 '<|video_pad|>' is not marked as EOG load: control token: 151655 '<|image_pad|>' is not marked as EOG load: control token: 151653 '<|vision_end|>' is not marked as EOG load: control token: 151652 '<|vision_start|>' is not marked as EOG load: control token: 151651 '<|quad_end|>' is not marked as EOG load: control token: 151649 '<|box_end|>' is not marked as EOG load: control token: 151648 '<|box_start|>' is not marked as EOG load: control token: 151646 '<|object_ref_start|>' is not marked as EOG load: control token: 151644 '<|im_start|>' is not marked as EOG load: control token: 151661 '<|fim_suffix|>' is not marked as EOG load: control token: 151647 '<|object_ref_end|>' is not marked as EOG load: control token: 151660 '<|fim_middle|>' is not marked as EOG load: control token: 151654 '<|vision_pad|>' is not marked as EOG load: control token: 151650 '<|quad_start|>' is not marked as EOG load: special tokens cache size = 22 load: token to piece cache size = 0.9310 MB print_info: arch = qwen2 print_info: vocab_only = 0 print_info: n_ctx_train = 32768 print_info: n_embd = 1536 print_info: n_layer = 28 print_info: n_head = 12 print_info: n_head_kv = 2 print_info: n_rot = 128 print_info: n_swa = 0 print_info: n_swa_pattern = 1 print_info: n_embd_head_k = 128 print_info: n_embd_head_v = 128 print_info: n_gqa = 6 print_info: n_embd_k_gqa = 256 print_info: n_embd_v_gqa = 256 print_info: f_norm_eps = 0.0e+00 print_info: f_norm_rms_eps = 1.0e-06 print_info: f_clamp_kqv = 0.0e+00 print_info: f_max_alibi_bias = 0.0e+00 print_info: f_logit_scale = 0.0e+00 print_info: f_attn_scale = 0.0e+00 print_info: n_ff = 8960 print_info: n_expert = 0 print_info: n_expert_used = 0 print_info: causal attn = 1 print_info: pooling type = 0 print_info: rope type = 2 print_info: rope scaling = linear print_info: freq_base_train = 1000000.0 print_info: freq_scale_train = 1 print_info: n_ctx_orig_yarn = 32768 print_info: rope_finetuned = unknown print_info: ssm_d_conv = 0 print_info: ssm_d_inner = 0 print_info: ssm_d_state = 0 print_info: ssm_dt_rank = 0 print_info: ssm_dt_b_c_rms = 0 print_info: model type = 1.5B print_info: model params = 1.54 B print_info: general.name = Qwen2.5 Coder 1.5B Instruct print_info: vocab type = BPE print_info: n_vocab = 151936 print_info: n_merges = 151387 print_info: BOS token = 151643 '<|endoftext|>' print_info: EOS token = 151645 '<|im_end|>' print_info: EOT token = 151645 '<|im_end|>' print_info: PAD token = 151643 '<|endoftext|>' print_info: LF token = 198 'Ċ' print_info: FIM PRE token = 151659 '<|fim_prefix|>' print_info: FIM SUF token = 151661 '<|fim_suffix|>' print_info: FIM MID token = 151660 '<|fim_middle|>' print_info: FIM PAD token = 151662 '<|fim_pad|>' print_info: FIM REP token = 151663 '<|repo_name|>' print_info: FIM SEP token = 151664 '<|file_sep|>' print_info: EOG token = 151643 '<|endoftext|>' print_info: EOG token = 151645 '<|im_end|>' print_info: EOG token = 151662 '<|fim_pad|>' print_info: EOG token = 151663 '<|repo_name|>' print_info: EOG token = 151664 '<|file_sep|>' print_info: max token length = 256 
load_tensors: loading model tensors, this can take a while... (mmap = false) load_tensors: layer 0 assigned to device CUDA0, is_swa = 0 load_tensors: layer 1 assigned to device CUDA0, is_swa = 0 load_tensors: layer 2 assigned to device CUDA0, is_swa = 0 load_tensors: layer 3 assigned to device CUDA0, is_swa = 0 load_tensors: layer 4 assigned to device CUDA0, is_swa = 0 load_tensors: layer 5 assigned to device CUDA0, is_swa = 0 load_tensors: layer 6 assigned to device CUDA0, is_swa = 0 load_tensors: layer 7 assigned to device CUDA0, is_swa = 0 load_tensors: layer 8 assigned to device CUDA0, is_swa = 0 load_tensors: layer 9 assigned to device CUDA0, is_swa = 0 load_tensors: layer 10 assigned to device CUDA0, is_swa = 0 load_tensors: layer 11 assigned to device CUDA0, is_swa = 0 load_tensors: layer 12 assigned to device CUDA0, is_swa = 0 load_tensors: layer 13 assigned to device CUDA0, is_swa = 0 load_tensors: layer 14 assigned to device CUDA0, is_swa = 0 load_tensors: layer 15 assigned to device CUDA0, is_swa = 0 load_tensors: layer 16 assigned to device CUDA0, is_swa = 0 load_tensors: layer 17 assigned to device CUDA0, is_swa = 0 load_tensors: layer 18 assigned to device CUDA0, is_swa = 0 load_tensors: layer 19 assigned to device CUDA0, is_swa = 0 load_tensors: layer 20 assigned to device CUDA0, is_swa = 0 load_tensors: layer 21 assigned to device CUDA0, is_swa = 0 load_tensors: layer 22 assigned to device CUDA0, is_swa = 0 load_tensors: layer 23 assigned to device CUDA0, is_swa = 0 load_tensors: layer 24 assigned to device CUDA0, is_swa = 0 load_tensors: layer 25 assigned to device CUDA0, is_swa = 0 load_tensors: layer 26 assigned to device CUDA0, is_swa = 0 load_tensors: layer 27 assigned to device CUDA0, is_swa = 0 load_tensors: layer 28 assigned to device CUDA0, is_swa = 0 load_tensors: tensor 'token_embd.weight' (q6_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead load_tensors: offloading 28 repeating layers to GPU load_tensors: offloading output layer to GPU load_tensors: offloaded 29/29 layers to GPU load_tensors: CPU model buffer size = 182.57 MiB load_tensors: CUDA0 model buffer size = 934.70 MiB load_all_data: no device found for buffer type CPU for async uploads load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0 time=2025-05-17T00:03:32.244+08:00 level=DEBUG source=server.go:634 msg="model load progress 0.36" time=2025-05-17T00:03:32.495+08:00 level=DEBUG source=server.go:634 msg="model load progress 0.81" llama_context: constructing llama_context llama_context: n_seq_max = 1 llama_context: n_ctx = 32768 llama_context: n_ctx_per_seq = 32768 llama_context: n_batch = 512 llama_context: n_ubatch = 512 llama_context: causal_attn = 1 llama_context: flash_attn = 1 llama_context: freq_base = 1000000.0 llama_context: freq_scale = 1 set_abort_callback: call llama_context: CUDA_Host output buffer size = 0.59 MiB llama_context: n_ctx = 32768 llama_context: n_ctx = 32768 (padded) init: kv_size = 32768, offload = 1, type_k = 'q8_0', type_v = 'q8_0', n_layer = 28, can_shift = 1 init: layer 0: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0 init: layer 1: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0 init: layer 2: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0 init: layer 3: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0 init: layer 4: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0 init: layer 5: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0 init: layer 6: n_embd_k_gqa = 256, n_embd_v_gqa 
= 256, dev = CUDA0 init: layer 7: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0 init: layer 8: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0 init: layer 9: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0 init: layer 10: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0 init: layer 11: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0 init: layer 12: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0 init: layer 13: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0 init: layer 14: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0 init: layer 15: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0 init: layer 16: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0 init: layer 17: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0 init: layer 18: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0 init: layer 19: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0 init: layer 20: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0 init: layer 21: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0 init: layer 22: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0 init: layer 23: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0 init: layer 24: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0 init: layer 25: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0 init: layer 26: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0 init: layer 27: n_embd_k_gqa = 256, n_embd_v_gqa = 256, dev = CUDA0 init: CUDA0 KV buffer size = 476.00 MiB llama_context: KV self size = 476.00 MiB, K (q8_0): 238.00 MiB, V (q8_0): 238.00 MiB llama_context: enumerating backends llama_context: backend_ptrs.size() = 2 llama_context: max_nodes = 65536 llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0 llama_context: reserving graph for n_tokens = 512, n_seqs = 1 llama_context: reserving graph for n_tokens = 1, n_seqs = 1 llama_context: reserving graph for n_tokens = 512, n_seqs = 1 llama_context: CUDA0 compute buffer size = 299.75 MiB llama_context: CUDA_Host compute buffer size = 67.01 MiB llama_context: graph nodes = 931 llama_context: graph splits = 2 time=2025-05-17T00:03:32.745+08:00 level=INFO source=server.go:628 msg="llama runner started in 1.00 seconds" time=2025-05-17T00:03:32.745+08:00 level=DEBUG source=sched.go:464 msg="finished setting up runner" model=E:\LLM\ollama_models\blobs\sha256-29d8c98fa6b098e200069bfb88b9508dc3e85586d20cba59f8dda9a808165104 time=2025-05-17T00:03:32.745+08:00 level=DEBUG source=sched.go:472 msg="context for request finished" [GIN] 2025/05/17 - 00:03:32 | 200 | 1.9096022s | 127.0.0.1 | POST "/api/generate" time=2025-05-17T00:03:32.745+08:00 level=DEBUG source=sched.go:342 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/qwen2.5-coder:1.5b-gpu runner.inference=cuda runner.devices=1 runner.size="2.7 GiB" runner.vram="2.7 GiB" runner.num_ctx=32768 runner.parallel=1 runner.pid=29412 runner.model=E:\LLM\ollama_models\blobs\sha256-29d8c98fa6b098e200069bfb88b9508dc3e85586d20cba59f8dda9a808165104 duration=2562047h47m16.854775807s time=2025-05-17T00:03:32.745+08:00 level=DEBUG source=sched.go:360 msg="after processing request finished event" runner.name=registry.ollama.ai/library/qwen2.5-coder:1.5b-gpu runner.inference=cuda runner.devices=1 runner.size="2.7 GiB" runner.vram="2.7 GiB" runner.num_ctx=32768 runner.parallel=1 runner.pid=29412 runner.model=E:\LLM\ollama_models\blobs\sha256-29d8c98fa6b098e200069bfb88b9508dc3e85586d20cba59f8dda9a808165104 refCount=0 [GIN] 2025/05/17 - 00:03:35 | 200 
| 0s | 127.0.0.1 | HEAD "/" [GIN] 2025/05/17 - 00:03:35 | 200 | 544.5µs | 127.0.0.1 | GET "/api/ps" ```
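For readers who want to check the same behavior on their own machine, here is a minimal sketch (Python standard library only; the model names and the default 127.0.0.1:11434 endpoint are assumptions taken from this thread) that warms up both models through /api/generate and then asks /api/ps which models are actually resident:

```python
import json
import urllib.request

BASE = "http://127.0.0.1:11434"  # default Ollama address; adjust if OLLAMA_HOST differs
MODELS = ["qwen3:14b", "qwen2.5-coder:1.5b"]  # the two models discussed in this issue

def generate(model: str) -> None:
    # A tiny non-streaming request. keep_alive=-1 asks the server to keep
    # the model loaded indefinitely (the same effect as OLLAMA_KEEP_ALIVE=-1).
    body = json.dumps({"model": model, "prompt": "hi",
                       "stream": False, "keep_alive": -1}).encode()
    req = urllib.request.Request(f"{BASE}/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req).read()

for m in MODELS:
    generate(m)

# /api/ps lists the models currently loaded into memory. If the scheduler
# evicted the first model to make room for the second, only one entry remains.
with urllib.request.urlopen(f"{BASE}/api/ps") as resp:
    loaded = json.load(resp)["models"]
print("loaded:", [m["name"] for m in loaded])
```

If both names appear in the output, concurrent loading is working; if the first model disappears after the second request, you are seeing the eviction behavior described in this issue.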

@Soul2294 commented on GitHub (May 18, 2025):

Similar issue for me. After a recent update broke my old model, I completely wiped and reinstalled, but now I'm no longer able to run multiple models in parallel by default.

The environment variables aren't having any effect either; the only one that currently works is the custom models directory.

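A common reason environment variables appear to have no effect is that they are set in an interactive shell while the Ollama server runs as a service or tray app that never sees them. One way to rule that out is to stop the service and start `ollama serve` with the variables applied directly to its process. A minimal sketch (the values shown are only examples):

```python
import os
import subprocess

# Start the server with the concurrency settings in its own environment,
# bypassing whatever the service manager or tray app would have used.
# Note: this blocks until the server is stopped.
env = dict(os.environ,
           OLLAMA_NUM_PARALLEL="1",
           OLLAMA_MAX_LOADED_MODELS="3")
subprocess.run(["ollama", "serve"], env=env)
```

The server prints its effective configuration in its first log lines at startup, which confirms whether the values were actually picked up.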

@NGC13009 commented on GitHub (May 24, 2025):

v0.7.1 seems to have fixed this issue; I can now use multiple models at the same time.


@trdischat commented on GitHub (Jun 2, 2025):

I am having the same problem. It doesn't matter what model or environment settings I use: nothing runs concurrently, and querying one model always unloads the other from memory, regardless of model size. I tried fairly small models (mistral:7b and llama2:7b) as well as many others. I am running ollama 0.9.0 on Ubuntu 20.04 with OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS both set to 4 in my ollama.service file. I have 96 GB of system RAM and 12 GB of VRAM on an Nvidia card, so there should be enough memory to load multiple models.

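For a systemd install like the one described above, the usual way to make these variables reach the service is a drop-in override rather than editing the unit file itself. A sketch, assuming the standard service name `ollama` (this is the file that `sudo systemctl edit ollama` creates):

```ini
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=4"
```

After saving, run `sudo systemctl daemon-reload` and `sudo systemctl restart ollama`, then re-test.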

@rick-github commented on GitHub (Jun 2, 2025):

@trdischat Open a new issue and add logs: https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues


@trdischat commented on GitHub (Jun 2, 2025):

Thanks. See #10952.


Reference: github-starred/ollama#7052