[GH-ISSUE #12574] Ollama doesn't use NVIDIA GPU after updating to v0.12.5 #8342

Closed
opened 2026-04-12 20:55:50 -05:00 by GiteaMirror · 8 comments

Originally created by @devrockin on GitHub (Oct 11, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12574

What is the issue?

Ollama doesn't use the NVIDIA RTX 4080 GPU after updating to v0.12.5 or reinstalling it. It worked on v0.12.3.
Here is the server.log.

Relevant log output

time=2025-10-11T18:56:40.321+08:00 level=INFO source=routes.go:1481 msg="server config" env="map[CUDA_VISIBLE_DEVICES:GPU-6026f538-f122-eb94-1388-5d1156d22bff GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY:cuda OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:2 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:D:\\AIPro\\OllamaFiles\\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[* http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-10-11T18:56:40.341+08:00 level=INFO source=images.go:522 msg="total blobs: 104"
time=2025-10-11T18:56:40.345+08:00 level=INFO source=images.go:529 msg="total unused blobs removed: 0"
time=2025-10-11T18:56:40.349+08:00 level=INFO source=routes.go:1534 msg="Listening on [::]:11434 (version 0.12.5)"
time=2025-10-11T18:56:40.350+08:00 level=INFO source=runner.go:80 msg="discovering available GPUs..."
time=2025-10-11T18:56:40.350+08:00 level=INFO source=types.go:129 msg="inference compute" id=cpu library=cpu compute="" name=cpu description=cpu libdirs=ollama driver="" pci_id="" type="" total="63.7 GiB" available="43.7 GiB"
time=2025-10-11T18:56:40.350+08:00 level=INFO source=routes.go:1575 msg="entering low vram mode" "total vram"="0 B" threshold="20.0 GiB"
[GIN] 2025/10/11 - 18:56:41 | 200 |       593.6µs |       127.0.0.1 | GET      "/api/version"
[GIN] 2025/10/11 - 18:56:41 | 200 |      8.6026ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2025/10/11 - 18:56:41 | 200 |    133.0919ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2025/10/11 - 18:56:54 | 200 |            0s |       127.0.0.1 | GET      "/api/version"
[GIN] 2025/10/11 - 18:56:54 | 200 |      5.3169ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2025/10/11 - 18:57:00 | 200 |     40.6545ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2025/10/11 - 18:57:04 | 200 |      5.5555ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2025/10/11 - 18:57:04 | 200 |     30.4073ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2025/10/11 - 18:57:04 | 200 |     28.2614ms |       127.0.0.1 | POST     "/api/show"
llama_model_loader: loaded meta data with 33 key-value pairs and 398 tensors from D:\AIPro\OllamaFiles\models\blobs\sha256-b7f3f749aa86d5bf23c399a579382c0f0a52cf97bd5c66df60c4f960905431c2 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = jan-v1-2509
llama_model_loader: - kv   3:                            general.version str              = 240
llama_model_loader: - kv   4:                           general.basename str              = checkpoint
llama_model_loader: - kv   5:                         general.size_label str              = 4.0B
llama_model_loader: - kv   6:                          qwen3.block_count u32              = 36
llama_model_loader: - kv   7:                       qwen3.context_length u32              = 262144
llama_model_loader: - kv   8:                     qwen3.embedding_length u32              = 2560
llama_model_loader: - kv   9:                  qwen3.feed_forward_length u32              = 9728
llama_model_loader: - kv  10:                 qwen3.attention.head_count u32              = 32
llama_model_loader: - kv  11:              qwen3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  12:                       qwen3.rope.freq_base f32              = 5000000.000000
llama_model_loader: - kv  13:     qwen3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                 qwen3.attention.key_length u32              = 128
llama_model_loader: - kv  15:               qwen3.attention.value_length u32              = 128
llama_model_loader: - kv  16:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  17:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  18:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  19:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  20:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  21:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  22:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  23:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  24:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  25:                           general.finetune str              = jan-v1-2509
llama_model_loader: - kv  26:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  27:               general.quantization_version u32              = 2
llama_model_loader: - kv  28:                          general.file_type u32              = 7
llama_model_loader: - kv  29:                      quantize.imatrix.file str              = ./gguf_upload_2/imatrix.gguf
llama_model_loader: - kv  30:                   quantize.imatrix.dataset str              = ./qwen_calibration_with_chat.txt
llama_model_loader: - kv  31:             quantize.imatrix.entries_count u32              = 252
llama_model_loader: - kv  32:              quantize.imatrix.chunks_count u32              = 6
llama_model_loader: - type  f32:  145 tensors
llama_model_loader: - type q8_0:  253 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 3.98 GiB (8.50 BPW) 
load: printing all EOG tokens:
load:   - 151643 ('<|endoftext|>')
load:   - 151645 ('<|im_end|>')
load:   - 151662 ('<|fim_pad|>')
load:   - 151663 ('<|repo_name|>')
load:   - 151664 ('<|file_sep|>')
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 4.02 B
print_info: general.name     = jan-v1-2509
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-10-11T18:57:04.859+08:00 level=INFO source=cpu_windows.go:139 msg=packages count=1
time=2025-10-11T18:57:04.859+08:00 level=INFO source=cpu_windows.go:155 msg="efficiency cores detected" maxEfficiencyClass=1
time=2025-10-11T18:57:04.859+08:00 level=INFO source=cpu_windows.go:186 msg="" package=0 cores=24 efficiency=16 threads=32
time=2025-10-11T18:57:04.859+08:00 level=INFO source=server.go:216 msg="enabling flash attention"
time=2025-10-11T18:57:04.867+08:00 level=INFO source=server.go:400 msg="starting runner" cmd="C:\\Users\\fouae\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model D:\\AIPro\\OllamaFiles\\models\\blobs\\sha256-b7f3f749aa86d5bf23c399a579382c0f0a52cf97bd5c66df60c4f960905431c2 --port 27773"
time=2025-10-11T18:57:04.874+08:00 level=INFO source=cpu_windows.go:139 msg=packages count=1
time=2025-10-11T18:57:04.874+08:00 level=INFO source=cpu_windows.go:155 msg="efficiency cores detected" maxEfficiencyClass=1
time=2025-10-11T18:57:04.874+08:00 level=INFO source=cpu_windows.go:186 msg="" package=0 cores=24 efficiency=16 threads=32
time=2025-10-11T18:57:04.874+08:00 level=INFO source=server.go:505 msg="system memory" total="63.7 GiB" free="44.7 GiB" free_swap="47.8 GiB"
time=2025-10-11T18:57:04.875+08:00 level=INFO source=memory.go:36 msg="new model will fit in available VRAM across minimum required GPUs, loading" model=D:\AIPro\OllamaFiles\models\blobs\sha256-b7f3f749aa86d5bf23c399a579382c0f0a52cf97bd5c66df60c4f960905431c2 library=cpu parallel=1 required="0 B" gpus=1
time=2025-10-11T18:57:04.875+08:00 level=INFO source=server.go:545 msg=offload library=cpu layers.requested=-1 layers.model=37 layers.offload=0 layers.split=[] memory.available="[44.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="5.0 GiB" memory.required.partial="0 B" memory.required.kv="576.0 MiB" memory.required.allocations="[5.0 GiB]" memory.weights.total="4.0 GiB" memory.weights.repeating="3.6 GiB" memory.weights.nonrepeating="394.1 MiB" memory.graph.full="384.0 MiB" memory.graph.partial="384.0 MiB"
time=2025-10-11T18:57:04.910+08:00 level=INFO source=runner.go:864 msg="starting go runner"
load_backend: loaded CPU backend from C:\Users\fouae\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
time=2025-10-11T18:57:04.931+08:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 compiler=cgo(clang)
time=2025-10-11T18:57:04.932+08:00 level=INFO source=runner.go:900 msg="Server listening on 127.0.0.1:27773"
time=2025-10-11T18:57:04.940+08:00 level=INFO source=runner.go:799 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:4096 KvCacheType: NumThreads:8 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-11T18:57:04.940+08:00 level=INFO source=server.go:1271 msg="waiting for llama runner to start responding"
time=2025-10-11T18:57:04.941+08:00 level=INFO source=server.go:1305 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 33 key-value pairs and 398 tensors from D:\AIPro\OllamaFiles\models\blobs\sha256-b7f3f749aa86d5bf23c399a579382c0f0a52cf97bd5c66df60c4f960905431c2 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = jan-v1-2509
llama_model_loader: - kv   3:                            general.version str              = 240
llama_model_loader: - kv   4:                           general.basename str              = checkpoint
llama_model_loader: - kv   5:                         general.size_label str              = 4.0B
llama_model_loader: - kv   6:                          qwen3.block_count u32              = 36
llama_model_loader: - kv   7:                       qwen3.context_length u32              = 262144
llama_model_loader: - kv   8:                     qwen3.embedding_length u32              = 2560
llama_model_loader: - kv   9:                  qwen3.feed_forward_length u32              = 9728
llama_model_loader: - kv  10:                 qwen3.attention.head_count u32              = 32
llama_model_loader: - kv  11:              qwen3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  12:                       qwen3.rope.freq_base f32              = 5000000.000000
llama_model_loader: - kv  13:     qwen3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                 qwen3.attention.key_length u32              = 128
llama_model_loader: - kv  15:               qwen3.attention.value_length u32              = 128
llama_model_loader: - kv  16:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  17:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  18:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  19:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  20:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  21:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  22:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  23:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  24:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  25:                           general.finetune str              = jan-v1-2509
llama_model_loader: - kv  26:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  27:               general.quantization_version u32              = 2
llama_model_loader: - kv  28:                          general.file_type u32              = 7
llama_model_loader: - kv  29:                      quantize.imatrix.file str              = ./gguf_upload_2/imatrix.gguf
llama_model_loader: - kv  30:                   quantize.imatrix.dataset str              = ./qwen_calibration_with_chat.txt
llama_model_loader: - kv  31:             quantize.imatrix.entries_count u32              = 252
llama_model_loader: - kv  32:              quantize.imatrix.chunks_count u32              = 6
llama_model_loader: - type  f32:  145 tensors
llama_model_loader: - type q8_0:  253 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 3.98 GiB (8.50 BPW) 
load: printing all EOG tokens:
load:   - 151643 ('<|endoftext|>')
load:   - 151645 ('<|im_end|>')
load:   - 151662 ('<|fim_pad|>')
load:   - 151663 ('<|repo_name|>')
load:   - 151664 ('<|file_sep|>')
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 262144
print_info: n_embd           = 2560
print_info: n_layer          = 36
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 9728
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = -1
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 5000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 262144
print_info: rope_finetuned   = unknown
print_info: model type       = 4B
print_info: model params     = 4.02 B
print_info: general.name     = jan-v1-2509
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors:          CPU model buffer size =  4076.43 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_context: freq_base     = 5000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.59 MiB
llama_kv_cache:        CPU KV buffer size =   576.00 MiB
llama_kv_cache: size =  576.00 MiB (  4096 cells,  36 layers,  1/1 seqs), K (f16):  288.00 MiB, V (f16):  288.00 MiB
llama_context:        CPU compute buffer size =   301.75 MiB
llama_context: graph nodes  = 1267
llama_context: graph splits = 1
time=2025-10-11T18:57:07.945+08:00 level=INFO source=server.go:1309 msg="llama runner started in 3.08 seconds"
time=2025-10-11T18:57:07.945+08:00 level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-10-11T18:57:07.945+08:00 level=INFO source=server.go:1271 msg="waiting for llama runner to start responding"
time=2025-10-11T18:57:07.945+08:00 level=INFO source=server.go:1309 msg="llama runner started in 3.08 seconds"
[GIN] 2025/10/11 - 18:57:47 | 200 |   42.3814128s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2025/10/11 - 18:58:31 | 200 |     10.8172ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2025/10/11 - 18:58:36 | 200 |     62.0509ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2025/10/11 - 18:58:36 | 200 |     56.9075ms |       127.0.0.1 | POST     "/api/show"
time=2025-10-11T18:58:36.312+08:00 level=INFO source=sched.go:544 msg="updated VRAM based on existing loaded models" gpu=0 library=cpu total="63.7 GiB" available="41.1 GiB"
time=2025-10-11T18:58:36.378+08:00 level=INFO source=cpu_windows.go:139 msg=packages count=1
time=2025-10-11T18:58:36.378+08:00 level=INFO source=cpu_windows.go:155 msg="efficiency cores detected" maxEfficiencyClass=1
time=2025-10-11T18:58:36.378+08:00 level=INFO source=cpu_windows.go:186 msg="" package=0 cores=24 efficiency=16 threads=32
time=2025-10-11T18:58:36.378+08:00 level=INFO source=server.go:216 msg="enabling flash attention"
time=2025-10-11T18:58:36.381+08:00 level=INFO source=server.go:400 msg="starting runner" cmd="C:\\Users\\fouae\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model D:\\AIPro\\OllamaFiles\\models\\blobs\\sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 --port 13065"
time=2025-10-11T18:58:36.390+08:00 level=INFO source=server.go:675 msg="loading model" "model layers"=25 requested=-1
time=2025-10-11T18:58:36.390+08:00 level=INFO source=cpu_windows.go:139 msg=packages count=1
time=2025-10-11T18:58:36.390+08:00 level=INFO source=cpu_windows.go:155 msg="efficiency cores detected" maxEfficiencyClass=1
time=2025-10-11T18:58:36.390+08:00 level=INFO source=cpu_windows.go:186 msg="" package=0 cores=24 efficiency=16 threads=32
time=2025-10-11T18:58:36.390+08:00 level=INFO source=server.go:681 msg="system memory" total="63.7 GiB" free="41.1 GiB" free_swap="43.1 GiB"
time=2025-10-11T18:58:36.427+08:00 level=INFO source=runner.go:1316 msg="starting ollama engine"
time=2025-10-11T18:58:36.436+08:00 level=INFO source=runner.go:1352 msg="Server listening on 127.0.0.1:13065"
time=2025-10-11T18:58:36.444+08:00 level=INFO source=runner.go:1189 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:4096 KvCacheType: NumThreads:8 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-11T18:58:36.476+08:00 level=INFO source=ggml.go:133 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=315 num_key_values=30
load_backend: loaded CPU backend from C:\Users\fouae\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
time=2025-10-11T18:58:36.489+08:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 compiler=cgo(clang)
time=2025-10-11T18:58:36.491+08:00 level=INFO source=runner.go:1189 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:4096 KvCacheType: NumThreads:8 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-11T18:58:36.559+08:00 level=INFO source=runner.go:1189 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:4096 KvCacheType: NumThreads:8 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-11T18:58:36.559+08:00 level=INFO source=ggml.go:477 msg="offloading 0 repeating layers to GPU"
time=2025-10-11T18:58:36.559+08:00 level=INFO source=ggml.go:481 msg="offloading output layer to CPU"
time=2025-10-11T18:58:36.559+08:00 level=INFO source=ggml.go:488 msg="offloaded 0/25 layers to GPU"
time=2025-10-11T18:58:36.559+08:00 level=INFO source=device.go:211 msg="model weights" device=CPU size="12.8 GiB"
time=2025-10-11T18:58:36.559+08:00 level=INFO source=device.go:222 msg="kv cache" device=CPU size="192.0 MiB"
time=2025-10-11T18:58:36.559+08:00 level=INFO source=device.go:233 msg="compute graph" device=CPU size="109.2 MiB"
time=2025-10-11T18:58:36.560+08:00 level=INFO source=device.go:238 msg="total memory" size="13.1 GiB"
time=2025-10-11T18:58:36.560+08:00 level=INFO source=sched.go:481 msg="loaded runners" count=2
time=2025-10-11T18:58:36.560+08:00 level=INFO source=server.go:1271 msg="waiting for llama runner to start responding"
time=2025-10-11T18:58:36.560+08:00 level=INFO source=server.go:1305 msg="waiting for server to become available" status="llm server loading model"
time=2025-10-11T18:58:41.570+08:00 level=INFO source=server.go:1309 msg="llama runner started in 5.19 seconds"
[GIN] 2025/10/11 - 18:59:11 | 200 |   35.4599643s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2025/10/11 - 19:06:19 | 200 |            0s |       127.0.0.1 | GET      "/api/version"
[GIN] 2025/10/11 - 19:06:19 | 200 |     17.2262ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2025/10/11 - 19:06:19 | 200 |    197.0747ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2025/10/11 - 19:06:30 | 200 |     65.6643ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2025/10/11 - 19:06:30 | 200 |      5.1917ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2025/10/11 - 19:06:30 | 200 |     59.0817ms |       127.0.0.1 | POST     "/api/show"
time=2025-10-11T19:06:30.498+08:00 level=INFO source=cpu_windows.go:139 msg=packages count=1
time=2025-10-11T19:06:30.498+08:00 level=INFO source=cpu_windows.go:155 msg="efficiency cores detected" maxEfficiencyClass=1
time=2025-10-11T19:06:30.498+08:00 level=INFO source=cpu_windows.go:186 msg="" package=0 cores=24 efficiency=16 threads=32
time=2025-10-11T19:06:30.498+08:00 level=INFO source=server.go:216 msg="enabling flash attention"
time=2025-10-11T19:06:30.501+08:00 level=INFO source=server.go:400 msg="starting runner" cmd="C:\\Users\\fouae\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model D:\\AIPro\\OllamaFiles\\models\\blobs\\sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 --port 19452"
time=2025-10-11T19:06:30.509+08:00 level=INFO source=server.go:675 msg="loading model" "model layers"=25 requested=-1
time=2025-10-11T19:06:30.509+08:00 level=INFO source=cpu_windows.go:139 msg=packages count=1
time=2025-10-11T19:06:30.509+08:00 level=INFO source=cpu_windows.go:155 msg="efficiency cores detected" maxEfficiencyClass=1
time=2025-10-11T19:06:30.509+08:00 level=INFO source=cpu_windows.go:186 msg="" package=0 cores=24 efficiency=16 threads=32
time=2025-10-11T19:06:30.509+08:00 level=INFO source=server.go:681 msg="system memory" total="63.7 GiB" free="50.3 GiB" free_swap="47.7 GiB"
time=2025-10-11T19:06:30.554+08:00 level=INFO source=runner.go:1316 msg="starting ollama engine"
time=2025-10-11T19:06:30.562+08:00 level=INFO source=runner.go:1352 msg="Server listening on 127.0.0.1:19452"
time=2025-10-11T19:06:30.563+08:00 level=INFO source=runner.go:1189 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:4096 KvCacheType: NumThreads:8 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-11T19:06:30.597+08:00 level=INFO source=ggml.go:133 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=315 num_key_values=30
load_backend: loaded CPU backend from C:\Users\fouae\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
time=2025-10-11T19:06:30.609+08:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 compiler=cgo(clang)
time=2025-10-11T19:06:30.613+08:00 level=INFO source=runner.go:1189 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:4096 KvCacheType: NumThreads:8 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-11T19:06:30.683+08:00 level=INFO source=device.go:211 msg="model weights" device=CPU size="12.8 GiB"
time=2025-10-11T19:06:30.683+08:00 level=INFO source=device.go:222 msg="kv cache" device=CPU size="192.0 MiB"
time=2025-10-11T19:06:30.683+08:00 level=INFO source=device.go:233 msg="compute graph" device=CPU size="109.2 MiB"
time=2025-10-11T19:06:30.683+08:00 level=INFO source=device.go:238 msg="total memory" size="13.1 GiB"
time=2025-10-11T19:06:30.683+08:00 level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-10-11T19:06:30.683+08:00 level=INFO source=runner.go:1189 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:4096 KvCacheType: NumThreads:8 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-11T19:06:30.683+08:00 level=INFO source=server.go:1271 msg="waiting for llama runner to start responding"
time=2025-10-11T19:06:30.683+08:00 level=INFO source=ggml.go:477 msg="offloading 0 repeating layers to GPU"
time=2025-10-11T19:06:30.683+08:00 level=INFO source=ggml.go:481 msg="offloading output layer to CPU"
time=2025-10-11T19:06:30.683+08:00 level=INFO source=ggml.go:488 msg="offloaded 0/25 layers to GPU"
time=2025-10-11T19:06:30.683+08:00 level=INFO source=server.go:1305 msg="waiting for server to become available" status="llm server loading model"
time=2025-10-11T19:06:32.189+08:00 level=INFO source=server.go:1309 msg="llama runner started in 1.69 seconds"
[GIN] 2025/10/11 - 19:06:41 | 200 |   11.3786655s |       127.0.0.1 | POST     "/api/chat"

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.12.5

GiteaMirror added the bug label 2026-04-12 20:55:50 -05:00

@GlisseManTV commented on GitHub (Oct 11, 2025):

Hi!
I just updated to 0.12.5 and was worried when I read this.
I tried it, though, and it works perfectly, just like before.
In fact, for the same settings / models / query, the update alone raised my throughput from 55 to 70 tokens/s.
I'm using 2 x RTX 3060 with Ollama desktop for Windows on Windows Server 2025.

@rick-github commented on GitHub (Oct 11, 2025):

Don't set OLLAMA_LLM_LIBRARY.
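
For reference, the server config at the top of the issue shows `OLLAMA_LLM_LIBRARY:cuda`, and per this thread clearing that variable restored GPU use. A minimal sketch of checking and removing it in PowerShell, assuming it was set as a user-level environment variable (it may instead be set at machine scope):

```powershell
# Show the current value, if any; empty output means it is not set at this scope
[Environment]::GetEnvironmentVariable("OLLAMA_LLM_LIBRARY", "User")

# Remove the user-level variable; passing $null deletes it entirely
[Environment]::SetEnvironmentVariable("OLLAMA_LLM_LIBRARY", $null, "User")
```

Quit and relaunch Ollama afterwards; running processes keep the environment they started with.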

@devrockin commented on GitHub (Oct 11, 2025):

It worked perfectly until I restored some environment variables for Ollama. Thank you.

@Railway9784 commented on GitHub (Oct 12, 2025):

I'm having the same issue, and my OLLAMA_LLM_LIBRARY is unset. Going back to v0.12.3 fixes it. I'm running Ollama in WSL on Windows (GPU: NVIDIA, CPU: AMD, Ollama version 0.12.5).
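
Before digging into Ollama itself, a quick sanity check is whether the GPU is visible inside WSL at all, and where Ollama actually placed the model. A hedged sketch using standard tooling (nvidia-smi inside WSL requires the Windows NVIDIA driver with WSL support):

```shell
# Inside the WSL shell: list the GPUs the driver exposes, including the GPU UUID
# that UUID-style CUDA_VISIBLE_DEVICES filters refer to
nvidia-smi -L

# With a model loaded, show whether Ollama placed it on CPU or GPU
ollama ps
```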

<!-- gh-comment-id:3393796537 --> @Railway9784 commented on GitHub (Oct 12, 2025): I'm having same issue and my `OLLAMA_LLM_LIBRARY` is unset. Going back to v0.12.3 fixes it. I'm running Ollama on Windows in WSL (GPU - Nvidia, CPU - AMD, Ollama version 0.12.5)

@rick-github commented on GitHub (Oct 12, 2025):

A [server log](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will help in debugging.

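For a debug-level log, one way to capture it is shown below (a minimal sketch, assuming `ollama` is on your PATH; setting `OLLAMA_DEBUG=1` in the shell before running `ollama serve` works just as well):

```python
import os
import subprocess

# Start the server with debug logging enabled and capture everything it
# prints into server.log. Stop it with Ctrl-C once the issue is reproduced.
env = dict(os.environ, OLLAMA_DEBUG="1")
with open("server.log", "w") as log:
    subprocess.run(["ollama", "serve"], env=env, stdout=log, stderr=subprocess.STDOUT)
```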

@Railway9784 commented on GitHub (Oct 12, 2025):

My apologies! Server log below:

```
time=2025-10-11T19:23:21.449-07:00 level=INFO source=routes.go:1481 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:DEBUG OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:2562047h47m16.854775807s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/mnt/d/Ollama/Models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-10-11T19:23:22.257-07:00 level=INFO source=images.go:522 msg="total blobs: 44"
time=2025-10-11T19:23:22.663-07:00 level=INFO source=images.go:529 msg="total unused blobs removed: 0"
time=2025-10-11T19:23:22.971-07:00 level=INFO source=routes.go:1534 msg="Listening on 127.0.0.1:11434 (version 0.12.5)"
time=2025-10-11T19:23:22.971-07:00 level=DEBUG source=sched.go:122 msg="starting llm scheduler"
time=2025-10-11T19:23:22.972-07:00 level=INFO source=runner.go:80 msg="discovering available GPUs..."
time=2025-10-11T19:23:22.972-07:00 level=DEBUG source=runner.go:411 msg="spawing runner with" OLLAMA_LIBRARY_PATH="[/usr/local/lib/ollama /usr/local/lib/ollama/cuda_v13]" extra_envs=[]
time=2025-10-11T19:23:23.606-07:00 level=DEBUG source=runner.go:414 msg="bootstrap discovery took" duration=634.224933ms OLLAMA_LIBRARY_PATH="[/usr/local/lib/ollama /usr/local/lib/ollama/cuda_v13]" extra_envs=[]
time=2025-10-11T19:23:23.606-07:00 level=DEBUG source=runner.go:411 msg="spawing runner with" OLLAMA_LIBRARY_PATH="[/usr/local/lib/ollama /usr/local/lib/ollama/cuda_v12]" extra_envs=[]
time=2025-10-11T19:23:24.087-07:00 level=DEBUG source=runner.go:414 msg="bootstrap discovery took" duration=480.983317ms OLLAMA_LIBRARY_PATH="[/usr/local/lib/ollama /usr/local/lib/ollama/cuda_v12]" extra_envs=[]
time=2025-10-11T19:23:24.087-07:00 level=DEBUG source=runner.go:117 msg="filtering out unsupported or overlapping GPU library combinations" count=2
time=2025-10-11T19:23:24.088-07:00 level=DEBUG source=runner.go:129 msg="verifying GPU is supported" library=/usr/local/lib/ollama/cuda_v13 description="NVIDIA GeForce GTX 970M" compute=5.2 pci_id=01:00.0
time=2025-10-11T19:23:24.088-07:00 level=DEBUG source=runner.go:129 msg="verifying GPU is supported" library=/usr/local/lib/ollama/cuda_v12 description="NVIDIA GeForce GTX 970M" compute=5.2 pci_id=01:00.0
time=2025-10-11T19:23:24.088-07:00 level=DEBUG source=runner.go:411 msg="spawing runner with" OLLAMA_LIBRARY_PATH="[/usr/local/lib/ollama /usr/local/lib/ollama/cuda_v12]" extra_envs="[GGML_CUDA_INIT=1 CUDA_VISIBLE_DEVICES=GPU-00643e8a-080e-de25-fcd1-b1ac9557b901]"
time=2025-10-11T19:23:24.088-07:00 level=DEBUG source=runner.go:411 msg="spawing runner with" OLLAMA_LIBRARY_PATH="[/usr/local/lib/ollama /usr/local/lib/ollama/cuda_v13]" extra_envs="[GGML_CUDA_INIT=1 CUDA_VISIBLE_DEVICES=GPU-00643e8a-080e-de25-fcd1-b1ac9557b901]"
time=2025-10-11T19:23:24.630-07:00 level=DEBUG source=runner.go:414 msg="bootstrap discovery took" duration=541.642183ms OLLAMA_LIBRARY_PATH="[/usr/local/lib/ollama /usr/local/lib/ollama/cuda_v13]" extra_envs="[GGML_CUDA_INIT=1 CUDA_VISIBLE_DEVICES=GPU-00643e8a-080e-de25-fcd1-b1ac9557b901]"
time=2025-10-11T19:23:24.630-07:00 level=DEBUG source=runner.go:414 msg="bootstrap discovery took" duration=541.777817ms OLLAMA_LIBRARY_PATH="[/usr/local/lib/ollama /usr/local/lib/ollama/cuda_v12]" extra_envs="[GGML_CUDA_INIT=1 CUDA_VISIBLE_DEVICES=GPU-00643e8a-080e-de25-fcd1-b1ac9557b901]"
time=2025-10-11T19:23:24.630-07:00 level=DEBUG source=runner.go:45 msg="GPU bootstrap discovery took" duration=1.658505875s
time=2025-10-11T19:23:24.630-07:00 level=INFO source=types.go:129 msg="inference compute" id=cpu library=cpu compute="" name=cpu description=cpu libdirs=ollama driver="" pci_id="" type="" total="15.6 GiB" available="15.0 GiB"
time=2025-10-11T19:23:24.630-07:00 level=INFO source=routes.go:1575 msg="entering low vram mode" "total vram"="0 B" threshold="20.0 GiB"
[GIN] 2025/10/11 - 19:33:18 | 200 |    4.309618ms |       127.0.0.1 | HEAD     "/"
[GIN] 2025/10/11 - 19:33:18 | 200 |  442.040755ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2025/10/11 - 19:34:06 | 200 |      36.075µs |       127.0.0.1 | HEAD     "/"
time=2025-10-11T19:34:07.514-07:00 level=DEBUG source=ggml.go:275 msg="key with type not found" key=general.alignment default=32
[GIN] 2025/10/11 - 19:34:07 | 200 |  800.270141ms |       127.0.0.1 | POST     "/api/show"
time=2025-10-11T19:34:08.575-07:00 level=DEBUG source=runner.go:250 msg="refreshing free memory"
time=2025-10-11T19:34:08.575-07:00 level=DEBUG source=runner.go:45 msg="overall device VRAM discovery took" duration=74.208µs
time=2025-10-11T19:34:08.575-07:00 level=DEBUG source=sched.go:194 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=3 gpu_count=1
time=2025-10-11T19:34:08.699-07:00 level=DEBUG source=ggml.go:275 msg="key with type not found" key=general.alignment default=32
time=2025-10-11T19:34:08.703-07:00 level=DEBUG source=sched.go:214 msg="loading first model" model=/mnt/d/Ollama/Models/blobs/sha256-b2a6e56f8bf2e0b06de5c4f7a5468a586c319bf9937ca3b3a4799406114a7ef6
llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from /mnt/d/Ollama/Models/blobs/sha256-b2a6e56f8bf2e0b06de5c4f7a5468a586c319bf9937ca3b3a4799406114a7ef6 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Qwen 1.5B
llama_model_loader: - kv   3:                           general.basename str              = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv   4:                         general.size_label str              = 1.5B
llama_model_loader: - kv   5:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   6:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   7:                     qwen2.embedding_length u32              = 1536
llama_model_loader: - kv   8:                  qwen2.feed_forward_length u32              = 8960
llama_model_loader: - kv   9:                 qwen2.attention.head_count u32              = 12
llama_model_loader: - kv  10:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv  11:                       qwen2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  12:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  13:                          general.file_type u32              = 7
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  18:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 151646
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q8_0:  198 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 1.76 GiB (8.50 BPW)
init_tokenizer: initializing tokenizer for type 2
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151645 '<|Assistant|>' is not marked as EOG
load: control token: 151644 '<|User|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151646 '<|begin▁of▁sentence|>' is not marked as EOG
load: control token: 151643 '<|end▁of▁sentence|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151647 '<|EOT|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 151643 ('<|end▁of▁sentence|>')
load:   - 151662 ('<|fim_pad|>')
load:   - 151663 ('<|repo_name|>')
load:   - 151664 ('<|file_sep|>')
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 1.78 B
print_info: general.name     = DeepSeek R1 Distill Qwen 1.5B
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151646 '<|begin▁of▁sentence|>'
print_info: EOS token        = 151643 '<|end▁of▁sentence|>'
print_info: EOT token        = 151643 '<|end▁of▁sentence|>'
print_info: PAD token        = 151643 '<|end▁of▁sentence|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|end▁of▁sentence|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-10-11T19:34:13.043-07:00 level=INFO source=server.go:400 msg="starting runner" cmd="/usr/local/bin/ollama runner --model /mnt/d/Ollama/Models/blobs/sha256-b2a6e56f8bf2e0b06de5c4f7a5468a586c319bf9937ca3b3a4799406114a7ef6 --port 43339"
time=2025-10-11T19:34:13.044-07:00 level=DEBUG source=server.go:401 msg=subprocess OLLAMA_DEBUG=1 OLLAMA_MODELS=/mnt/d/Ollama/Models PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/lib/wsl/lib:/mnt/c/Users/Studio/AppData/Local/Programs/Python/Python313/Scripts/:/mnt/c/Users/Studio/AppData/Local/Programs/Python/Python313/:/mnt/c/Windows/system32:/mnt/c/Windows:/mnt/c/Windows/System32/Wbem:/mnt/c/Windows/System32/WindowsPowerShell/v1.0/:/mnt/c/Windows/System32/OpenSSH/:/mnt/c/Program Files (x86)/NVIDIA Corporation/PhysX/Common:/mnt/c/Program Files (x86)/QuickTime/QTSystem/:/mnt/c/Program Files/dotnet/:/mnt/c/WINDOWS/system32:/mnt/c/WINDOWS:/mnt/c/WINDOWS/System32/Wbem:/mnt/c/WINDOWS/System32/WindowsPowerShell/v1.0/:/mnt/c/WINDOWS/System32/OpenSSH/:/mnt/c/Program Files/Antelope Audio/Antelope Launcher/:/mnt/c/Program Files/NVIDIA Corporation/NVIDIA App/NvDLISR:/mnt/c/Program Files/Docker/Docker/resources/bin:/mnt/c/Users/Studio/AppData/Local/Microsoft/WindowsApps:/mnt/c/Users/Studio/AppData/Local/Programs/Ollama:/mnt/c/Users/Studio/.lmstudio/bin:/snap/bin" OLLAMA_KEEP_ALIVE=-1 OLLAMA_MAX_LOADED_MODELS=3 LD_LIBRARY_PATH=/usr/local/lib/ollama:/usr/local/lib/ollama OLLAMA_LIBRARY_PATH=/usr/local/lib/ollama
time=2025-10-11T19:34:13.047-07:00 level=INFO source=server.go:505 msg="system memory" total="15.6 GiB" free="15.0 GiB" free_swap="4.0 GiB"
time=2025-10-11T19:34:13.047-07:00 level=DEBUG source=memory.go:181 msg=evaluating library=cpu gpu_count=1 available="[15.0 GiB]"
time=2025-10-11T19:34:13.047-07:00 level=DEBUG source=ggml.go:275 msg="key with type not found" key=qwen2.vision.block_count default=0
time=2025-10-11T19:34:13.047-07:00 level=DEBUG source=ggml.go:275 msg="key with type not found" key=qwen2.attention.key_length default=128
time=2025-10-11T19:34:13.047-07:00 level=DEBUG source=ggml.go:275 msg="key with type not found" key=qwen2.attention.value_length default=128
time=2025-10-11T19:34:13.048-07:00 level=DEBUG source=ggml.go:610 msg="default cache size estimate" "attention MiB"=112 "attention bytes"=117440512 "recurrent MiB"=0 "recurrent bytes"=0
time=2025-10-11T19:34:13.048-07:00 level=INFO source=memory.go:36 msg="new model will fit in available VRAM across minimum required GPUs, loading" model=/mnt/d/Ollama/Models/blobs/sha256-b2a6e56f8bf2e0b06de5c4f7a5468a586c319bf9937ca3b3a4799406114a7ef6 library=cpu parallel=1 required="0 B" gpus=1
time=2025-10-11T19:34:13.048-07:00 level=DEBUG source=memory.go:181 msg=evaluating library=cpu gpu_count=1 available="[15.0 GiB]"
time=2025-10-11T19:34:13.048-07:00 level=DEBUG source=ggml.go:275 msg="key with type not found" key=qwen2.vision.block_count default=0
time=2025-10-11T19:34:13.048-07:00 level=DEBUG source=ggml.go:275 msg="key with type not found" key=qwen2.attention.key_length default=128
time=2025-10-11T19:34:13.048-07:00 level=DEBUG source=ggml.go:275 msg="key with type not found" key=qwen2.attention.value_length default=128
time=2025-10-11T19:34:13.049-07:00 level=DEBUG source=ggml.go:610 msg="default cache size estimate" "attention MiB"=112 "attention bytes"=117440512 "recurrent MiB"=0 "recurrent bytes"=0
time=2025-10-11T19:34:13.049-07:00 level=INFO source=server.go:545 msg=offload library=cpu layers.requested=-1 layers.model=29 layers.offload=0 layers.split=[] memory.available="[15.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="2.0 GiB" memory.required.partial="0 B" memory.required.kv="112.0 MiB" memory.required.allocations="[2.0 GiB]" memory.weights.total="1.5 GiB" memory.weights.repeating="1.3 GiB" memory.weights.nonrepeating="236.5 MiB" memory.graph.full="299.8 MiB" memory.graph.partial="482.3 MiB"
time=2025-10-11T19:34:13.081-07:00 level=INFO source=runner.go:864 msg="starting go runner"
time=2025-10-11T19:34:13.081-07:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/local/lib/ollama
load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
time=2025-10-11T19:34:13.092-07:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 compiler=cgo(gcc)
time=2025-10-11T19:34:13.099-07:00 level=INFO source=runner.go:900 msg="Server listening on 127.0.0.1:43339"
time=2025-10-11T19:34:13.109-07:00 level=INFO source=runner.go:799 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:4096 KvCacheType: NumThreads:4 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-11T19:34:13.110-07:00 level=INFO source=server.go:1271 msg="waiting for llama runner to start responding"
time=2025-10-11T19:34:13.111-07:00 level=INFO source=server.go:1305 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from /mnt/d/Ollama/Models/blobs/sha256-b2a6e56f8bf2e0b06de5c4f7a5468a586c319bf9937ca3b3a4799406114a7ef6 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Qwen 1.5B
llama_model_loader: - kv   3:                           general.basename str              = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv   4:                         general.size_label str              = 1.5B
llama_model_loader: - kv   5:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   6:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   7:                     qwen2.embedding_length u32              = 1536
llama_model_loader: - kv   8:                  qwen2.feed_forward_length u32              = 8960
llama_model_loader: - kv   9:                 qwen2.attention.head_count u32              = 12
llama_model_loader: - kv  10:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv  11:                       qwen2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  12:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  13:                          general.file_type u32              = 7
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  18:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 151646
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q8_0:  198 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 1.76 GiB (8.50 BPW)
init_tokenizer: initializing tokenizer for type 2
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151645 '<|Assistant|>' is not marked as EOG
load: control token: 151644 '<|User|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151646 '<|begin▁of▁sentence|>' is not marked as EOG
load: control token: 151643 '<|end▁of▁sentence|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151647 '<|EOT|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 151643 ('<|end▁of▁sentence|>')
load:   - 151662 ('<|fim_pad|>')
load:   - 151663 ('<|repo_name|>')
load:   - 151664 ('<|file_sep|>')
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 1536
print_info: n_layer          = 28
print_info: n_head           = 12
print_info: n_head_kv        = 2
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 6
print_info: n_embd_k_gqa     = 256
print_info: n_embd_v_gqa     = 256
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 8960
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = -1
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: model type       = 1.5B
print_info: model params     = 1.78 B
print_info: general.name     = DeepSeek R1 Distill Qwen 1.5B
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151646 '<|begin▁of▁sentence|>'
print_info: EOS token        = 151643 '<|end▁of▁sentence|>'
print_info: EOT token        = 151643 '<|end▁of▁sentence|>'
print_info: PAD token        = 151643 '<|end▁of▁sentence|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|end▁of▁sentence|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: layer   0 assigned to device CPU, is_swa = 0
load_tensors: layer   1 assigned to device CPU, is_swa = 0
load_tensors: layer   2 assigned to device CPU, is_swa = 0
load_tensors: layer   3 assigned to device CPU, is_swa = 0
load_tensors: layer   4 assigned to device CPU, is_swa = 0
load_tensors: layer   5 assigned to device CPU, is_swa = 0
load_tensors: layer   6 assigned to device CPU, is_swa = 0
load_tensors: layer   7 assigned to device CPU, is_swa = 0
load_tensors: layer   8 assigned to device CPU, is_swa = 0
load_tensors: layer   9 assigned to device CPU, is_swa = 0
load_tensors: layer  10 assigned to device CPU, is_swa = 0
load_tensors: layer  11 assigned to device CPU, is_swa = 0
load_tensors: layer  12 assigned to device CPU, is_swa = 0
load_tensors: layer  13 assigned to device CPU, is_swa = 0
load_tensors: layer  14 assigned to device CPU, is_swa = 0
load_tensors: layer  15 assigned to device CPU, is_swa = 0
load_tensors: layer  16 assigned to device CPU, is_swa = 0
load_tensors: layer  17 assigned to device CPU, is_swa = 0
load_tensors: layer  18 assigned to device CPU, is_swa = 0
load_tensors: layer  19 assigned to device CPU, is_swa = 0
load_tensors: layer  20 assigned to device CPU, is_swa = 0
load_tensors: layer  21 assigned to device CPU, is_swa = 0
load_tensors: layer  22 assigned to device CPU, is_swa = 0
load_tensors: layer  23 assigned to device CPU, is_swa = 0
load_tensors: layer  24 assigned to device CPU, is_swa = 0
load_tensors: layer  25 assigned to device CPU, is_swa = 0
load_tensors: layer  26 assigned to device CPU, is_swa = 0
load_tensors: layer  27 assigned to device CPU, is_swa = 0
load_tensors: layer  28 assigned to device CPU, is_swa = 0
create_tensor: loading tensor token_embd.weight
create_tensor: loading tensor output_norm.weight
create_tensor: loading tensor output.weight
create_tensor: loading tensor blk.0.attn_norm.weight
create_tensor: loading tensor blk.0.attn_q.weight
create_tensor: loading tensor blk.0.attn_k.weight
create_tensor: loading tensor blk.0.attn_v.weight
create_tensor: loading tensor blk.0.attn_output.weight
create_tensor: loading tensor blk.0.attn_q.bias
create_tensor: loading tensor blk.0.attn_k.bias
create_tensor: loading tensor blk.0.attn_v.bias
create_tensor: loading tensor blk.0.ffn_norm.weight
create_tensor: loading tensor blk.0.ffn_gate.weight
create_tensor: loading tensor blk.0.ffn_down.weight
create_tensor: loading tensor blk.0.ffn_up.weight
create_tensor: loading tensor blk.1.attn_norm.weight
create_tensor: loading tensor blk.1.attn_q.weight
create_tensor: loading tensor blk.1.attn_k.weight
create_tensor: loading tensor blk.1.attn_v.weight
create_tensor: loading tensor blk.1.attn_output.weight
create_tensor: loading tensor blk.1.attn_q.bias
create_tensor: loading tensor blk.1.attn_k.bias
create_tensor: loading tensor blk.1.attn_v.bias
create_tensor: loading tensor blk.1.ffn_norm.weight
create_tensor: loading tensor blk.1.ffn_gate.weight
create_tensor: loading tensor blk.1.ffn_down.weight
create_tensor: loading tensor blk.1.ffn_up.weight
create_tensor: loading tensor blk.2.attn_norm.weight
create_tensor: loading tensor blk.2.attn_q.weight
create_tensor: loading tensor blk.2.attn_k.weight
create_tensor: loading tensor blk.2.attn_v.weight
create_tensor: loading tensor blk.2.attn_output.weight
create_tensor: loading tensor blk.2.attn_q.bias
create_tensor: loading tensor blk.2.attn_k.bias
create_tensor: loading tensor blk.2.attn_v.bias
create_tensor: loading tensor blk.2.ffn_norm.weight
create_tensor: loading tensor blk.2.ffn_gate.weight
create_tensor: loading tensor blk.2.ffn_down.weight
create_tensor: loading tensor blk.2.ffn_up.weight
create_tensor: loading tensor blk.3.attn_norm.weight
create_tensor: loading tensor blk.3.attn_q.weight
create_tensor: loading tensor blk.3.attn_k.weight
create_tensor: loading tensor blk.3.attn_v.weight
create_tensor: loading tensor blk.3.attn_output.weight
create_tensor: loading tensor blk.3.attn_q.bias
create_tensor: loading tensor blk.3.attn_k.bias
create_tensor: loading tensor blk.3.attn_v.bias
create_tensor: loading tensor blk.3.ffn_norm.weight
create_tensor: loading tensor blk.3.ffn_gate.weight
create_tensor: loading tensor blk.3.ffn_down.weight
create_tensor: loading tensor blk.3.ffn_up.weight
create_tensor: loading tensor blk.4.attn_norm.weight
create_tensor: loading tensor blk.4.attn_q.weight
create_tensor: loading tensor blk.4.attn_k.weight
create_tensor: loading tensor blk.4.attn_v.weight
create_tensor: loading tensor blk.4.attn_output.weight
create_tensor: loading tensor blk.4.attn_q.bias
create_tensor: loading tensor blk.4.attn_k.bias
create_tensor: loading tensor blk.4.attn_v.bias
create_tensor: loading tensor blk.4.ffn_norm.weight
create_tensor: loading tensor blk.4.ffn_gate.weight
create_tensor: loading tensor blk.4.ffn_down.weight
create_tensor: loading tensor blk.4.ffn_up.weight
create_tensor: loading tensor blk.5.attn_norm.weight
create_tensor: loading tensor blk.5.attn_q.weight
create_tensor: loading tensor blk.5.attn_k.weight
create_tensor: loading tensor blk.5.attn_v.weight
create_tensor: loading tensor blk.5.attn_output.weight
create_tensor: loading tensor blk.5.attn_q.bias
create_tensor: loading tensor blk.5.attn_k.bias
create_tensor: loading tensor blk.5.attn_v.bias
create_tensor: loading tensor blk.5.ffn_norm.weight
create_tensor: loading tensor blk.5.ffn_gate.weight
create_tensor: loading tensor blk.5.ffn_down.weight
create_tensor: loading tensor blk.5.ffn_up.weight
create_tensor: loading tensor blk.6.attn_norm.weight
create_tensor: loading tensor blk.6.attn_q.weight
create_tensor: loading tensor blk.6.attn_k.weight
create_tensor: loading tensor blk.6.attn_v.weight
create_tensor: loading tensor blk.6.attn_output.weight
create_tensor: loading tensor blk.6.attn_q.bias
create_tensor: loading tensor blk.6.attn_k.bias
create_tensor: loading tensor blk.6.attn_v.bias
create_tensor: loading tensor blk.6.ffn_norm.weight
create_tensor: loading tensor blk.6.ffn_gate.weight
create_tensor: loading tensor blk.6.ffn_down.weight
create_tensor: loading tensor blk.6.ffn_up.weight
create_tensor: loading tensor blk.7.attn_norm.weight
create_tensor: loading tensor blk.7.attn_q.weight
create_tensor: loading tensor blk.7.attn_k.weight
create_tensor: loading tensor blk.7.attn_v.weight
create_tensor: loading tensor blk.7.attn_output.weight
create_tensor: loading tensor blk.7.attn_q.bias
create_tensor: loading tensor blk.7.attn_k.bias
create_tensor: loading tensor blk.7.attn_v.bias
create_tensor: loading tensor blk.7.ffn_norm.weight
create_tensor: loading tensor blk.7.ffn_gate.weight
create_tensor: loading tensor blk.7.ffn_down.weight
create_tensor: loading tensor blk.7.ffn_up.weight
create_tensor: loading tensor blk.8.attn_norm.weight
create_tensor: loading tensor blk.8.attn_q.weight
create_tensor: loading tensor blk.8.attn_k.weight
create_tensor: loading tensor blk.8.attn_v.weight
create_tensor: loading tensor blk.8.attn_output.weight
create_tensor: loading tensor blk.8.attn_q.bias
create_tensor: loading tensor blk.8.attn_k.bias
create_tensor: loading tensor blk.8.attn_v.bias
create_tensor: loading tensor blk.8.ffn_norm.weight
create_tensor: loading tensor blk.8.ffn_gate.weight
create_tensor: loading tensor blk.8.ffn_down.weight
create_tensor: loading tensor blk.8.ffn_up.weight
create_tensor: loading tensor blk.9.attn_norm.weight
create_tensor: loading tensor blk.9.attn_q.weight
create_tensor: loading tensor blk.9.attn_k.weight
create_tensor: loading tensor blk.9.attn_v.weight
create_tensor: loading tensor blk.9.attn_output.weight
create_tensor: loading tensor blk.9.attn_q.bias
create_tensor: loading tensor blk.9.attn_k.bias
create_tensor: loading tensor blk.9.attn_v.bias
create_tensor: loading tensor blk.9.ffn_norm.weight
create_tensor: loading tensor blk.9.ffn_gate.weight
create_tensor: loading tensor blk.9.ffn_down.weight
create_tensor: loading tensor blk.9.ffn_up.weight
create_tensor: loading tensor blk.10.attn_norm.weight
create_tensor: loading tensor blk.10.attn_q.weight
create_tensor: loading tensor blk.10.attn_k.weight
create_tensor: loading tensor blk.10.attn_v.weight
create_tensor: loading tensor blk.10.attn_output.weight
create_tensor: loading tensor blk.10.attn_q.bias
create_tensor: loading tensor blk.10.attn_k.bias
create_tensor: loading tensor blk.10.attn_v.bias
create_tensor: loading tensor blk.10.ffn_norm.weight
create_tensor: loading tensor blk.10.ffn_gate.weight
create_tensor: loading tensor blk.10.ffn_down.weight
create_tensor: loading tensor blk.10.ffn_up.weight
create_tensor: loading tensor blk.11.attn_norm.weight
create_tensor: loading tensor blk.11.attn_q.weight
create_tensor: loading tensor blk.11.attn_k.weight
create_tensor: loading tensor blk.11.attn_v.weight
create_tensor: loading tensor blk.11.attn_output.weight
create_tensor: loading tensor blk.11.attn_q.bias
create_tensor: loading tensor blk.11.attn_k.bias
create_tensor: loading tensor blk.11.attn_v.bias
create_tensor: loading tensor blk.11.ffn_norm.weight
create_tensor: loading tensor blk.11.ffn_gate.weight
create_tensor: loading tensor blk.11.ffn_down.weight
create_tensor: loading tensor blk.11.ffn_up.weight
create_tensor: loading tensor blk.12.attn_norm.weight
create_tensor: loading tensor blk.12.attn_q.weight
create_tensor: loading tensor blk.12.attn_k.weight
create_tensor: loading tensor blk.12.attn_v.weight
create_tensor: loading tensor blk.12.attn_output.weight
create_tensor: loading tensor blk.12.attn_q.bias
create_tensor: loading tensor blk.12.attn_k.bias
create_tensor: loading tensor blk.12.attn_v.bias
create_tensor: loading tensor blk.12.ffn_norm.weight
create_tensor: loading tensor blk.12.ffn_gate.weight
create_tensor: loading tensor blk.12.ffn_down.weight
create_tensor: loading tensor blk.12.ffn_up.weight
create_tensor: loading tensor blk.13.attn_norm.weight
create_tensor: loading tensor blk.13.attn_q.weight
create_tensor: loading tensor blk.13.attn_k.weight
create_tensor: loading tensor blk.13.attn_v.weight
create_tensor: loading tensor blk.13.attn_output.weight
create_tensor: loading tensor blk.13.attn_q.bias
create_tensor: loading tensor blk.13.attn_k.bias
create_tensor: loading tensor blk.13.attn_v.bias
create_tensor: loading tensor blk.13.ffn_norm.weight
create_tensor: loading tensor blk.13.ffn_gate.weight
create_tensor: loading tensor blk.13.ffn_down.weight
create_tensor: loading tensor blk.13.ffn_up.weight
create_tensor: loading tensor blk.14.attn_norm.weight
create_tensor: loading tensor blk.14.attn_q.weight
create_tensor: loading tensor blk.14.attn_k.weight
create_tensor: loading tensor blk.14.attn_v.weight
create_tensor: loading tensor blk.14.attn_output.weight
create_tensor: loading tensor blk.14.attn_q.bias
create_tensor: loading tensor blk.14.attn_k.bias
create_tensor: loading tensor blk.14.attn_v.bias
create_tensor: loading tensor blk.14.ffn_norm.weight
create_tensor: loading tensor blk.14.ffn_gate.weight
create_tensor: loading tensor blk.14.ffn_down.weight
create_tensor: loading tensor blk.14.ffn_up.weight
create_tensor: loading tensor blk.15.attn_norm.weight
create_tensor: loading tensor blk.15.attn_q.weight
create_tensor: loading tensor blk.15.attn_k.weight
create_tensor: loading tensor blk.15.attn_v.weight
create_tensor: loading tensor blk.15.attn_output.weight
create_tensor: loading tensor blk.15.attn_q.bias
create_tensor: loading tensor blk.15.attn_k.bias
create_tensor: loading tensor blk.15.attn_v.bias
create_tensor: loading tensor blk.15.ffn_norm.weight
create_tensor: loading tensor blk.15.ffn_gate.weight
create_tensor: loading tensor blk.15.ffn_down.weight
create_tensor: loading tensor blk.15.ffn_up.weight
create_tensor: loading tensor blk.16.attn_norm.weight
create_tensor: loading tensor blk.16.attn_q.weight
create_tensor: loading tensor blk.16.attn_k.weight
create_tensor: loading tensor blk.16.attn_v.weight
create_tensor: loading tensor blk.16.attn_output.weight
create_tensor: loading tensor blk.16.attn_q.bias
create_tensor: loading tensor blk.16.attn_k.bias
create_tensor: loading tensor blk.16.attn_v.bias
create_tensor: loading tensor blk.16.ffn_norm.weight
create_tensor: loading tensor blk.16.ffn_gate.weight
create_tensor: loading tensor blk.16.ffn_down.weight
create_tensor: loading tensor blk.16.ffn_up.weight
create_tensor: loading tensor blk.17.attn_norm.weight
create_tensor: loading tensor blk.17.attn_q.weight
create_tensor: loading tensor blk.17.attn_k.weight
create_tensor: loading tensor blk.17.attn_v.weight
create_tensor: loading tensor blk.17.attn_output.weight
create_tensor: loading tensor blk.17.attn_q.bias
create_tensor: loading tensor blk.17.attn_k.bias
create_tensor: loading tensor blk.17.attn_v.bias
create_tensor: loading tensor blk.17.ffn_norm.weight
create_tensor: loading tensor blk.17.ffn_gate.weight
create_tensor: loading tensor blk.17.ffn_down.weight
create_tensor: loading tensor blk.17.ffn_up.weight
create_tensor: loading tensor blk.18.attn_norm.weight
create_tensor: loading tensor blk.18.attn_q.weight
create_tensor: loading tensor blk.18.attn_k.weight
create_tensor: loading tensor blk.18.attn_v.weight
create_tensor: loading tensor blk.18.attn_output.weight
create_tensor: loading tensor blk.18.attn_q.bias
create_tensor: loading tensor blk.18.attn_k.bias
create_tensor: loading tensor blk.18.attn_v.bias
create_tensor: loading tensor blk.18.ffn_norm.weight
create_tensor: loading tensor blk.18.ffn_gate.weight
create_tensor: loading tensor blk.18.ffn_down.weight
create_tensor: loading tensor blk.18.ffn_up.weight
create_tensor: loading tensor blk.19.attn_norm.weight
create_tensor: loading tensor blk.19.attn_q.weight
create_tensor: loading tensor blk.19.attn_k.weight
create_tensor: loading tensor blk.19.attn_v.weight
create_tensor: loading tensor blk.19.attn_output.weight
create_tensor: loading tensor blk.19.attn_q.bias
create_tensor: loading tensor blk.19.attn_k.bias
create_tensor: loading tensor blk.19.attn_v.bias
create_tensor: loading tensor blk.19.ffn_norm.weight
create_tensor: loading tensor blk.19.ffn_gate.weight
create_tensor: loading tensor blk.19.ffn_down.weight
create_tensor: loading tensor blk.19.ffn_up.weight
create_tensor: loading tensor blk.20.attn_norm.weight
create_tensor: loading tensor blk.20.attn_q.weight
create_tensor: loading tensor blk.20.attn_k.weight
create_tensor: loading tensor blk.20.attn_v.weight
create_tensor: loading tensor blk.20.attn_output.weight
create_tensor: loading tensor blk.20.attn_q.bias
create_tensor: loading tensor blk.20.attn_k.bias
create_tensor: loading tensor blk.20.attn_v.bias
create_tensor: loading tensor blk.20.ffn_norm.weight
create_tensor: loading tensor blk.20.ffn_gate.weight
create_tensor: loading tensor blk.20.ffn_down.weight
create_tensor: loading tensor blk.20.ffn_up.weight
create_tensor: loading tensor blk.21.attn_norm.weight
create_tensor: loading tensor blk.21.attn_q.weight
create_tensor: loading tensor blk.21.attn_k.weight
create_tensor: loading tensor blk.21.attn_v.weight
create_tensor: loading tensor blk.21.attn_output.weight
create_tensor: loading tensor blk.21.attn_q.bias
create_tensor: loading tensor blk.21.attn_k.bias
create_tensor: loading tensor blk.21.attn_v.bias
create_tensor: loading tensor blk.21.ffn_norm.weight
create_tensor: loading tensor blk.21.ffn_gate.weight
create_tensor: loading tensor blk.21.ffn_down.weight
create_tensor: loading tensor blk.21.ffn_up.weight
create_tensor: loading tensor blk.22.attn_norm.weight
create_tensor: loading tensor blk.22.attn_q.weight
create_tensor: loading tensor blk.22.attn_k.weight
create_tensor: loading tensor blk.22.attn_v.weight
create_tensor: loading tensor blk.22.attn_output.weight
create_tensor: loading tensor blk.22.attn_q.bias
create_tensor: loading tensor blk.22.attn_k.bias
create_tensor: loading tensor blk.22.attn_v.bias
create_tensor: loading tensor blk.22.ffn_norm.weight
create_tensor: loading tensor blk.22.ffn_gate.weight
create_tensor: loading tensor blk.22.ffn_down.weight
create_tensor: loading tensor blk.22.ffn_up.weight
create_tensor: loading tensor blk.23.attn_norm.weight
create_tensor: loading tensor blk.23.attn_q.weight
create_tensor: loading tensor blk.23.attn_k.weight
create_tensor: loading tensor blk.23.attn_v.weight
create_tensor: loading tensor blk.23.attn_output.weight
create_tensor: loading tensor blk.23.attn_q.bias
create_tensor: loading tensor blk.23.attn_k.bias
create_tensor: loading tensor blk.23.attn_v.bias
create_tensor: loading tensor blk.23.ffn_norm.weight
create_tensor: loading tensor blk.23.ffn_gate.weight
create_tensor: loading tensor blk.23.ffn_down.weight
create_tensor: loading tensor blk.23.ffn_up.weight
create_tensor: loading tensor blk.24.attn_norm.weight
create_tensor: loading tensor blk.24.attn_q.weight
create_tensor: loading tensor blk.24.attn_k.weight
create_tensor: loading tensor blk.24.attn_v.weight
create_tensor: loading tensor blk.24.attn_output.weight
create_tensor: loading tensor blk.24.attn_q.bias
create_tensor: loading tensor blk.24.attn_k.bias
create_tensor: loading tensor blk.24.attn_v.bias
create_tensor: loading tensor blk.24.ffn_norm.weight
create_tensor: loading tensor blk.24.ffn_gate.weight
create_tensor: loading tensor blk.24.ffn_down.weight
create_tensor: loading tensor blk.24.ffn_up.weight
create_tensor: loading tensor blk.25.attn_norm.weight
create_tensor: loading tensor blk.25.attn_q.weight
create_tensor: loading tensor blk.25.attn_k.weight
create_tensor: loading tensor blk.25.attn_v.weight
create_tensor: loading tensor blk.25.attn_output.weight
create_tensor: loading tensor blk.25.attn_q.bias
create_tensor: loading tensor blk.25.attn_k.bias
create_tensor: loading tensor blk.25.attn_v.bias
create_tensor: loading tensor blk.25.ffn_norm.weight
create_tensor: loading tensor blk.25.ffn_gate.weight
create_tensor: loading tensor blk.25.ffn_down.weight
create_tensor: loading tensor blk.25.ffn_up.weight
create_tensor: loading tensor blk.26.attn_norm.weight
create_tensor: loading tensor blk.26.attn_q.weight
create_tensor: loading tensor blk.26.attn_k.weight
create_tensor: loading tensor blk.26.attn_v.weight
create_tensor: loading tensor blk.26.attn_output.weight
create_tensor: loading tensor blk.26.attn_q.bias
create_tensor: loading tensor blk.26.attn_k.bias
create_tensor: loading tensor blk.26.attn_v.bias
create_tensor: loading tensor blk.26.ffn_norm.weight
create_tensor: loading tensor blk.26.ffn_gate.weight
create_tensor: loading tensor blk.26.ffn_down.weight
create_tensor: loading tensor blk.26.ffn_up.weight
create_tensor: loading tensor blk.27.attn_norm.weight
create_tensor: loading tensor blk.27.attn_q.weight
create_tensor: loading tensor blk.27.attn_k.weight
create_tensor: loading tensor blk.27.attn_v.weight
create_tensor: loading tensor blk.27.attn_output.weight
create_tensor: loading tensor blk.27.attn_q.bias
create_tensor: loading tensor blk.27.attn_k.bias
create_tensor: loading tensor blk.27.attn_v.bias
create_tensor: loading tensor blk.27.ffn_norm.weight
create_tensor: loading tensor blk.27.ffn_gate.weight
create_tensor: loading tensor blk.27.ffn_down.weight
create_tensor: loading tensor blk.27.ffn_up.weight
load_tensors:          CPU model buffer size =  1801.09 MiB
load_all_data: no device found for buffer type CPU for async uploads
time=2025-10-11T19:34:17.785-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.13"
time=2025-10-11T19:34:20.557-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.27"
time=2025-10-11T19:34:20.809-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.28"
time=2025-10-11T19:34:21.063-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.29"
time=2025-10-11T19:34:21.315-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.31"
time=2025-10-11T19:34:21.568-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.32"
time=2025-10-11T19:34:21.820-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.33"
time=2025-10-11T19:34:22.072-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.34"
time=2025-10-11T19:34:22.324-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.35"
time=2025-10-11T19:34:22.575-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.36"
time=2025-10-11T19:34:22.827-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.37"
time=2025-10-11T19:34:23.079-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.39"
time=2025-10-11T19:34:23.332-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.40"
time=2025-10-11T19:34:23.586-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.41"
time=2025-10-11T19:34:23.839-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.42"
time=2025-10-11T19:34:24.094-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.43"
time=2025-10-11T19:34:24.348-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.44"
time=2025-10-11T19:34:24.600-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.45"
time=2025-10-11T19:34:24.852-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.47"
time=2025-10-11T19:34:25.103-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.48"
time=2025-10-11T19:34:25.358-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.48"
time=2025-10-11T19:34:25.609-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.50"
time=2025-10-11T19:34:25.864-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.51"
time=2025-10-11T19:34:26.115-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.53"
time=2025-10-11T19:34:26.370-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.54"
time=2025-10-11T19:34:26.622-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.54"
time=2025-10-11T19:34:26.876-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.56"
time=2025-10-11T19:34:27.128-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.57"
time=2025-10-11T19:34:27.382-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.58"
time=2025-10-11T19:34:27.635-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.59"
time=2025-10-11T19:34:27.888-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.61"
time=2025-10-11T19:34:28.141-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.62"
time=2025-10-11T19:34:28.393-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.63"
time=2025-10-11T19:34:28.645-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.64"
time=2025-10-11T19:34:28.896-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.65"
time=2025-10-11T19:34:29.148-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.66"
time=2025-10-11T19:34:29.399-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.68"
time=2025-10-11T19:34:29.653-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.69"
time=2025-10-11T19:34:29.904-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.69"
time=2025-10-11T19:34:30.158-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.71"
time=2025-10-11T19:34:30.411-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.72"
time=2025-10-11T19:34:30.663-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.74"
time=2025-10-11T19:34:30.916-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.75"
time=2025-10-11T19:34:31.168-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.76"
time=2025-10-11T19:34:31.421-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.77"
time=2025-10-11T19:34:31.672-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.77"
time=2025-10-11T19:34:31.924-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.79"
time=2025-10-11T19:34:32.175-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.80"
time=2025-10-11T19:34:32.428-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.82"
time=2025-10-11T19:34:32.679-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.82"
time=2025-10-11T19:34:32.932-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.83"
time=2025-10-11T19:34:33.186-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.85"
time=2025-10-11T19:34:33.438-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.86"
time=2025-10-11T19:34:33.690-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.87"
time=2025-10-11T19:34:33.942-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.88"
time=2025-10-11T19:34:34.196-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.89"
time=2025-10-11T19:34:34.450-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.90"
time=2025-10-11T19:34:34.704-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.91"
time=2025-10-11T19:34:34.959-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.92"
time=2025-10-11T19:34:35.210-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.93"
time=2025-10-11T19:34:35.464-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.95"
time=2025-10-11T19:34:35.715-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.96"
time=2025-10-11T19:34:35.967-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.97"
time=2025-10-11T19:34:36.219-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.98"
time=2025-10-11T19:34:36.473-07:00 level=DEBUG source=server.go:1315 msg="model load progress 0.99"
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = disabled
llama_context: kv_unified    = false
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context:        CPU  output buffer size =     0.59 MiB
create_memory: n_ctx = 4096 (padded)
llama_kv_cache: layer   0: dev = CPU
llama_kv_cache: layer   1: dev = CPU
llama_kv_cache: layer   2: dev = CPU
llama_kv_cache: layer   3: dev = CPU
llama_kv_cache: layer   4: dev = CPU
llama_kv_cache: layer   5: dev = CPU
llama_kv_cache: layer   6: dev = CPU
llama_kv_cache: layer   7: dev = CPU
llama_kv_cache: layer   8: dev = CPU
llama_kv_cache: layer   9: dev = CPU
llama_kv_cache: layer  10: dev = CPU
llama_kv_cache: layer  11: dev = CPU
llama_kv_cache: layer  12: dev = CPU
llama_kv_cache: layer  13: dev = CPU
llama_kv_cache: layer  14: dev = CPU
llama_kv_cache: layer  15: dev = CPU
llama_kv_cache: layer  16: dev = CPU
llama_kv_cache: layer  17: dev = CPU
llama_kv_cache: layer  18: dev = CPU
llama_kv_cache: layer  19: dev = CPU
llama_kv_cache: layer  20: dev = CPU
llama_kv_cache: layer  21: dev = CPU
llama_kv_cache: layer  22: dev = CPU
llama_kv_cache: layer  23: dev = CPU
llama_kv_cache: layer  24: dev = CPU
llama_kv_cache: layer  25: dev = CPU
llama_kv_cache: layer  26: dev = CPU
llama_kv_cache: layer  27: dev = CPU
llama_kv_cache:        CPU KV buffer size =   112.00 MiB
llama_kv_cache: size =  112.00 MiB (  4096 cells,  28 layers,  1/1 seqs), K (f16):   56.00 MiB, V (f16):   56.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 1
llama_context: max_nodes = 2712
llama_context: reserving full memory module
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 1
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
llama_context:        CPU compute buffer size =   302.75 MiB
llama_context: graph nodes  = 1098
llama_context: graph splits = 1
time=2025-10-11T19:34:36.725-07:00 level=INFO source=server.go:1309 msg="llama runner started in 25.81 seconds"
time=2025-10-11T19:34:36.725-07:00 level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-10-11T19:34:36.725-07:00 level=INFO source=server.go:1271 msg="waiting for llama runner to start responding"
time=2025-10-11T19:34:36.726-07:00 level=INFO source=server.go:1309 msg="llama runner started in 25.81 seconds"
time=2025-10-11T19:34:36.726-07:00 level=DEBUG source=sched.go:493 msg="finished setting up" runner.name=registry.ollama.ai/library/deepseek-r1:1.5b-qwen-distill-q8_0 runner.size="2.0 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=4037 runner.model=/mnt/d/Ollama/Models/blobs/sha256-b2a6e56f8bf2e0b06de5c4f7a5468a586c319bf9937ca3b3a4799406114a7ef6 runner.num_ctx=4096
[GIN] 2025/10/11 - 19:34:36 | 200 | 31.333971475s |       127.0.0.1 | POST     "/api/generate"
time=2025-10-11T19:34:36.726-07:00 level=DEBUG source=sched.go:501 msg="context for request finished"
time=2025-10-11T19:34:36.727-07:00 level=DEBUG source=sched.go:293 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/deepseek-r1:1.5b-qwen-distill-q8_0 runner.size="2.0 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=4037 runner.model=/mnt/d/Ollama/Models/blobs/sha256-b2a6e56f8bf2e0b06de5c4f7a5468a586c319bf9937ca3b3a4799406114a7ef6 runner.num_ctx=4096 duration=2562047h47m16.854775807s
time=2025-10-11T19:34:36.727-07:00 level=DEBUG source=sched.go:311 msg="after processing request finished event" runner.name=registry.ollama.ai/library/deepseek-r1:1.5b-qwen-distill-q8_0 runner.size="2.0 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=4037 runner.model=/mnt/d/Ollama/Models/blobs/sha256-b2a6e56f8bf2e0b06de5c4f7a5468a586c319bf9937ca3b3a4799406114a7ef6 runner.num_ctx=4096 refCount=0
[GIN] 2025/10/11 - 19:34:56 | 200 |      26.108µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/10/11 - 19:34:56 | 200 |     161.525µs |       127.0.0.1 | GET      "/api/ps"
```
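Worth noting that both logs end the same way: GPU bootstrap discovery runs, but the only inference compute registered is the CPU, and the runner loads with `runner.vram="0 B"`. A quick client-side way to confirm whether a loaded model is actually using the GPU is the `/api/ps` endpoint; a minimal sketch (standard-library Python, assuming the default `127.0.0.1:11434` endpoint):

```python
import json
import urllib.request

# List running models; size_vram == 0 while size > 0 means the model is
# resident entirely in system RAM, i.e. running on CPU only.
with urllib.request.urlopen("http://127.0.0.1:11434/api/ps") as resp:
    data = json.load(resp)

for m in data.get("models", []):
    print(f"{m['name']}: size={m['size']} B, size_vram={m['size_vram']} B")
```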

<!-- gh-comment-id:3394345805 --> @rick-github commented on GitHub (Oct 12, 2025):

```
time=2025-10-11T19:23:24.630-07:00 level=DEBUG source=runner.go:414 msg="bootstrap discovery took" duration=541.642183ms OLLAMA_LIBRARY_PATH="[/usr/local/lib/ollama /usr/local/lib/ollama/cuda_v13]" extra_envs="[GGML_CUDA_INIT=1 CUDA_VISIBLE_DEVICES=GPU-00643e8a-080e-de25-fcd1-b1ac9557b901]"
time=2025-10-11T19:23:24.630-07:00 level=DEBUG source=runner.go:414 msg="bootstrap discovery took" duration=541.777817ms OLLAMA_LIBRARY_PATH="[/usr/local/lib/ollama /usr/local/lib/ollama/cuda_v12]" extra_envs="[GGML_CUDA_INIT=1 CUDA_VISIBLE_DEVICES=GPU-00643e8a-080e-de25-fcd1-b1ac9557b901]"
time=2025-10-11T19:23:24.630-07:00 level=DEBUG source=runner.go:45 msg="GPU bootstrap discovery took" duration=1.658505875s
time=2025-10-11T19:23:24.630-07:00 level=INFO source=types.go:129 msg="inference compute" id=cpu library=cpu compute="" name=cpu description=cpu libdirs=ollama driver="" pci_id="" type="" total="15.6 GiB" available="15.0 GiB"
```

Bootstrap discovery didn't detect the 970. Could you post a log from 0.12.3 so we can compare? Also setting `OLLAMA_DEBUG=2` will include more detail which may be useful, but be aware that it will also log the prompt.

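For anyone reproducing this, one minimal way to capture the debug log rick-github asks for is to run the server in the foreground with the variable set. This is a sketch only, assuming a systemd-managed Linux/WSL install with `ollama` on `PATH`; adjust if the server is started differently:

```
# Stop the background service first so the foreground run can bind the port
# (assumes the standard Linux install, which registers a systemd unit named "ollama").
sudo systemctl stop ollama

# Sanity check: confirm the NVIDIA driver can see the GPU inside WSL at all.
nvidia-smi -L

# Run the server in the foreground with maximum debug output.
# Ollama logs to stderr, so redirect that stream to a file for attachment.
OLLAMA_DEBUG=2 ollama serve 2> ollama-debug.log
```

If `nvidia-smi -L` lists no devices, the problem sits below Ollama (driver or WSL GPU passthrough) rather than in the 0.12.5 discovery code.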

<!-- gh-comment-id:3395290156 --> @Railway9784 commented on GitHub (Oct 12, 2025):

Here's the 0.12.5 log with `OLLAMA_DEBUG=2` - [ollama.v0.12.5.log](https://github.com/user-attachments/files/22873999/ollama.v0.12.5.log)

And here's the 0.12.3 log with `OLLAMA_DEBUG=2` - [ollama.v0.12.3.log](https://github.com/user-attachments/files/22873998/ollama.v0.12.3.log)

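With both files downloaded, the discovery lines are the quickest thing to diff between versions. A grep sketch, assuming the attachments are saved locally under the names above (the patterns are illustrative, taken from the 0.12.5 messages quoted earlier in this thread; 0.12.3 may phrase them differently):

```
# Extract the GPU-discovery lines from each attached log and compare them.
grep -E 'inference compute|bootstrap discovery|verifying GPU' ollama.v0.12.3.log > v0.12.3.discovery.txt
grep -E 'inference compute|bootstrap discovery|verifying GPU' ollama.v0.12.5.log > v0.12.5.discovery.txt
diff v0.12.3.discovery.txt v0.12.5.discovery.txt
```
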
Reference: github-starred/ollama#8342