[GH-ISSUE #12912] Something changed after 0.12.6 in memory management in a not good way #70619

Closed
opened 2026-05-04 22:17:44 -05:00 by GiteaMirror · 9 comments

Originally created by @pjv on GitHub (Nov 2, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12912

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

Looking at a bunch of recent issues in here, my guess is that many of them are related, but because of the widely varying contexts in which we are all running ollama, they present differently.

I'm going to detail what I am seeing on my little M2 MacBook with 16GB. Below are two screengrabs of `ollama ps` on 0.12.6 and 0.12.9 (0.12.7 and 0.12.8 behave the same as 0.12.9), taken after the exact same sequence of events: asking the same simple question to an agent via opencode. The agent first poses the question to one model and then summarizes the chat with another model.

From my point of view, the behavior of ollama 0.12.6 is optimal and that of 0.12.9 is sub-optimal.
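For anyone without an opencode setup, the sequence boils down to two chat completions followed by a ps check. A minimal sketch, assuming the server's default port from the logs (11434); the model tags are placeholders for the two models that show up in the logs (a Qwen3 4B Instruct Q8_0 and a Llama 3.2 3B Instruct Q4_K_M), so substitute whatever pair you have pulled:

```shell
# Minimal repro sketch; model tags are placeholders, substitute your own.
# Step 1: ask a simple question with the first model via the
# OpenAI-compatible endpoint, as opencode does:
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-model-tag", "messages": [{"role": "user", "content": "a simple question"}]}'

# Step 2: have the second model summarize the chat:
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2-model-tag", "messages": [{"role": "user", "content": "summarize the chat so far"}]}'

# Step 3: check what is loaded and how it is split between GPU and CPU:
ollama ps
```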

0.12.6:

[Image: ollama ps output on 0.12.6](https://github.com/user-attachments/assets/a4582cd9-076c-415c-b5e5-ad2d570a98a5)

0.12.9:

[Image: ollama ps output on 0.12.9](https://github.com/user-attachments/assets/7b98ddc2-733a-44b1-87f1-7765abd53d10)
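For reference, the same information shown in the screengrabs can also be pulled as JSON from the /api/ps endpoint that shows up in the logs below:

```shell
# JSON equivalent of `ollama ps`: each loaded model is reported with its
# total size and the portion resident in VRAM ("size" vs. "size_vram").
curl -s http://localhost:11434/api/ps
```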

Relevant log output

0.12.6 logs:
time=2025-11-02T05:51:35.279-06:00 level=INFO source=routes.go:1511 msg="server config" env="map[HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:16384 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_KEEP_ALIVE:2562047h47m16.854775807s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/Users/pjv/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false http_proxy: https_proxy: no_proxy:]"
time=2025-11-02T05:51:35.284-06:00 level=INFO source=images.go:522 msg="total blobs: 55"
time=2025-11-02T05:51:35.285-06:00 level=INFO source=images.go:529 msg="total unused blobs removed: 0"
time=2025-11-02T05:51:35.286-06:00 level=INFO source=routes.go:1564 msg="Listening on [::]:11434 (version 0.12.6)"
time=2025-11-02T05:51:35.286-06:00 level=INFO source=runner.go:80 msg="discovering available GPUs..."
time=2025-11-02T05:51:35.411-06:00 level=INFO source=types.go:112 msg="inference compute" id=0 library=Metal compute=0.0 name=Metal description="Apple M2" libdirs="" driver=0.0 pci_id=00:00.0 type=discrete total="10.7 GiB" available="10.7 GiB"
time=2025-11-02T05:51:35.411-06:00 level=INFO source=routes.go:1605 msg="entering low vram mode" "total vram"="10.7 GiB" threshold="20.0 GiB"
[GIN] 2025/11/02 - 05:51:35 | 200 |     581.042µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/11/02 - 05:51:35 | 200 |      408.25µs |       127.0.0.1 | GET      "/api/ps"
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.006 sec
ggml_metal_device_init: GPU name:   Apple M2
ggml_metal_device_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 11453.25 MB
llama_model_load_from_file_impl: using device Metal (Apple M2) (unknown id) - 10922 MiB free
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   8:                          llama.block_count u32              = 28
llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
llama_model_loader: - kv  10:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  16:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  17:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  18:                          general.file_type u32              = 15
llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type q4_K:  168 tensors
llama_model_loader: - type q6_K:   29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 1.87 GiB (5.01 BPW)
load: printing all EOG tokens:
load:   - 128001 ('<|end_of_text|>')
load:   - 128008 ('<|eom_id|>')
load:   - 128009 ('<|eot_id|>')
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = llama
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 3.21 B
print_info: general.name     = Llama 3.2 3B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128001 '<|end_of_text|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128001 '<|end_of_text|>'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-11-02T05:51:49.374-06:00 level=INFO source=server.go:400 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --model /Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --port 51411"
time=2025-11-02T05:51:49.377-06:00 level=INFO source=server.go:505 msg="system memory" total="16.0 GiB" free="10.5 GiB" free_swap="0 B"
time=2025-11-02T05:51:49.377-06:00 level=INFO source=memory.go:36 msg="new model will fit in available VRAM across minimum required GPUs, loading" model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff library=Metal parallel=1 required="4.6 GiB" gpus=1
time=2025-11-02T05:51:49.377-06:00 level=INFO source=server.go:545 msg=offload library=Metal layers.requested=-1 layers.model=29 layers.offload=29 layers.split=[29] memory.available="[10.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="4.6 GiB" memory.required.partial="4.6 GiB" memory.required.kv="1.8 GiB" memory.required.allocations="[4.6 GiB]" memory.weights.total="1.9 GiB" memory.weights.repeating="1.6 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="824.0 MiB" memory.graph.partial="824.0 MiB"
time=2025-11-02T05:51:49.385-06:00 level=INFO source=runner.go:893 msg="starting go runner"
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.018 sec
ggml_metal_device_init: GPU name:   Apple M2
ggml_metal_device_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 11453.25 MB
time=2025-11-02T05:51:49.387-06:00 level=INFO source=ggml.go:104 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 compiler=cgo(clang)
time=2025-11-02T05:51:49.483-06:00 level=INFO source=runner.go:929 msg="Server listening on 127.0.0.1:51411"
time=2025-11-02T05:51:49.487-06:00 level=INFO source=runner.go:828 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:16384 KvCacheType: NumThreads:4 GPULayers:29[ID:0 Layers:29(0..28)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:true}"
llama_model_load_from_file_impl: using device Metal (Apple M2) (unknown id) - 10922 MiB free
time=2025-11-02T05:51:49.487-06:00 level=INFO source=server.go:1272 msg="waiting for llama runner to start responding"
time=2025-11-02T05:51:49.487-06:00 level=INFO source=server.go:1306 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   8:                          llama.block_count u32              = 28
llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
llama_model_loader: - kv  10:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  16:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  17:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  18:                          general.file_type u32              = 15
llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type q4_K:  168 tensors
llama_model_loader: - type q6_K:   29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 1.87 GiB (5.01 BPW)
load: printing all EOG tokens:
load:   - 128001 ('<|end_of_text|>')
load:   - 128008 ('<|eom_id|>')
load:   - 128009 ('<|eot_id|>')
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 3072
print_info: n_layer          = 28
print_info: n_head           = 24
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 3
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 8192
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: model type       = 3B
print_info: model params     = 3.21 B
print_info: general.name     = Llama 3.2 3B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128001 '<|end_of_text|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128001 '<|end_of_text|>'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   308.23 MiB
load_tensors: Metal_Mapped model buffer size =  1918.35 MiB
llama_init_from_model: model default pooling_type is [0], but [-1] was specified
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 16384
llama_context: n_ctx_per_seq = 16384
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = disabled
llama_context: kv_unified    = false
llama_context: freq_base     = 500000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: use bfloat         = true
ggml_metal_init: use fusion         = true
ggml_metal_init: use concurrency    = true
ggml_metal_init: use graph optimize = true
llama_context:        CPU  output buffer size =     0.50 MiB
llama_kv_cache:      Metal KV buffer size =  1792.00 MiB
llama_kv_cache: size = 1792.00 MiB ( 16384 cells,  28 layers,  1/1 seqs), K (f16):  896.00 MiB, V (f16):  896.00 MiB
llama_context:      Metal compute buffer size =   816.01 MiB
llama_context:        CPU compute buffer size =    42.01 MiB
llama_context: graph nodes  = 1014
llama_context: graph splits = 2
time=2025-11-02T05:51:49.990-06:00 level=INFO source=server.go:1310 msg="llama runner started in 0.62 seconds"
time=2025-11-02T05:51:49.990-06:00 level=INFO source=sched.go:482 msg="loaded runners" count=1
time=2025-11-02T05:51:49.990-06:00 level=INFO source=server.go:1272 msg="waiting for llama runner to start responding"
time=2025-11-02T05:51:49.991-06:00 level=INFO source=server.go:1310 msg="llama runner started in 0.62 seconds"
time=2025-11-02T05:51:50.015-06:00 level=INFO source=sched.go:545 msg="updated VRAM based on existing loaded models" gpu=0 library=Metal total="10.7 GiB" available="6.1 GiB"
time=2025-11-02T05:51:50.072-06:00 level=INFO source=server.go:216 msg="enabling flash attention"
time=2025-11-02T05:51:50.072-06:00 level=INFO source=server.go:400 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --ollama-engine --model /Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a --port 51423"
time=2025-11-02T05:51:50.074-06:00 level=INFO source=server.go:676 msg="loading model" "model layers"=37 requested=-1
time=2025-11-02T05:51:50.075-06:00 level=INFO source=server.go:682 msg="system memory" total="16.0 GiB" free="7.7 GiB" free_swap="0 B"
time=2025-11-02T05:51:50.075-06:00 level=INFO source=server.go:690 msg="gpu memory" id=0 library=Metal available="6.1 GiB" free="6.1 GiB" minimum="0 B" overhead="0 B"
time=2025-11-02T05:51:50.083-06:00 level=INFO source=runner.go:1332 msg="starting ollama engine"
time=2025-11-02T05:51:50.083-06:00 level=INFO source=runner.go:1367 msg="Server listening on 127.0.0.1:51423"
time=2025-11-02T05:51:50.086-06:00 level=INFO source=runner.go:1205 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:28000 KvCacheType: NumThreads:4 GPULayers:37[ID:0 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-11-02T05:51:50.101-06:00 level=INFO source=ggml.go:134 msg="" architecture=qwen3 file_type=Q8_0 name="Qwen3 4B Instruct 2507" description="" num_tensors=398 num_key_values=33
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.019 sec
ggml_metal_device_init: GPU name:   Apple M2
ggml_metal_device_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 11453.25 MB
time=2025-11-02T05:51:50.102-06:00 level=INFO source=ggml.go:104 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 compiler=cgo(clang)
ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: use bfloat         = true
ggml_metal_init: use fusion         = true
ggml_metal_init: use concurrency    = true
ggml_metal_init: use graph optimize = true
time=2025-11-02T05:51:50.208-06:00 level=INFO source=runner.go:1205 msg=load request="{Operation:close LoraPath:[] Parallel:0 BatchSize:0 FlashAttention:false KvSize:0 KvCacheType: NumThreads:0 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-11-02T05:51:50.208-06:00 level=INFO source=device.go:206 msg="model weights" device=Metal size="4.0 GiB"
time=2025-11-02T05:51:50.208-06:00 level=INFO source=device.go:211 msg="model weights" device=CPU size="394.1 MiB"
time=2025-11-02T05:51:50.208-06:00 level=INFO source=device.go:217 msg="kv cache" device=Metal size="3.9 GiB"
time=2025-11-02T05:51:50.208-06:00 level=INFO source=device.go:228 msg="compute graph" device=Metal size="102.5 MiB"
time=2025-11-02T05:51:50.208-06:00 level=INFO source=device.go:233 msg="compute graph" device=CPU size="5.0 MiB"
time=2025-11-02T05:51:50.208-06:00 level=INFO source=device.go:238 msg="total memory" size="8.3 GiB"
[GIN] 2025/11/02 - 05:51:50 | 200 |  1.984071375s |  100.116.113.16 | POST     "/v1/chat/completions"
[GIN] 2025/11/02 - 05:51:51 | 200 |  2.254573292s |  100.116.113.16 | POST     "/v1/chat/completions"
time=2025-11-02T05:51:51.319-06:00 level=INFO source=server.go:676 msg="loading model" "model layers"=37 requested=-1
time=2025-11-02T05:51:51.319-06:00 level=INFO source=server.go:682 msg="system memory" total="16.0 GiB" free="6.7 GiB" free_swap="0 B"
time=2025-11-02T05:51:51.319-06:00 level=INFO source=server.go:690 msg="gpu memory" id=0 library=Metal available="10.7 GiB" free="10.7 GiB" minimum="0 B" overhead="0 B"
time=2025-11-02T05:51:51.319-06:00 level=INFO source=runner.go:1205 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:28000 KvCacheType: NumThreads:4 GPULayers:37[ID:0 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-11-02T05:51:51.339-06:00 level=INFO source=runner.go:1205 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:28000 KvCacheType: NumThreads:4 GPULayers:37[ID:0 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-11-02T05:51:51.601-06:00 level=INFO source=runner.go:1205 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:28000 KvCacheType: NumThreads:4 GPULayers:37[ID:0 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-11-02T05:51:51.601-06:00 level=INFO source=ggml.go:480 msg="offloading 36 repeating layers to GPU"
time=2025-11-02T05:51:51.601-06:00 level=INFO source=ggml.go:487 msg="offloading output layer to GPU"
time=2025-11-02T05:51:51.601-06:00 level=INFO source=ggml.go:492 msg="offloaded 37/37 layers to GPU"
time=2025-11-02T05:51:51.601-06:00 level=INFO source=device.go:206 msg="model weights" device=Metal size="4.0 GiB"
time=2025-11-02T05:51:51.601-06:00 level=INFO source=device.go:211 msg="model weights" device=CPU size="394.1 MiB"
time=2025-11-02T05:51:51.601-06:00 level=INFO source=device.go:217 msg="kv cache" device=Metal size="3.9 GiB"
time=2025-11-02T05:51:51.601-06:00 level=INFO source=device.go:228 msg="compute graph" device=Metal size="102.5 MiB"
time=2025-11-02T05:51:51.601-06:00 level=INFO source=device.go:233 msg="compute graph" device=CPU size="5.0 MiB"
time=2025-11-02T05:51:51.601-06:00 level=INFO source=device.go:238 msg="total memory" size="8.3 GiB"
time=2025-11-02T05:51:51.601-06:00 level=INFO source=sched.go:482 msg="loaded runners" count=1
time=2025-11-02T05:51:51.601-06:00 level=INFO source=server.go:1272 msg="waiting for llama runner to start responding"
time=2025-11-02T05:51:51.602-06:00 level=INFO source=server.go:1306 msg="waiting for server to become available" status="llm server loading model"
time=2025-11-02T05:51:53.358-06:00 level=INFO source=server.go:1310 msg="llama runner started in 3.29 seconds"
[GIN] 2025/11/02 - 05:52:41 | 200 | 52.939571291s |  100.116.113.16 | POST     "/v1/chat/completions"

0.12.9 logs:
time=2025-11-02T05:46:05.343-06:00 level=INFO source=routes.go:1524 msg="server config" env="map[HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:16384 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_KEEP_ALIVE:2562047h47m16.854775807s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/Users/pjv/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false http_proxy: https_proxy: no_proxy:]"
time=2025-11-02T05:46:05.348-06:00 level=INFO source=images.go:522 msg="total blobs: 55"
time=2025-11-02T05:46:05.349-06:00 level=INFO source=images.go:529 msg="total unused blobs removed: 0"
time=2025-11-02T05:46:05.350-06:00 level=INFO source=routes.go:1577 msg="Listening on [::]:11434 (version 0.12.9)"
time=2025-11-02T05:46:05.350-06:00 level=INFO source=runner.go:76 msg="discovering available GPUs..."
time=2025-11-02T05:46:05.353-06:00 level=INFO source=server.go:400 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --ollama-engine --port 51329"
time=2025-11-02T05:46:05.479-06:00 level=INFO source=types.go:42 msg="inference compute" id=0 filtered_id="" library=Metal compute=0.0 name=Metal description="Apple M2" libdirs="" driver=0.0 pci_id="" type=discrete total="10.7 GiB" available="10.7 GiB"
time=2025-11-02T05:46:05.479-06:00 level=INFO source=routes.go:1618 msg="entering low vram mode" "total vram"="10.7 GiB" threshold="20.0 GiB"
[GIN] 2025/11/02 - 05:46:05 | 200 |     165.833µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/11/02 - 05:46:05 | 200 |         309µs |       127.0.0.1 | GET      "/api/ps"
time=2025-11-02T05:46:31.600-06:00 level=INFO source=server.go:215 msg="enabling flash attention"
time=2025-11-02T05:46:31.601-06:00 level=INFO source=server.go:400 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --ollama-engine --model /Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a --port 51335"
time=2025-11-02T05:46:31.602-06:00 level=INFO source=server.go:653 msg="loading model" "model layers"=37 requested=-1
time=2025-11-02T05:46:31.602-06:00 level=INFO source=server.go:658 msg="system memory" total="16.0 GiB" free="11.0 GiB" free_swap="0 B"
time=2025-11-02T05:46:31.602-06:00 level=INFO source=server.go:665 msg="gpu memory" id=0 library=Metal available="10.2 GiB" free="10.7 GiB" minimum="512.0 MiB" overhead="0 B"
time=2025-11-02T05:46:31.610-06:00 level=INFO source=runner.go:1349 msg="starting ollama engine"
time=2025-11-02T05:46:31.610-06:00 level=INFO source=runner.go:1384 msg="Server listening on 127.0.0.1:51335"
time=2025-11-02T05:46:31.613-06:00 level=INFO source=runner.go:1222 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:28000 KvCacheType: NumThreads:4 GPULayers:37[ID:0 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-11-02T05:46:31.627-06:00 level=INFO source=ggml.go:136 msg="" architecture=qwen3 file_type=Q8_0 name="Qwen3 4B Instruct 2507" description="" num_tensors=398 num_key_values=33
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.006 sec
ggml_metal_device_init: GPU name:   Apple M2
ggml_metal_device_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 11453.25 MB
time=2025-11-02T05:46:31.629-06:00 level=INFO source=ggml.go:104 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 compiler=cgo(clang)
ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: use bfloat         = true
ggml_metal_init: use fusion         = true
ggml_metal_init: use concurrency    = true
ggml_metal_init: use graph optimize = true
time=2025-11-02T05:46:31.745-06:00 level=INFO source=runner.go:1222 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:28000 KvCacheType: NumThreads:4 GPULayers:37[ID:0 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-11-02T05:46:32.037-06:00 level=INFO source=runner.go:1222 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:28000 KvCacheType: NumThreads:4 GPULayers:37[ID:0 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-11-02T05:46:32.037-06:00 level=INFO source=ggml.go:482 msg="offloading 36 repeating layers to GPU"
time=2025-11-02T05:46:32.037-06:00 level=INFO source=ggml.go:489 msg="offloading output layer to GPU"
time=2025-11-02T05:46:32.037-06:00 level=INFO source=ggml.go:494 msg="offloaded 37/37 layers to GPU"
time=2025-11-02T05:46:32.037-06:00 level=INFO source=device.go:212 msg="model weights" device=Metal size="4.0 GiB"
time=2025-11-02T05:46:32.037-06:00 level=INFO source=device.go:217 msg="model weights" device=CPU size="394.1 MiB"
time=2025-11-02T05:46:32.037-06:00 level=INFO source=device.go:223 msg="kv cache" device=Metal size="3.9 GiB"
time=2025-11-02T05:46:32.037-06:00 level=INFO source=device.go:234 msg="compute graph" device=Metal size="131.5 MiB"
time=2025-11-02T05:46:32.037-06:00 level=INFO source=device.go:239 msg="compute graph" device=CPU size="5.0 MiB"
time=2025-11-02T05:46:32.037-06:00 level=INFO source=device.go:244 msg="total memory" size="8.4 GiB"
time=2025-11-02T05:46:32.037-06:00 level=INFO source=sched.go:493 msg="loaded runners" count=1
time=2025-11-02T05:46:32.037-06:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding"
time=2025-11-02T05:46:32.053-06:00 level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model"
time=2025-11-02T05:46:34.311-06:00 level=INFO source=server.go:1289 msg="llama runner started in 2.71 seconds"
time=2025-11-02T05:46:34.311-06:00 level=INFO source=sched.go:559 msg="updated VRAM based on existing loaded models" gpu=0 library=Metal total="10.7 GiB" available="2.7 GiB"
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.022 sec
ggml_metal_device_init: GPU name:   Apple M2
ggml_metal_device_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 11453.25 MB
llama_model_load_from_file_impl: using device Metal (Apple M2) (unknown id) - 10922 MiB free
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   8:                          llama.block_count u32              = 28
llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
llama_model_loader: - kv  10:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  16:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  17:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  18:                          general.file_type u32              = 15
llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type q4_K:  168 tensors
llama_model_loader: - type q6_K:   29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 1.87 GiB (5.01 BPW)
load: printing all EOG tokens:
load:   - 128001 ('<|end_of_text|>')
load:   - 128008 ('<|eom_id|>')
load:   - 128009 ('<|eot_id|>')
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = llama
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 3.21 B
print_info: general.name     = Llama 3.2 3B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128001 '<|end_of_text|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128001 '<|end_of_text|>'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-11-02T05:46:34.652-06:00 level=INFO source=server.go:400 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --model /Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --port 51339"
time=2025-11-02T05:46:34.656-06:00 level=INFO source=server.go:470 msg="system memory" total="16.0 GiB" free="7.5 GiB" free_swap="0 B"
time=2025-11-02T05:46:34.656-06:00 level=INFO source=server.go:483 msg="model requires more memory than is currently available, evicting a model to make space" estimate.library="" estimate.layers.requested=0 estimate.layers.model=0 estimate.layers.offload=0 estimate.layers.split=[] estimate.memory.available=[] estimate.memory.gpu_overhead="0 B" estimate.memory.required.full="0 B" estimate.memory.required.partial="0 B" estimate.memory.required.kv="0 B" estimate.memory.required.allocations=[] estimate.memory.weights.total="0 B" estimate.memory.weights.repeating="0 B" estimate.memory.weights.nonrepeating="0 B" estimate.memory.graph.full="0 B" estimate.memory.graph.partial="0 B"
time=2025-11-02T05:46:34.671-06:00 level=INFO source=runner.go:910 msg="starting go runner"
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.021 sec
ggml_metal_device_init: GPU name:   Apple M2
ggml_metal_device_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 11453.25 MB
time=2025-11-02T05:46:34.673-06:00 level=INFO source=ggml.go:104 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 compiler=cgo(clang)
time=2025-11-02T05:46:34.776-06:00 level=INFO source=runner.go:946 msg="Server listening on 127.0.0.1:51339"
[GIN] 2025/11/02 - 05:47:18 | 200 | 47.147693541s |  100.116.113.16 | POST     "/v1/chat/completions"
time=2025-11-02T05:47:18.676-06:00 level=INFO source=server.go:470 msg="system memory" total="16.0 GiB" free="3.0 GiB" free_swap="0 B"
time=2025-11-02T05:47:18.676-06:00 level=INFO source=server.go:522 msg=offload library=Metal layers.requested=-1 layers.model=29 layers.offload=10 layers.split=[10] memory.available="[2.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="5.1 GiB" memory.required.partial="2.6 GiB" memory.required.kv="1.8 GiB" memory.required.allocations="[2.6 GiB]" memory.weights.total="1.9 GiB" memory.weights.repeating="1.6 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="824.0 MiB" memory.graph.partial="824.0 MiB"
time=2025-11-02T05:47:18.677-06:00 level=INFO source=runner.go:845 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:16384 KvCacheType: NumThreads:4 GPULayers:10[ID:0 Layers:10(18..27)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
llama_model_load_from_file_impl: using device Metal (Apple M2) (unknown id) - 10922 MiB free
time=2025-11-02T05:47:18.677-06:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding"
time=2025-11-02T05:47:18.678-06:00 level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   8:                          llama.block_count u32              = 28
llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
llama_model_loader: - kv  10:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  16:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  17:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  18:                          general.file_type u32              = 15
llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type q4_K:  168 tensors
llama_model_loader: - type q6_K:   29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 1.87 GiB (5.01 BPW)
load: printing all EOG tokens:
load:   - 128001 ('<|end_of_text|>')
load:   - 128008 ('<|eom_id|>')
load:   - 128009 ('<|eot_id|>')
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 3072
print_info: n_layer          = 28
print_info: n_head           = 24
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 3
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 8192
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: model type       = 3B
print_info: model params     = 3.21 B
print_info: general.name     = Llama 3.2 3B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128001 '<|end_of_text|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128001 '<|end_of_text|>'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 10 repeating layers to GPU
load_tensors: offloaded 10/29 layers to GPU
load_tensors:          CPU model buffer size =  1330.17 MiB
load_tensors:        Metal model buffer size =   588.19 MiB
llama_init_from_model: model default pooling_type is [0], but [-1] was specified
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 16384
llama_context: n_ctx_per_seq = 16384
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = disabled
llama_context: kv_unified    = false
llama_context: freq_base     = 500000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: use bfloat         = true
ggml_metal_init: use fusion         = true
ggml_metal_init: use concurrency    = true
ggml_metal_init: use graph optimize = true
llama_context:        CPU  output buffer size =     0.50 MiB
llama_kv_cache:        CPU KV buffer size =  1152.00 MiB
llama_kv_cache:      Metal KV buffer size =   640.00 MiB
llama_kv_cache: size = 1792.00 MiB ( 16384 cells,  28 layers,  1/1 seqs), K (f16):  896.00 MiB, V (f16):  896.00 MiB
llama_context:      Metal compute buffer size =   816.01 MiB
llama_context:        CPU compute buffer size =   828.01 MiB
llama_context: graph nodes  = 1014
llama_context: graph splits = 255 (with bs=512), 3 (with bs=1)
time=2025-11-02T05:47:19.683-06:00 level=INFO source=server.go:1289 msg="llama runner started in 45.03 seconds"
time=2025-11-02T05:47:19.683-06:00 level=INFO source=sched.go:493 msg="loaded runners" count=1
time=2025-11-02T05:47:19.683-06:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding"
time=2025-11-02T05:47:19.684-06:00 level=INFO source=server.go:1289 msg="llama runner started in 45.03 seconds"
[GIN] 2025/11/02 - 05:47:21 | 200 | 50.164034208s |  100.116.113.16 | POST     "/v1/chat/completions"
[GIN] 2025/11/02 - 05:47:21 | 200 |  3.205107208s |  100.116.113.16 | POST     "/v1/chat/completions"
[GIN] 2025/11/02 - 05:47:22 | 200 | 50.812594833s |  100.116.113.16 | POST     "/v1/chat/completions"

OS

macOS

GPU

Apple

CPU

Apple

Ollama version

0.12.6 and 0.12.9

llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 128009 llama_model_loader: - kv 28: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... llama_model_loader: - kv 29: general.quantization_version u32 = 2 llama_model_loader: - type f32: 58 tensors llama_model_loader: - type q4_K: 168 tensors llama_model_loader: - type q6_K: 29 tensors print_info: file format = GGUF V3 (latest) print_info: file type = Q4_K - Medium print_info: file size = 1.87 GiB (5.01 BPW) load: printing all EOG tokens: load: - 128001 ('<|end_of_text|>') load: - 128008 ('<|eom_id|>') load: - 128009 ('<|eot_id|>') load: special tokens cache size = 256 load: token to piece cache size = 0.7999 MB print_info: arch = llama print_info: vocab_only = 0 print_info: n_ctx_train = 131072 print_info: n_embd = 3072 print_info: n_layer = 28 print_info: n_head = 24 print_info: n_head_kv = 8 print_info: n_rot = 128 print_info: n_swa = 0 print_info: is_swa_any = 0 print_info: n_embd_head_k = 128 print_info: n_embd_head_v = 128 print_info: n_gqa = 3 print_info: n_embd_k_gqa = 1024 print_info: n_embd_v_gqa = 1024 print_info: f_norm_eps = 0.0e+00 print_info: f_norm_rms_eps = 1.0e-05 print_info: f_clamp_kqv = 0.0e+00 print_info: f_max_alibi_bias = 0.0e+00 print_info: f_logit_scale = 0.0e+00 print_info: f_attn_scale = 0.0e+00 print_info: n_ff = 8192 print_info: n_expert = 0 print_info: n_expert_used = 0 print_info: causal attn = 1 print_info: pooling type = 0 print_info: rope type = 0 print_info: rope scaling = linear print_info: freq_base_train = 500000.0 print_info: freq_scale_train = 1 print_info: n_ctx_orig_yarn = 131072 print_info: rope_finetuned = unknown print_info: model type = 3B print_info: model params = 3.21 B print_info: general.name = Llama 3.2 3B Instruct print_info: vocab type = BPE print_info: n_vocab = 128256 print_info: n_merges = 280147 print_info: BOS token = 128000 '<|begin_of_text|>' print_info: EOS token = 128009 '<|eot_id|>' print_info: EOT token = 128001 '<|end_of_text|>' print_info: EOM token = 128008 '<|eom_id|>' print_info: LF token = 198 'Ċ' print_info: EOG token = 128001 '<|end_of_text|>' print_info: EOG token = 128008 '<|eom_id|>' print_info: EOG token = 128009 '<|eot_id|>' print_info: max token length = 256 load_tensors: loading model tensors, this can take a while... 
(mmap = true) load_tensors: offloading 28 repeating layers to GPU load_tensors: offloading output layer to GPU load_tensors: offloaded 29/29 layers to GPU load_tensors: CPU_Mapped model buffer size = 308.23 MiB load_tensors: Metal_Mapped model buffer size = 1918.35 MiB llama_init_from_model: model default pooling_type is [0], but [-1] was specified llama_context: constructing llama_context llama_context: n_seq_max = 1 llama_context: n_ctx = 16384 llama_context: n_ctx_per_seq = 16384 llama_context: n_batch = 512 llama_context: n_ubatch = 512 llama_context: causal_attn = 1 llama_context: flash_attn = disabled llama_context: kv_unified = false llama_context: freq_base = 500000.0 llama_context: freq_scale = 1 llama_context: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized ggml_metal_init: allocating ggml_metal_init: picking default device: Apple M2 ggml_metal_init: use bfloat = true ggml_metal_init: use fusion = true ggml_metal_init: use concurrency = true ggml_metal_init: use graph optimize = true llama_context: CPU output buffer size = 0.50 MiB llama_kv_cache: Metal KV buffer size = 1792.00 MiB llama_kv_cache: size = 1792.00 MiB ( 16384 cells, 28 layers, 1/1 seqs), K (f16): 896.00 MiB, V (f16): 896.00 MiB llama_context: Metal compute buffer size = 816.01 MiB llama_context: CPU compute buffer size = 42.01 MiB llama_context: graph nodes = 1014 llama_context: graph splits = 2 time=2025-11-02T05:51:49.990-06:00 level=INFO source=server.go:1310 msg="llama runner started in 0.62 seconds" time=2025-11-02T05:51:49.990-06:00 level=INFO source=sched.go:482 msg="loaded runners" count=1 time=2025-11-02T05:51:49.990-06:00 level=INFO source=server.go:1272 msg="waiting for llama runner to start responding" time=2025-11-02T05:51:49.991-06:00 level=INFO source=server.go:1310 msg="llama runner started in 0.62 seconds" time=2025-11-02T05:51:50.015-06:00 level=INFO source=sched.go:545 msg="updated VRAM based on existing loaded models" gpu=0 library=Metal total="10.7 GiB" available="6.1 GiB" time=2025-11-02T05:51:50.072-06:00 level=INFO source=server.go:216 msg="enabling flash attention" time=2025-11-02T05:51:50.072-06:00 level=INFO source=server.go:400 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --ollama-engine --model /Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a --port 51423" time=2025-11-02T05:51:50.074-06:00 level=INFO source=server.go:676 msg="loading model" "model layers"=37 requested=-1 time=2025-11-02T05:51:50.075-06:00 level=INFO source=server.go:682 msg="system memory" total="16.0 GiB" free="7.7 GiB" free_swap="0 B" time=2025-11-02T05:51:50.075-06:00 level=INFO source=server.go:690 msg="gpu memory" id=0 library=Metal available="6.1 GiB" free="6.1 GiB" minimum="0 B" overhead="0 B" time=2025-11-02T05:51:50.083-06:00 level=INFO source=runner.go:1332 msg="starting ollama engine" time=2025-11-02T05:51:50.083-06:00 level=INFO source=runner.go:1367 msg="Server listening on 127.0.0.1:51423" time=2025-11-02T05:51:50.086-06:00 level=INFO source=runner.go:1205 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:28000 KvCacheType: NumThreads:4 GPULayers:37[ID:0 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2025-11-02T05:51:50.101-06:00 level=INFO source=ggml.go:134 msg="" architecture=qwen3 file_type=Q8_0 name="Qwen3 4B Instruct 2507" description="" num_tensors=398 num_key_values=33 
ggml_metal_library_init: using embedded metal library ggml_metal_library_init: loaded in 0.019 sec ggml_metal_device_init: GPU name: Apple M2 ggml_metal_device_init: GPU family: MTLGPUFamilyApple8 (1008) ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003) ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3 (5001) ggml_metal_device_init: simdgroup reduction = true ggml_metal_device_init: simdgroup matrix mul. = true ggml_metal_device_init: has unified memory = true ggml_metal_device_init: has bfloat = true ggml_metal_device_init: use residency sets = true ggml_metal_device_init: use shared buffers = true ggml_metal_device_init: recommendedMaxWorkingSetSize = 11453.25 MB time=2025-11-02T05:51:50.102-06:00 level=INFO source=ggml.go:104 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 compiler=cgo(clang) ggml_metal_init: allocating ggml_metal_init: picking default device: Apple M2 ggml_metal_init: use bfloat = true ggml_metal_init: use fusion = true ggml_metal_init: use concurrency = true ggml_metal_init: use graph optimize = true time=2025-11-02T05:51:50.208-06:00 level=INFO source=runner.go:1205 msg=load request="{Operation:close LoraPath:[] Parallel:0 BatchSize:0 FlashAttention:false KvSize:0 KvCacheType: NumThreads:0 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2025-11-02T05:51:50.208-06:00 level=INFO source=device.go:206 msg="model weights" device=Metal size="4.0 GiB" time=2025-11-02T05:51:50.208-06:00 level=INFO source=device.go:211 msg="model weights" device=CPU size="394.1 MiB" time=2025-11-02T05:51:50.208-06:00 level=INFO source=device.go:217 msg="kv cache" device=Metal size="3.9 GiB" time=2025-11-02T05:51:50.208-06:00 level=INFO source=device.go:228 msg="compute graph" device=Metal size="102.5 MiB" time=2025-11-02T05:51:50.208-06:00 level=INFO source=device.go:233 msg="compute graph" device=CPU size="5.0 MiB" time=2025-11-02T05:51:50.208-06:00 level=INFO source=device.go:238 msg="total memory" size="8.3 GiB" [GIN] 2025/11/02 - 05:51:50 | 200 | 1.984071375s | 100.116.113.16 | POST "/v1/chat/completions" [GIN] 2025/11/02 - 05:51:51 | 200 | 2.254573292s | 100.116.113.16 | POST "/v1/chat/completions" time=2025-11-02T05:51:51.319-06:00 level=INFO source=server.go:676 msg="loading model" "model layers"=37 requested=-1 time=2025-11-02T05:51:51.319-06:00 level=INFO source=server.go:682 msg="system memory" total="16.0 GiB" free="6.7 GiB" free_swap="0 B" time=2025-11-02T05:51:51.319-06:00 level=INFO source=server.go:690 msg="gpu memory" id=0 library=Metal available="10.7 GiB" free="10.7 GiB" minimum="0 B" overhead="0 B" time=2025-11-02T05:51:51.319-06:00 level=INFO source=runner.go:1205 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:28000 KvCacheType: NumThreads:4 GPULayers:37[ID:0 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2025-11-02T05:51:51.339-06:00 level=INFO source=runner.go:1205 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:28000 KvCacheType: NumThreads:4 GPULayers:37[ID:0 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2025-11-02T05:51:51.601-06:00 level=INFO source=runner.go:1205 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:28000 KvCacheType: NumThreads:4 GPULayers:37[ID:0 Layers:37(0..36)] MultiUserCache:false 
ProjectorPath: MainGPU:0 UseMmap:false}" time=2025-11-02T05:51:51.601-06:00 level=INFO source=ggml.go:480 msg="offloading 36 repeating layers to GPU" time=2025-11-02T05:51:51.601-06:00 level=INFO source=ggml.go:487 msg="offloading output layer to GPU" time=2025-11-02T05:51:51.601-06:00 level=INFO source=ggml.go:492 msg="offloaded 37/37 layers to GPU" time=2025-11-02T05:51:51.601-06:00 level=INFO source=device.go:206 msg="model weights" device=Metal size="4.0 GiB" time=2025-11-02T05:51:51.601-06:00 level=INFO source=device.go:211 msg="model weights" device=CPU size="394.1 MiB" time=2025-11-02T05:51:51.601-06:00 level=INFO source=device.go:217 msg="kv cache" device=Metal size="3.9 GiB" time=2025-11-02T05:51:51.601-06:00 level=INFO source=device.go:228 msg="compute graph" device=Metal size="102.5 MiB" time=2025-11-02T05:51:51.601-06:00 level=INFO source=device.go:233 msg="compute graph" device=CPU size="5.0 MiB" time=2025-11-02T05:51:51.601-06:00 level=INFO source=device.go:238 msg="total memory" size="8.3 GiB" time=2025-11-02T05:51:51.601-06:00 level=INFO source=sched.go:482 msg="loaded runners" count=1 time=2025-11-02T05:51:51.601-06:00 level=INFO source=server.go:1272 msg="waiting for llama runner to start responding" time=2025-11-02T05:51:51.602-06:00 level=INFO source=server.go:1306 msg="waiting for server to become available" status="llm server loading model" time=2025-11-02T05:51:53.358-06:00 level=INFO source=server.go:1310 msg="llama runner started in 3.29 seconds" [GIN] 2025/11/02 - 05:52:41 | 200 | 52.939571291s | 100.116.113.16 | POST "/v1/chat/completions" 0.12.9 logs: time=2025-11-02T05:46:05.343-06:00 level=INFO source=routes.go:1524 msg="server config" env="map[HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:16384 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_KEEP_ALIVE:2562047h47m16.854775807s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/Users/pjv/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false http_proxy: https_proxy: no_proxy:]" time=2025-11-02T05:46:05.348-06:00 level=INFO source=images.go:522 msg="total blobs: 55" time=2025-11-02T05:46:05.349-06:00 level=INFO source=images.go:529 msg="total unused blobs removed: 0" time=2025-11-02T05:46:05.350-06:00 level=INFO source=routes.go:1577 msg="Listening on [::]:11434 (version 0.12.9)" time=2025-11-02T05:46:05.350-06:00 level=INFO source=runner.go:76 msg="discovering available GPUs..." 
time=2025-11-02T05:46:05.353-06:00 level=INFO source=server.go:400 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --ollama-engine --port 51329" time=2025-11-02T05:46:05.479-06:00 level=INFO source=types.go:42 msg="inference compute" id=0 filtered_id="" library=Metal compute=0.0 name=Metal description="Apple M2" libdirs="" driver=0.0 pci_id="" type=discrete total="10.7 GiB" available="10.7 GiB" time=2025-11-02T05:46:05.479-06:00 level=INFO source=routes.go:1618 msg="entering low vram mode" "total vram"="10.7 GiB" threshold="20.0 GiB" [GIN] 2025/11/02 - 05:46:05 | 200 | 165.833µs | 127.0.0.1 | HEAD "/" [GIN] 2025/11/02 - 05:46:05 | 200 | 309µs | 127.0.0.1 | GET "/api/ps" time=2025-11-02T05:46:31.600-06:00 level=INFO source=server.go:215 msg="enabling flash attention" time=2025-11-02T05:46:31.601-06:00 level=INFO source=server.go:400 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --ollama-engine --model /Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a --port 51335" time=2025-11-02T05:46:31.602-06:00 level=INFO source=server.go:653 msg="loading model" "model layers"=37 requested=-1 time=2025-11-02T05:46:31.602-06:00 level=INFO source=server.go:658 msg="system memory" total="16.0 GiB" free="11.0 GiB" free_swap="0 B" time=2025-11-02T05:46:31.602-06:00 level=INFO source=server.go:665 msg="gpu memory" id=0 library=Metal available="10.2 GiB" free="10.7 GiB" minimum="512.0 MiB" overhead="0 B" time=2025-11-02T05:46:31.610-06:00 level=INFO source=runner.go:1349 msg="starting ollama engine" time=2025-11-02T05:46:31.610-06:00 level=INFO source=runner.go:1384 msg="Server listening on 127.0.0.1:51335" time=2025-11-02T05:46:31.613-06:00 level=INFO source=runner.go:1222 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:28000 KvCacheType: NumThreads:4 GPULayers:37[ID:0 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2025-11-02T05:46:31.627-06:00 level=INFO source=ggml.go:136 msg="" architecture=qwen3 file_type=Q8_0 name="Qwen3 4B Instruct 2507" description="" num_tensors=398 num_key_values=33 ggml_metal_library_init: using embedded metal library ggml_metal_library_init: loaded in 0.006 sec ggml_metal_device_init: GPU name: Apple M2 ggml_metal_device_init: GPU family: MTLGPUFamilyApple8 (1008) ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003) ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3 (5001) ggml_metal_device_init: simdgroup reduction = true ggml_metal_device_init: simdgroup matrix mul. 
= true ggml_metal_device_init: has unified memory = true ggml_metal_device_init: has bfloat = true ggml_metal_device_init: use residency sets = true ggml_metal_device_init: use shared buffers = true ggml_metal_device_init: recommendedMaxWorkingSetSize = 11453.25 MB time=2025-11-02T05:46:31.629-06:00 level=INFO source=ggml.go:104 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 compiler=cgo(clang) ggml_metal_init: allocating ggml_metal_init: picking default device: Apple M2 ggml_metal_init: use bfloat = true ggml_metal_init: use fusion = true ggml_metal_init: use concurrency = true ggml_metal_init: use graph optimize = true time=2025-11-02T05:46:31.745-06:00 level=INFO source=runner.go:1222 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:28000 KvCacheType: NumThreads:4 GPULayers:37[ID:0 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2025-11-02T05:46:32.037-06:00 level=INFO source=runner.go:1222 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:28000 KvCacheType: NumThreads:4 GPULayers:37[ID:0 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2025-11-02T05:46:32.037-06:00 level=INFO source=ggml.go:482 msg="offloading 36 repeating layers to GPU" time=2025-11-02T05:46:32.037-06:00 level=INFO source=ggml.go:489 msg="offloading output layer to GPU" time=2025-11-02T05:46:32.037-06:00 level=INFO source=ggml.go:494 msg="offloaded 37/37 layers to GPU" time=2025-11-02T05:46:32.037-06:00 level=INFO source=device.go:212 msg="model weights" device=Metal size="4.0 GiB" time=2025-11-02T05:46:32.037-06:00 level=INFO source=device.go:217 msg="model weights" device=CPU size="394.1 MiB" time=2025-11-02T05:46:32.037-06:00 level=INFO source=device.go:223 msg="kv cache" device=Metal size="3.9 GiB" time=2025-11-02T05:46:32.037-06:00 level=INFO source=device.go:234 msg="compute graph" device=Metal size="131.5 MiB" time=2025-11-02T05:46:32.037-06:00 level=INFO source=device.go:239 msg="compute graph" device=CPU size="5.0 MiB" time=2025-11-02T05:46:32.037-06:00 level=INFO source=device.go:244 msg="total memory" size="8.4 GiB" time=2025-11-02T05:46:32.037-06:00 level=INFO source=sched.go:493 msg="loaded runners" count=1 time=2025-11-02T05:46:32.037-06:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding" time=2025-11-02T05:46:32.053-06:00 level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model" time=2025-11-02T05:46:34.311-06:00 level=INFO source=server.go:1289 msg="llama runner started in 2.71 seconds" time=2025-11-02T05:46:34.311-06:00 level=INFO source=sched.go:559 msg="updated VRAM based on existing loaded models" gpu=0 library=Metal total="10.7 GiB" available="2.7 GiB" ggml_metal_library_init: using embedded metal library ggml_metal_library_init: loaded in 0.022 sec ggml_metal_device_init: GPU name: Apple M2 ggml_metal_device_init: GPU family: MTLGPUFamilyApple8 (1008) ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003) ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3 (5001) ggml_metal_device_init: simdgroup reduction = true ggml_metal_device_init: simdgroup matrix mul. 
= true ggml_metal_device_init: has unified memory = true ggml_metal_device_init: has bfloat = true ggml_metal_device_init: use residency sets = true ggml_metal_device_init: use shared buffers = true ggml_metal_device_init: recommendedMaxWorkingSetSize = 11453.25 MB llama_model_load_from_file_impl: using device Metal (Apple M2) (unknown id) - 10922 MiB free llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Llama 3.2 3B Instruct llama_model_loader: - kv 3: general.finetune str = Instruct llama_model_loader: - kv 4: general.basename str = Llama-3.2 llama_model_loader: - kv 5: general.size_label str = 3B llama_model_loader: - kv 6: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam... llama_model_loader: - kv 7: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ... llama_model_loader: - kv 8: llama.block_count u32 = 28 llama_model_loader: - kv 9: llama.context_length u32 = 131072 llama_model_loader: - kv 10: llama.embedding_length u32 = 3072 llama_model_loader: - kv 11: llama.feed_forward_length u32 = 8192 llama_model_loader: - kv 12: llama.attention.head_count u32 = 24 llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000 llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 16: llama.attention.key_length u32 = 128 llama_model_loader: - kv 17: llama.attention.value_length u32 = 128 llama_model_loader: - kv 18: general.file_type u32 = 15 llama_model_loader: - kv 19: llama.vocab_size u32 = 128256 llama_model_loader: - kv 20: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 22: tokenizer.ggml.pre str = llama-bpe llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 128009 llama_model_loader: - kv 28: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... 
llama_model_loader: - kv 29: general.quantization_version u32 = 2 llama_model_loader: - type f32: 58 tensors llama_model_loader: - type q4_K: 168 tensors llama_model_loader: - type q6_K: 29 tensors print_info: file format = GGUF V3 (latest) print_info: file type = Q4_K - Medium print_info: file size = 1.87 GiB (5.01 BPW) load: printing all EOG tokens: load: - 128001 ('<|end_of_text|>') load: - 128008 ('<|eom_id|>') load: - 128009 ('<|eot_id|>') load: special tokens cache size = 256 load: token to piece cache size = 0.7999 MB print_info: arch = llama print_info: vocab_only = 1 print_info: model type = ?B print_info: model params = 3.21 B print_info: general.name = Llama 3.2 3B Instruct print_info: vocab type = BPE print_info: n_vocab = 128256 print_info: n_merges = 280147 print_info: BOS token = 128000 '<|begin_of_text|>' print_info: EOS token = 128009 '<|eot_id|>' print_info: EOT token = 128001 '<|end_of_text|>' print_info: EOM token = 128008 '<|eom_id|>' print_info: LF token = 198 'Ċ' print_info: EOG token = 128001 '<|end_of_text|>' print_info: EOG token = 128008 '<|eom_id|>' print_info: EOG token = 128009 '<|eot_id|>' print_info: max token length = 256 llama_model_load: vocab only - skipping tensors time=2025-11-02T05:46:34.652-06:00 level=INFO source=server.go:400 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --model /Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --port 51339" time=2025-11-02T05:46:34.656-06:00 level=INFO source=server.go:470 msg="system memory" total="16.0 GiB" free="7.5 GiB" free_swap="0 B" time=2025-11-02T05:46:34.656-06:00 level=INFO source=server.go:483 msg="model requires more memory than is currently available, evicting a model to make space" estimate.library="" estimate.layers.requested=0 estimate.layers.model=0 estimate.layers.offload=0 estimate.layers.split=[] estimate.memory.available=[] estimate.memory.gpu_overhead="0 B" estimate.memory.required.full="0 B" estimate.memory.required.partial="0 B" estimate.memory.required.kv="0 B" estimate.memory.required.allocations=[] estimate.memory.weights.total="0 B" estimate.memory.weights.repeating="0 B" estimate.memory.weights.nonrepeating="0 B" estimate.memory.graph.full="0 B" estimate.memory.graph.partial="0 B" time=2025-11-02T05:46:34.671-06:00 level=INFO source=runner.go:910 msg="starting go runner" ggml_metal_library_init: using embedded metal library ggml_metal_library_init: loaded in 0.021 sec ggml_metal_device_init: GPU name: Apple M2 ggml_metal_device_init: GPU family: MTLGPUFamilyApple8 (1008) ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003) ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3 (5001) ggml_metal_device_init: simdgroup reduction = true ggml_metal_device_init: simdgroup matrix mul. 
= true ggml_metal_device_init: has unified memory = true ggml_metal_device_init: has bfloat = true ggml_metal_device_init: use residency sets = true ggml_metal_device_init: use shared buffers = true ggml_metal_device_init: recommendedMaxWorkingSetSize = 11453.25 MB time=2025-11-02T05:46:34.673-06:00 level=INFO source=ggml.go:104 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 compiler=cgo(clang) time=2025-11-02T05:46:34.776-06:00 level=INFO source=runner.go:946 msg="Server listening on 127.0.0.1:51339" [GIN] 2025/11/02 - 05:47:18 | 200 | 47.147693541s | 100.116.113.16 | POST "/v1/chat/completions" time=2025-11-02T05:47:18.676-06:00 level=INFO source=server.go:470 msg="system memory" total="16.0 GiB" free="3.0 GiB" free_swap="0 B" time=2025-11-02T05:47:18.676-06:00 level=INFO source=server.go:522 msg=offload library=Metal layers.requested=-1 layers.model=29 layers.offload=10 layers.split=[10] memory.available="[2.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="5.1 GiB" memory.required.partial="2.6 GiB" memory.required.kv="1.8 GiB" memory.required.allocations="[2.6 GiB]" memory.weights.total="1.9 GiB" memory.weights.repeating="1.6 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="824.0 MiB" memory.graph.partial="824.0 MiB" time=2025-11-02T05:47:18.677-06:00 level=INFO source=runner.go:845 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:16384 KvCacheType: NumThreads:4 GPULayers:10[ID:0 Layers:10(18..27)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" llama_model_load_from_file_impl: using device Metal (Apple M2) (unknown id) - 10922 MiB free time=2025-11-02T05:47:18.677-06:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding" time=2025-11-02T05:47:18.678-06:00 level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model" llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Llama 3.2 3B Instruct llama_model_loader: - kv 3: general.finetune str = Instruct llama_model_loader: - kv 4: general.basename str = Llama-3.2 llama_model_loader: - kv 5: general.size_label str = 3B llama_model_loader: - kv 6: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam... llama_model_loader: - kv 7: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ... 
llama_model_loader: - kv 8: llama.block_count u32 = 28 llama_model_loader: - kv 9: llama.context_length u32 = 131072 llama_model_loader: - kv 10: llama.embedding_length u32 = 3072 llama_model_loader: - kv 11: llama.feed_forward_length u32 = 8192 llama_model_loader: - kv 12: llama.attention.head_count u32 = 24 llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000 llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 16: llama.attention.key_length u32 = 128 llama_model_loader: - kv 17: llama.attention.value_length u32 = 128 llama_model_loader: - kv 18: general.file_type u32 = 15 llama_model_loader: - kv 19: llama.vocab_size u32 = 128256 llama_model_loader: - kv 20: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 22: tokenizer.ggml.pre str = llama-bpe llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 128009 llama_model_loader: - kv 28: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... llama_model_loader: - kv 29: general.quantization_version u32 = 2 llama_model_loader: - type f32: 58 tensors llama_model_loader: - type q4_K: 168 tensors llama_model_loader: - type q6_K: 29 tensors print_info: file format = GGUF V3 (latest) print_info: file type = Q4_K - Medium print_info: file size = 1.87 GiB (5.01 BPW) load: printing all EOG tokens: load: - 128001 ('<|end_of_text|>') load: - 128008 ('<|eom_id|>') load: - 128009 ('<|eot_id|>') load: special tokens cache size = 256 load: token to piece cache size = 0.7999 MB print_info: arch = llama print_info: vocab_only = 0 print_info: n_ctx_train = 131072 print_info: n_embd = 3072 print_info: n_layer = 28 print_info: n_head = 24 print_info: n_head_kv = 8 print_info: n_rot = 128 print_info: n_swa = 0 print_info: is_swa_any = 0 print_info: n_embd_head_k = 128 print_info: n_embd_head_v = 128 print_info: n_gqa = 3 print_info: n_embd_k_gqa = 1024 print_info: n_embd_v_gqa = 1024 print_info: f_norm_eps = 0.0e+00 print_info: f_norm_rms_eps = 1.0e-05 print_info: f_clamp_kqv = 0.0e+00 print_info: f_max_alibi_bias = 0.0e+00 print_info: f_logit_scale = 0.0e+00 print_info: f_attn_scale = 0.0e+00 print_info: n_ff = 8192 print_info: n_expert = 0 print_info: n_expert_used = 0 print_info: causal attn = 1 print_info: pooling type = 0 print_info: rope type = 0 print_info: rope scaling = linear print_info: freq_base_train = 500000.0 print_info: freq_scale_train = 1 print_info: n_ctx_orig_yarn = 131072 print_info: rope_finetuned = unknown print_info: model type = 3B print_info: model params = 3.21 B print_info: general.name = Llama 3.2 3B Instruct print_info: vocab type = BPE print_info: n_vocab = 128256 print_info: n_merges = 280147 print_info: BOS token = 128000 '<|begin_of_text|>' print_info: EOS token = 128009 '<|eot_id|>' print_info: EOT token = 128001 '<|end_of_text|>' print_info: EOM token = 128008 '<|eom_id|>' print_info: LF token = 198 'Ċ' print_info: EOG token = 128001 '<|end_of_text|>' print_info: EOG token = 128008 '<|eom_id|>' print_info: 
EOG token = 128009 '<|eot_id|>' print_info: max token length = 256 load_tensors: loading model tensors, this can take a while... (mmap = false) load_tensors: offloading 10 repeating layers to GPU load_tensors: offloaded 10/29 layers to GPU load_tensors: CPU model buffer size = 1330.17 MiB load_tensors: Metal model buffer size = 588.19 MiB llama_init_from_model: model default pooling_type is [0], but [-1] was specified llama_context: constructing llama_context llama_context: n_seq_max = 1 llama_context: n_ctx = 16384 llama_context: n_ctx_per_seq = 16384 llama_context: n_batch = 512 llama_context: n_ubatch = 512 llama_context: causal_attn = 1 llama_context: flash_attn = disabled llama_context: kv_unified = false llama_context: freq_base = 500000.0 llama_context: freq_scale = 1 llama_context: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized ggml_metal_init: allocating ggml_metal_init: picking default device: Apple M2 ggml_metal_init: use bfloat = true ggml_metal_init: use fusion = true ggml_metal_init: use concurrency = true ggml_metal_init: use graph optimize = true llama_context: CPU output buffer size = 0.50 MiB llama_kv_cache: CPU KV buffer size = 1152.00 MiB llama_kv_cache: Metal KV buffer size = 640.00 MiB llama_kv_cache: size = 1792.00 MiB ( 16384 cells, 28 layers, 1/1 seqs), K (f16): 896.00 MiB, V (f16): 896.00 MiB llama_context: Metal compute buffer size = 816.01 MiB llama_context: CPU compute buffer size = 828.01 MiB llama_context: graph nodes = 1014 llama_context: graph splits = 255 (with bs=512), 3 (with bs=1) time=2025-11-02T05:47:19.683-06:00 level=INFO source=server.go:1289 msg="llama runner started in 45.03 seconds" time=2025-11-02T05:47:19.683-06:00 level=INFO source=sched.go:493 msg="loaded runners" count=1 time=2025-11-02T05:47:19.683-06:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding" time=2025-11-02T05:47:19.684-06:00 level=INFO source=server.go:1289 msg="llama runner started in 45.03 seconds" [GIN] 2025/11/02 - 05:47:21 | 200 | 50.164034208s | 100.116.113.16 | POST "/v1/chat/completions" [GIN] 2025/11/02 - 05:47:21 | 200 | 3.205107208s | 100.116.113.16 | POST "/v1/chat/completions" [GIN] 2025/11/02 - 05:47:22 | 200 | 50.812594833s | 100.116.113.16 | POST "/v1/chat/completions" ``` ### OS macOS ### GPU Apple ### CPU Apple ### Ollama version 0.12.6 and 0.12.9
GiteaMirror added the macos, bug, memory labels 2026-05-04 22:17:46 -05:00

@StrandmonYellow commented on GitHub (Nov 2, 2025):

I noticed that after version 0.12.6 my ollama slowed wayyyy down. Before, I had a response rate of +/- 20 tokens/s when using the llama 3.2 3b model. Now I have something like 0.7 tokens/s. I don't know if it is related to this issue, but it could be.


@lusum commented on GitHub (Nov 2, 2025):

Mistral models too: mistral-small3.1 went from 37-38 tokens/s on 0.11.10, to 29-30 tokens/s from 0.11.11 through 0.12.6, to 5 tokens/s after 0.12.6. Same with mistral-small3.2 and the magistral models.


@Panican-Whyasker commented on GitHub (Nov 3, 2025):

I observe that, more recently, Ollama has moved to Facebook's development approach, namely
"Move Fast and Break Things" ;)
Not necessarily a bad thing, though. :)


@josephlugo commented on GitHub (Nov 3, 2025):

Moving faster and letting the community do the QA work for free seems to be the new approach now.


@phrozen commented on GitHub (Nov 3, 2025):

Yeah, I noticed that too; 0.12.5 broke embeddings. They have very little QA, if any at all. And that is fine, but the problem is that the client wants to auto-update ASAP. They should have two different release cycles, edge and stable.


@jessegross commented on GitHub (Nov 3, 2025):

In these logs, there isn't enough memory to load both models at the same time, so the first model needs to be evicted. However, when we load the second model (in the most recent version) we don't see the newly freed VRAM. This could be because the first model hasn't actually stopped (it shows "Stopping..." in the screenshot), though that depends on when that screenshot was taken. Or it could be that free memory is not being reported correctly.

Can you please post logs showing this problem with OLLAMA_DEBUG=1 set? Possibly also related to #12922
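
To make the hypothesis above concrete, here is a minimal, self-contained Go sketch of that race. Everything here is an illustrative assumption, not Ollama's actual scheduler code: `freeVRAM`, `waitForEviction`, and the sizes and delays are hypothetical, and the point is only that sizing the next model against a VRAM reading taken before the evicted runner has fully exited plans a partial offload even though the memory is about to be freed.

```go
package main

import (
	"fmt"
	"time"
)

// freeVRAM stands in for a real device query. In this toy model, the
// evicted runner's memory only becomes visible as free after a delay.
func freeVRAM(start time.Time) uint64 {
	const total = 11 << 30  // ~total VRAM
	const evicted = 8 << 30 // memory held by the runner being evicted
	if time.Since(start) < 500*time.Millisecond {
		return total - evicted // eviction not yet reflected in the reading
	}
	return total
}

// waitForEviction polls until the freed memory shows up or a timeout hits,
// instead of trusting the first (possibly stale) reading.
func waitForEviction(start time.Time, want uint64, timeout time.Duration) uint64 {
	deadline := time.Now().Add(timeout)
	for {
		free := freeVRAM(start)
		if free >= want || time.Now().After(deadline) {
			return free
		}
		time.Sleep(50 * time.Millisecond)
	}
}

func main() {
	start := time.Now()
	const required = 9 << 30 // what the next model needs to fully offload

	// Reading taken immediately after requesting eviction: stale.
	stale := freeVRAM(start)
	fmt.Printf("stale reading: %d GiB free -> plan partial offload\n", stale>>30)

	// Reading taken after waiting for the eviction to settle.
	settled := waitForEviction(start, required, 2*time.Second)
	fmt.Printf("settled reading: %d GiB free -> plan full offload\n", settled>>30)
}
```

If the second reading path is what 0.12.6 effectively did and the first is what 0.12.9 does, it would explain the 10/29-layer offload seen in the 0.12.9 logs; again, this is only a sketch of the described race, not the real code paths.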


@Panican-Whyasker commented on GitHub (Nov 4, 2025):

> Yeah, I noticed that too; 0.12.5 broke embeddings. They have very little QA, if any at all. And that is fine, but the problem is that the client wants to auto-update ASAP. They should have two different release cycles, edge and stable.

Maybe you can start a new Feature Request? I would highly appreciate being able to avoid automatic (minor) updates to still-unstable or broken versions. Moreover, each update is ~1 GB (for Windows at least), and some ISPs here in Belgium still offer only 150 GB or so per month of fast internet (at my partner's place; at my place the limit is 3 TB/mo).


@pjv commented on GitHub (Nov 4, 2025):

@jessegross

> Can you please post logs showing this problem with OLLAMA_DEBUG=1 set?
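
For anyone else reproducing this on the macOS app: assuming the standard app install, the Ollama troubleshooting docs suggest setting the variable via launchctl and then restarting the app (the `OLLAMA_DEBUG=1` in the subprocess env below confirms it took effect):

```shell
# set before relaunching the Ollama app; `launchctl unsetenv OLLAMA_DEBUG` reverts it
launchctl setenv OLLAMA_DEBUG 1
```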

0.12.6 log
time=2025-11-04T05:29:21.064-06:00 level=DEBUG source=sched.go:123 msg="starting llm scheduler"
time=2025-11-04T05:29:21.064-06:00 level=INFO source=runner.go:80 msg="discovering available GPUs..."
time=2025-11-04T05:29:21.064-06:00 level=DEBUG source=runner.go:448 msg="spawning runner with" OLLAMA_LIBRARY_PATH=[/Applications/Ollama.app/Contents/Resources] extra_envs=[]
time=2025-11-04T05:29:21.202-06:00 level=DEBUG source=runner.go:451 msg="bootstrap discovery took" duration=137.794375ms OLLAMA_LIBRARY_PATH=[/Applications/Ollama.app/Contents/Resources] extra_envs=[]
time=2025-11-04T05:29:21.202-06:00 level=DEBUG source=runner.go:118 msg="filtering out unsupported or overlapping GPU library combinations" count=1
time=2025-11-04T05:29:21.202-06:00 level=DEBUG source=runner.go:45 msg="GPU bootstrap discovery took" duration=138.08875ms
time=2025-11-04T05:29:21.202-06:00 level=INFO source=types.go:112 msg="inference compute" id=0 library=Metal compute=0.0 name=Metal description="Apple M2" libdirs="" driver=0.0 pci_id=00:00.0 type=discrete total="10.7 GiB" available="10.7 GiB"
time=2025-11-04T05:29:21.202-06:00 level=INFO source=routes.go:1605 msg="entering low vram mode" "total vram"="10.7 GiB" threshold="20.0 GiB"
[GIN] 2025/11/04 - 05:29:21 | 200 |     152.333µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/11/04 - 05:29:21 | 200 |     325.583µs |       127.0.0.1 | GET      "/api/ps"
time=2025-11-04T05:30:19.430-06:00 level=DEBUG source=runner.go:45 msg="overall device VRAM discovery took" duration=333ns
time=2025-11-04T05:30:19.430-06:00 level=DEBUG source=sched.go:195 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=3 gpu_count=1
time=2025-11-04T05:30:19.440-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.alignment default=32
time=2025-11-04T05:30:19.440-06:00 level=DEBUG source=sched.go:215 msg="loading first model" model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a
time=2025-11-04T05:30:19.471-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.alignment default=32
time=2025-11-04T05:30:19.471-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.pooling_type default=0
time=2025-11-04T05:30:19.471-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=tokenizer.ggml.add_eos_token default=false
time=2025-11-04T05:30:19.471-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-11-04T05:30:19.471-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.rope.scaling.type default=""
time=2025-11-04T05:30:19.471-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.rope.scaling.factor default=1
time=2025-11-04T05:30:19.471-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.rope.scaling.original_context_length default=0
time=2025-11-04T05:30:19.471-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.expert_count default=0
time=2025-11-04T05:30:19.471-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.expert_used_count default=0
time=2025-11-04T05:30:19.471-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.norm_top_k_prob default=true
time=2025-11-04T05:30:19.471-06:00 level=INFO source=server.go:216 msg="enabling flash attention"
time=2025-11-04T05:30:19.472-06:00 level=DEBUG source=server.go:331 msg="adding gpu dependency paths" paths=[/Applications/Ollama.app/Contents/Resources]
time=2025-11-04T05:30:19.472-06:00 level=INFO source=server.go:400 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --ollama-engine --model /Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a --port 52409"
time=2025-11-04T05:30:19.472-06:00 level=DEBUG source=server.go:401 msg=subprocess OLLAMA_DEBUG=1 OLLAMA_HOST=0.0.0.0 OLLAMA_KEEP_ALIVE=-1 OLLAMA_MODELS=/Users/pjv/.ollama/models PATH="/Users/pjv/.local/bin:/Users/pjv/go/bin:/opt/homebrew/sbin:/opt/homebrew/bin:/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin" OLLAMA_CONTEXT_LENGTH=16384 OLLAMA_MAX_LOADED_MODELS=3 DYLD_LIBRARY_PATH=/Applications/Ollama.app/Contents/Resources:/Applications/Ollama.app/Contents/Resources:/Applications/Ollama.app/Contents/Resources OLLAMA_LIBRARY_PATH=/Applications/Ollama.app/Contents/Resources:/Applications/Ollama.app/Contents/Resources
time=2025-11-04T05:30:19.473-06:00 level=INFO source=server.go:676 msg="loading model" "model layers"=37 requested=-1
time=2025-11-04T05:30:19.473-06:00 level=INFO source=server.go:682 msg="system memory" total="16.0 GiB" free="6.7 GiB" free_swap="0 B"
time=2025-11-04T05:30:19.473-06:00 level=INFO source=server.go:690 msg="gpu memory" id=0 library=Metal available="10.7 GiB" free="10.7 GiB" minimum="0 B" overhead="0 B"
time=2025-11-04T05:30:19.481-06:00 level=INFO source=runner.go:1332 msg="starting ollama engine"
time=2025-11-04T05:30:19.481-06:00 level=INFO source=runner.go:1367 msg="Server listening on 127.0.0.1:52409"
time=2025-11-04T05:30:19.484-06:00 level=INFO source=runner.go:1205 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:28000 KvCacheType: NumThreads:4 GPULayers:37[ID:0 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-11-04T05:30:19.499-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.alignment default=32
time=2025-11-04T05:30:19.499-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.description default=""
time=2025-11-04T05:30:19.499-06:00 level=INFO source=ggml.go:134 msg="" architecture=qwen3 file_type=Q8_0 name="Qwen3 4B Instruct 2507" description="" num_tensors=398 num_key_values=33
time=2025-11-04T05:30:19.499-06:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/Applications/Ollama.app/Contents/Resources
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.006 sec
ggml_metal_device_init: GPU name:   Apple M2
ggml_metal_device_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 11453.25 MB
time=2025-11-04T05:30:19.500-06:00 level=INFO source=ggml.go:104 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 compiler=cgo(clang)
ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: use bfloat         = true
ggml_metal_init: use fusion         = true
ggml_metal_init: use concurrency    = true
ggml_metal_init: use graph optimize = true
time=2025-11-04T05:30:19.584-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.pooling_type default=0
time=2025-11-04T05:30:19.584-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=tokenizer.ggml.add_eos_token default=false
time=2025-11-04T05:30:19.584-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-11-04T05:30:19.584-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.rope.scaling.type default=""
time=2025-11-04T05:30:19.584-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.rope.scaling.factor default=1
time=2025-11-04T05:30:19.584-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.rope.scaling.original_context_length default=0
time=2025-11-04T05:30:19.584-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.expert_count default=0
time=2025-11-04T05:30:19.584-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.expert_used_count default=0
time=2025-11-04T05:30:19.584-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.norm_top_k_prob default=true
time=2025-11-04T05:30:19.586-06:00 level=DEBUG source=ggml.go:837 msg="compute graph" nodes=1230 splits=2
time=2025-11-04T05:30:19.586-06:00 level=DEBUG source=device.go:206 msg="model weights" device=Metal size="4.0 GiB"
time=2025-11-04T05:30:19.586-06:00 level=DEBUG source=device.go:211 msg="model weights" device=CPU size="394.1 MiB"
time=2025-11-04T05:30:19.586-06:00 level=DEBUG source=device.go:217 msg="kv cache" device=Metal size="3.9 GiB"
time=2025-11-04T05:30:19.586-06:00 level=DEBUG source=device.go:228 msg="compute graph" device=Metal size="102.5 MiB"
time=2025-11-04T05:30:19.586-06:00 level=DEBUG source=device.go:233 msg="compute graph" device=CPU size="5.0 MiB"
time=2025-11-04T05:30:19.586-06:00 level=DEBUG source=device.go:238 msg="total memory" size="8.3 GiB"
time=2025-11-04T05:30:19.586-06:00 level=DEBUG source=server.go:721 msg=memory success=true required.InputWeights=413265920 required.CPU.Graph=5242880 required.Metal.ID=0 required.Metal.Weights="[107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 413276160]" required.Metal.Cache="[115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 0]" required.Metal.Graph=107509280
time=2025-11-04T05:30:19.586-06:00 level=DEBUG source=server.go:915 msg="available gpu" id=0 library=Metal "available layer vram"="10.6 GiB" backoff=0.00 minimum="0 B" overhead="0 B" graph="102.5 MiB"
time=2025-11-04T05:30:19.586-06:00 level=DEBUG source=server.go:732 msg="new layout created" layers="37[ID:0 Layers:37(0..36)]"
time=2025-11-04T05:30:19.586-06:00 level=INFO source=runner.go:1205 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:28000 KvCacheType: NumThreads:4 GPULayers:37[ID:0 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-11-04T05:30:19.600-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.alignment default=32
time=2025-11-04T05:30:19.602-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.pooling_type default=0
time=2025-11-04T05:30:19.602-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=tokenizer.ggml.add_eos_token default=false
time=2025-11-04T05:30:19.602-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-11-04T05:30:19.602-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.rope.scaling.type default=""
time=2025-11-04T05:30:19.602-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.rope.scaling.factor default=1
time=2025-11-04T05:30:19.602-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.rope.scaling.original_context_length default=0
time=2025-11-04T05:30:19.602-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.expert_count default=0
time=2025-11-04T05:30:19.602-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.expert_used_count default=0
time=2025-11-04T05:30:19.602-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.norm_top_k_prob default=true
time=2025-11-04T05:30:19.886-06:00 level=DEBUG source=ggml.go:837 msg="compute graph" nodes=1230 splits=2
time=2025-11-04T05:30:19.886-06:00 level=DEBUG source=device.go:206 msg="model weights" device=Metal size="4.0 GiB"
time=2025-11-04T05:30:19.886-06:00 level=DEBUG source=device.go:211 msg="model weights" device=CPU size="394.1 MiB"
time=2025-11-04T05:30:19.886-06:00 level=DEBUG source=device.go:217 msg="kv cache" device=Metal size="3.9 GiB"
time=2025-11-04T05:30:19.886-06:00 level=DEBUG source=device.go:228 msg="compute graph" device=Metal size="102.5 MiB"
time=2025-11-04T05:30:19.886-06:00 level=DEBUG source=device.go:233 msg="compute graph" device=CPU size="5.0 MiB"
time=2025-11-04T05:30:19.886-06:00 level=DEBUG source=device.go:238 msg="total memory" size="8.3 GiB"
time=2025-11-04T05:30:19.886-06:00 level=DEBUG source=server.go:721 msg=memory success=true required.InputWeights=413265920 required.CPU.Graph=5242880 required.Metal.ID=0 required.Metal.Weights="[107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 413276160]" required.Metal.Cache="[115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 0]" required.Metal.Graph=107509280
time=2025-11-04T05:30:19.886-06:00 level=DEBUG source=server.go:915 msg="available gpu" id=0 library=Metal "available layer vram"="10.6 GiB" backoff=0.00 minimum="0 B" overhead="0 B" graph="102.5 MiB"
time=2025-11-04T05:30:19.886-06:00 level=DEBUG source=server.go:732 msg="new layout created" layers="37[ID:0 Layers:37(0..36)]"
time=2025-11-04T05:30:19.886-06:00 level=INFO source=runner.go:1205 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:28000 KvCacheType: NumThreads:4 GPULayers:37[ID:0 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-11-04T05:30:19.886-06:00 level=INFO source=ggml.go:480 msg="offloading 36 repeating layers to GPU"
time=2025-11-04T05:30:19.886-06:00 level=INFO source=ggml.go:487 msg="offloading output layer to GPU"
time=2025-11-04T05:30:19.886-06:00 level=INFO source=ggml.go:492 msg="offloaded 37/37 layers to GPU"
time=2025-11-04T05:30:19.886-06:00 level=INFO source=device.go:206 msg="model weights" device=Metal size="4.0 GiB"
time=2025-11-04T05:30:19.886-06:00 level=INFO source=device.go:211 msg="model weights" device=CPU size="394.1 MiB"
time=2025-11-04T05:30:19.886-06:00 level=INFO source=device.go:217 msg="kv cache" device=Metal size="3.9 GiB"
time=2025-11-04T05:30:19.886-06:00 level=INFO source=device.go:228 msg="compute graph" device=Metal size="102.5 MiB"
time=2025-11-04T05:30:19.886-06:00 level=INFO source=device.go:233 msg="compute graph" device=CPU size="5.0 MiB"
time=2025-11-04T05:30:19.886-06:00 level=INFO source=device.go:238 msg="total memory" size="8.3 GiB"
time=2025-11-04T05:30:19.886-06:00 level=INFO source=sched.go:482 msg="loaded runners" count=1
time=2025-11-04T05:30:19.886-06:00 level=DEBUG source=runner.go:45 msg="overall device VRAM discovery took" duration=250ns
time=2025-11-04T05:30:19.886-06:00 level=INFO source=server.go:1272 msg="waiting for llama runner to start responding"
time=2025-11-04T05:30:19.887-06:00 level=INFO source=server.go:1306 msg="waiting for server to become available" status="llm server loading model"
time=2025-11-04T05:30:19.897-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.alignment default=32
time=2025-11-04T05:30:20.138-06:00 level=DEBUG source=server.go:1316 msg="model load progress 0.16"
time=2025-11-04T05:30:20.388-06:00 level=DEBUG source=server.go:1316 msg="model load progress 0.31"
time=2025-11-04T05:30:20.639-06:00 level=DEBUG source=server.go:1316 msg="model load progress 0.47"
time=2025-11-04T05:30:20.890-06:00 level=DEBUG source=server.go:1316 msg="model load progress 0.61"
time=2025-11-04T05:30:21.141-06:00 level=DEBUG source=server.go:1316 msg="model load progress 0.73"
time=2025-11-04T05:30:21.393-06:00 level=DEBUG source=server.go:1316 msg="model load progress 0.86"
time=2025-11-04T05:30:21.643-06:00 level=DEBUG source=server.go:1316 msg="model load progress 0.97"
time=2025-11-04T05:30:21.894-06:00 level=DEBUG source=server.go:1316 msg="model load progress 0.98"
time=2025-11-04T05:30:21.997-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.pooling_type default=0
time=2025-11-04T05:30:22.147-06:00 level=INFO source=server.go:1310 msg="llama runner started in 2.67 seconds"
time=2025-11-04T05:30:22.147-06:00 level=DEBUG source=sched.go:494 msg="finished setting up" runner.name=registry.ollama.ai/library/qwen3:4b-instruct-28k runner.inference="[{ID:0 Library:Metal}]" runner.size="8.3 GiB" runner.vram="8.3 GiB" runner.parallel=1 runner.pid=20963 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a runner.num_ctx=28000
time=2025-11-04T05:30:22.147-06:00 level=DEBUG source=sched.go:534 msg="gpu reported" gpu=0 library=Metal available="10.7 GiB"
time=2025-11-04T05:30:22.147-06:00 level=INFO source=sched.go:545 msg="updated VRAM based on existing loaded models" gpu=0 library=Metal total="10.7 GiB" available="2.7 GiB"
time=2025-11-04T05:30:22.231-06:00 level=DEBUG source=server.go:1422 msg="completion request" images=0 prompt=43962 format=""
time=2025-11-04T05:30:22.257-06:00 level=DEBUG source=cache.go:142 msg="loading cache slot" id=0 cache=0 prompt=10012 used=0 remaining=10012
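The cache-slot line reads like prefix reuse: `used` is the shared prefix between the slot's cached tokens and the new prompt, `remaining` is what still has to be prefilled. Here slot 0 is empty, so all 10012 tokens are processed. A minimal sketch of that bookkeeping as I read the fields (an assumption, not the real cache.go):

```go
package main

import "fmt"

// commonPrefix returns how many leading tokens the cached slot shares
// with the incoming prompt, i.e. how much prefill can be skipped.
func commonPrefix(cached, prompt []int) int {
	n := 0
	for n < len(cached) && n < len(prompt) && cached[n] == prompt[n] {
		n++
	}
	return n
}

func main() {
	prompt := make([]int, 10012) // hypothetical token IDs; length from the log
	var cached []int             // slot 0 is empty on first use
	used := commonPrefix(cached, prompt)
	fmt.Printf("prompt=%d used=%d remaining=%d\n",
		len(prompt), used, len(prompt)-used) // matches the cache.go line above
}
```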
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_rms_norm_mul_f32_4', name = 'kernel_rms_norm_mul_f32_4'
ggml_metal_library_compile_pipeline: loaded kernel_rms_norm_mul_f32_4                     0x142706f80 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mm_q8_0_f32', name = 'kernel_mul_mm_q8_0_f32_bci=0_bco=0'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mm_q8_0_f32_bci=0_bco=0            0x142708040 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_rope_neox_f32', name = 'kernel_rope_neox_f32'
ggml_metal_library_compile_pipeline: loaded kernel_rope_neox_f32                          0x142707710 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_cpy_f32_f16', name = 'kernel_cpy_f32_f16'
ggml_metal_library_compile_pipeline: loaded kernel_cpy_f32_f16                            0x1427087a0 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_flash_attn_ext_blk', name = 'kernel_flash_attn_ext_blk_nqptg=8_ncpsg=64'
ggml_metal_library_compile_pipeline: loaded kernel_flash_attn_ext_blk_nqptg=8_ncpsg=64      0x142708e60 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_flash_attn_ext_f16_dk128_dv128', name = 'kernel_flash_attn_ext_f16_dk128_dv128_mask=1_sinks=0_bias=0_scap=0_kvpad=0_bcm=0_ns10=1024_ns20=1024_nsg=4'
ggml_metal_library_compile_pipeline: loaded kernel_flash_attn_ext_f16_dk128_dv128_mask=1_sinks=0_bias=0_scap=0_kvpad=0_bcm=0_ns10=1024_ns20=1024_nsg=4      0x142709740 | th_max =  768 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_add_fuse_1', name = 'kernel_add_fuse_1'
ggml_metal_library_compile_pipeline: loaded kernel_add_fuse_1                             0x1427092e0 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_swiglu_f32', name = 'kernel_swiglu_f32'
ggml_metal_library_compile_pipeline: loaded kernel_swiglu_f32                             0x142709cb0 | th_max = 1024 | th_width =   32
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.028 sec
ggml_metal_device_init: GPU name:   Apple M2
ggml_metal_device_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 11453.25 MB
llama_model_load_from_file_impl: using device Metal (Apple M2) (unknown id) - 10922 MiB free
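Worth noting where the "10.7 GiB" the scheduler budgets against comes from: it's just Metal's `recommendedMaxWorkingSetSize` above in other units (assuming ggml prints decimal megabytes):

```go
package main

import "fmt"

func main() {
	bytes := 11453.25 * 1e6 // recommendedMaxWorkingSetSize, decimal MB -> bytes
	fmt.Printf("%d MiB\n", int(bytes/(1<<20))) // 10922 MiB, the "free" figure llama.cpp reports above
	fmt.Printf("%.1f GiB\n", bytes/(1<<30))    // 10.7 GiB, the "total vram" the scheduler uses
}
```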
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   8:                          llama.block_count u32              = 28
llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
llama_model_loader: - kv  10:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  16:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  17:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  18:                          general.file_type u32              = 15
llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type q4_K:  168 tensors
llama_model_loader: - type q6_K:   29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 1.87 GiB (5.01 BPW)
init_tokenizer: initializing tokenizer for type 2
load: control token: 128254 '<|reserved_special_token_246|>' is not marked as EOG
load: control token: 128252 '<|reserved_special_token_244|>' is not marked as EOG
load: control token: 128251 '<|reserved_special_token_243|>' is not marked as EOG
load: control token: 128250 '<|reserved_special_token_242|>' is not marked as EOG
load: control token: 128248 '<|reserved_special_token_240|>' is not marked as EOG
load: control token: 128247 '<|reserved_special_token_239|>' is not marked as EOG
load: control token: 128245 '<|reserved_special_token_237|>' is not marked as EOG
load: control token: 128244 '<|reserved_special_token_236|>' is not marked as EOG
load: control token: 128243 '<|reserved_special_token_235|>' is not marked as EOG
load: control token: 128240 '<|reserved_special_token_232|>' is not marked as EOG
load: control token: 128238 '<|reserved_special_token_230|>' is not marked as EOG
load: control token: 128235 '<|reserved_special_token_227|>' is not marked as EOG
load: control token: 128234 '<|reserved_special_token_226|>' is not marked as EOG
load: control token: 128229 '<|reserved_special_token_221|>' is not marked as EOG
load: control token: 128227 '<|reserved_special_token_219|>' is not marked as EOG
load: control token: 128226 '<|reserved_special_token_218|>' is not marked as EOG
load: control token: 128224 '<|reserved_special_token_216|>' is not marked as EOG
load: control token: 128223 '<|reserved_special_token_215|>' is not marked as EOG
load: control token: 128221 '<|reserved_special_token_213|>' is not marked as EOG
load: control token: 128219 '<|reserved_special_token_211|>' is not marked as EOG
load: control token: 128218 '<|reserved_special_token_210|>' is not marked as EOG
load: control token: 128217 '<|reserved_special_token_209|>' is not marked as EOG
load: control token: 128216 '<|reserved_special_token_208|>' is not marked as EOG
load: control token: 128215 '<|reserved_special_token_207|>' is not marked as EOG
load: control token: 128213 '<|reserved_special_token_205|>' is not marked as EOG
load: control token: 128211 '<|reserved_special_token_203|>' is not marked as EOG
load: control token: 128210 '<|reserved_special_token_202|>' is not marked as EOG
load: control token: 128209 '<|reserved_special_token_201|>' is not marked as EOG
load: control token: 128208 '<|reserved_special_token_200|>' is not marked as EOG
load: control token: 128207 '<|reserved_special_token_199|>' is not marked as EOG
load: control token: 128204 '<|reserved_special_token_196|>' is not marked as EOG
load: control token: 128202 '<|reserved_special_token_194|>' is not marked as EOG
load: control token: 128197 '<|reserved_special_token_189|>' is not marked as EOG
load: control token: 128195 '<|reserved_special_token_187|>' is not marked as EOG
load: control token: 128194 '<|reserved_special_token_186|>' is not marked as EOG
load: control token: 128191 '<|reserved_special_token_183|>' is not marked as EOG
load: control token: 128190 '<|reserved_special_token_182|>' is not marked as EOG
load: control token: 128188 '<|reserved_special_token_180|>' is not marked as EOG
load: control token: 128187 '<|reserved_special_token_179|>' is not marked as EOG
load: control token: 128185 '<|reserved_special_token_177|>' is not marked as EOG
load: control token: 128184 '<|reserved_special_token_176|>' is not marked as EOG
load: control token: 128183 '<|reserved_special_token_175|>' is not marked as EOG
load: control token: 128178 '<|reserved_special_token_170|>' is not marked as EOG
load: control token: 128177 '<|reserved_special_token_169|>' is not marked as EOG
load: control token: 128176 '<|reserved_special_token_168|>' is not marked as EOG
load: control token: 128175 '<|reserved_special_token_167|>' is not marked as EOG
load: control token: 128174 '<|reserved_special_token_166|>' is not marked as EOG
load: control token: 128173 '<|reserved_special_token_165|>' is not marked as EOG
load: control token: 128172 '<|reserved_special_token_164|>' is not marked as EOG
load: control token: 128169 '<|reserved_special_token_161|>' is not marked as EOG
load: control token: 128167 '<|reserved_special_token_159|>' is not marked as EOG
load: control token: 128166 '<|reserved_special_token_158|>' is not marked as EOG
load: control token: 128160 '<|reserved_special_token_152|>' is not marked as EOG
load: control token: 128159 '<|reserved_special_token_151|>' is not marked as EOG
load: control token: 128157 '<|reserved_special_token_149|>' is not marked as EOG
load: control token: 128156 '<|reserved_special_token_148|>' is not marked as EOG
load: control token: 128154 '<|reserved_special_token_146|>' is not marked as EOG
load: control token: 128152 '<|reserved_special_token_144|>' is not marked as EOG
load: control token: 128151 '<|reserved_special_token_143|>' is not marked as EOG
load: control token: 128150 '<|reserved_special_token_142|>' is not marked as EOG
load: control token: 128147 '<|reserved_special_token_139|>' is not marked as EOG
load: control token: 128144 '<|reserved_special_token_136|>' is not marked as EOG
load: control token: 128142 '<|reserved_special_token_134|>' is not marked as EOG
load: control token: 128141 '<|reserved_special_token_133|>' is not marked as EOG
load: control token: 128140 '<|reserved_special_token_132|>' is not marked as EOG
load: control token: 128133 '<|reserved_special_token_125|>' is not marked as EOG
load: control token: 128130 '<|reserved_special_token_122|>' is not marked as EOG
load: control token: 128128 '<|reserved_special_token_120|>' is not marked as EOG
load: control token: 128127 '<|reserved_special_token_119|>' is not marked as EOG
load: control token: 128126 '<|reserved_special_token_118|>' is not marked as EOG
load: control token: 128125 '<|reserved_special_token_117|>' is not marked as EOG
load: control token: 128124 '<|reserved_special_token_116|>' is not marked as EOG
load: control token: 128123 '<|reserved_special_token_115|>' is not marked as EOG
load: control token: 128122 '<|reserved_special_token_114|>' is not marked as EOG
load: control token: 128121 '<|reserved_special_token_113|>' is not marked as EOG
load: control token: 128120 '<|reserved_special_token_112|>' is not marked as EOG
load: control token: 128119 '<|reserved_special_token_111|>' is not marked as EOG
load: control token: 128116 '<|reserved_special_token_108|>' is not marked as EOG
load: control token: 128115 '<|reserved_special_token_107|>' is not marked as EOG
load: control token: 128114 '<|reserved_special_token_106|>' is not marked as EOG
load: control token: 128113 '<|reserved_special_token_105|>' is not marked as EOG
load: control token: 128111 '<|reserved_special_token_103|>' is not marked as EOG
load: control token: 128110 '<|reserved_special_token_102|>' is not marked as EOG
load: control token: 128107 '<|reserved_special_token_99|>' is not marked as EOG
load: control token: 128106 '<|reserved_special_token_98|>' is not marked as EOG
load: control token: 128105 '<|reserved_special_token_97|>' is not marked as EOG
load: control token: 128104 '<|reserved_special_token_96|>' is not marked as EOG
load: control token: 128103 '<|reserved_special_token_95|>' is not marked as EOG
load: control token: 128100 '<|reserved_special_token_92|>' is not marked as EOG
load: control token: 128097 '<|reserved_special_token_89|>' is not marked as EOG
load: control token: 128096 '<|reserved_special_token_88|>' is not marked as EOG
load: control token: 128094 '<|reserved_special_token_86|>' is not marked as EOG
load: control token: 128093 '<|reserved_special_token_85|>' is not marked as EOG
load: control token: 128090 '<|reserved_special_token_82|>' is not marked as EOG
load: control token: 128089 '<|reserved_special_token_81|>' is not marked as EOG
load: control token: 128087 '<|reserved_special_token_79|>' is not marked as EOG
load: control token: 128085 '<|reserved_special_token_77|>' is not marked as EOG
load: control token: 128080 '<|reserved_special_token_72|>' is not marked as EOG
load: control token: 128077 '<|reserved_special_token_69|>' is not marked as EOG
load: control token: 128076 '<|reserved_special_token_68|>' is not marked as EOG
load: control token: 128073 '<|reserved_special_token_65|>' is not marked as EOG
load: control token: 128070 '<|reserved_special_token_62|>' is not marked as EOG
load: control token: 128069 '<|reserved_special_token_61|>' is not marked as EOG
load: control token: 128067 '<|reserved_special_token_59|>' is not marked as EOG
load: control token: 128064 '<|reserved_special_token_56|>' is not marked as EOG
load: control token: 128062 '<|reserved_special_token_54|>' is not marked as EOG
load: control token: 128061 '<|reserved_special_token_53|>' is not marked as EOG
load: control token: 128060 '<|reserved_special_token_52|>' is not marked as EOG
load: control token: 128054 '<|reserved_special_token_46|>' is not marked as EOG
load: control token: 128045 '<|reserved_special_token_37|>' is not marked as EOG
load: control token: 128044 '<|reserved_special_token_36|>' is not marked as EOG
load: control token: 128043 '<|reserved_special_token_35|>' is not marked as EOG
load: control token: 128042 '<|reserved_special_token_34|>' is not marked as EOG
load: control token: 128038 '<|reserved_special_token_30|>' is not marked as EOG
load: control token: 128037 '<|reserved_special_token_29|>' is not marked as EOG
load: control token: 128035 '<|reserved_special_token_27|>' is not marked as EOG
load: control token: 128034 '<|reserved_special_token_26|>' is not marked as EOG
load: control token: 128033 '<|reserved_special_token_25|>' is not marked as EOG
load: control token: 128032 '<|reserved_special_token_24|>' is not marked as EOG
load: control token: 128030 '<|reserved_special_token_22|>' is not marked as EOG
load: control token: 128029 '<|reserved_special_token_21|>' is not marked as EOG
load: control token: 128028 '<|reserved_special_token_20|>' is not marked as EOG
load: control token: 128026 '<|reserved_special_token_18|>' is not marked as EOG
load: control token: 128025 '<|reserved_special_token_17|>' is not marked as EOG
load: control token: 128024 '<|reserved_special_token_16|>' is not marked as EOG
load: control token: 128022 '<|reserved_special_token_14|>' is not marked as EOG
load: control token: 128020 '<|reserved_special_token_12|>' is not marked as EOG
load: control token: 128017 '<|reserved_special_token_9|>' is not marked as EOG
load: control token: 128016 '<|reserved_special_token_8|>' is not marked as EOG
load: control token: 128015 '<|reserved_special_token_7|>' is not marked as EOG
load: control token: 128014 '<|reserved_special_token_6|>' is not marked as EOG
load: control token: 128013 '<|reserved_special_token_5|>' is not marked as EOG
load: control token: 128011 '<|reserved_special_token_3|>' is not marked as EOG
load: control token: 128010 '<|python_tag|>' is not marked as EOG
load: control token: 128006 '<|start_header_id|>' is not marked as EOG
load: control token: 128003 '<|reserved_special_token_1|>' is not marked as EOG
load: control token: 128002 '<|reserved_special_token_0|>' is not marked as EOG
load: control token: 128000 '<|begin_of_text|>' is not marked as EOG
load: control token: 128041 '<|reserved_special_token_33|>' is not marked as EOG
load: control token: 128063 '<|reserved_special_token_55|>' is not marked as EOG
load: control token: 128046 '<|reserved_special_token_38|>' is not marked as EOG
load: control token: 128007 '<|end_header_id|>' is not marked as EOG
load: control token: 128065 '<|reserved_special_token_57|>' is not marked as EOG
load: control token: 128171 '<|reserved_special_token_163|>' is not marked as EOG
load: control token: 128162 '<|reserved_special_token_154|>' is not marked as EOG
load: control token: 128165 '<|reserved_special_token_157|>' is not marked as EOG
load: control token: 128057 '<|reserved_special_token_49|>' is not marked as EOG
load: control token: 128050 '<|reserved_special_token_42|>' is not marked as EOG
load: control token: 128056 '<|reserved_special_token_48|>' is not marked as EOG
load: control token: 128230 '<|reserved_special_token_222|>' is not marked as EOG
load: control token: 128098 '<|reserved_special_token_90|>' is not marked as EOG
load: control token: 128153 '<|reserved_special_token_145|>' is not marked as EOG
load: control token: 128084 '<|reserved_special_token_76|>' is not marked as EOG
load: control token: 128082 '<|reserved_special_token_74|>' is not marked as EOG
load: control token: 128102 '<|reserved_special_token_94|>' is not marked as EOG
load: control token: 128253 '<|reserved_special_token_245|>' is not marked as EOG
load: control token: 128179 '<|reserved_special_token_171|>' is not marked as EOG
load: control token: 128071 '<|reserved_special_token_63|>' is not marked as EOG
load: control token: 128135 '<|reserved_special_token_127|>' is not marked as EOG
load: control token: 128161 '<|reserved_special_token_153|>' is not marked as EOG
load: control token: 128164 '<|reserved_special_token_156|>' is not marked as EOG
load: control token: 128134 '<|reserved_special_token_126|>' is not marked as EOG
load: control token: 128249 '<|reserved_special_token_241|>' is not marked as EOG
load: control token: 128004 '<|finetune_right_pad_id|>' is not marked as EOG
load: control token: 128036 '<|reserved_special_token_28|>' is not marked as EOG
load: control token: 128148 '<|reserved_special_token_140|>' is not marked as EOG
load: control token: 128181 '<|reserved_special_token_173|>' is not marked as EOG
load: control token: 128222 '<|reserved_special_token_214|>' is not marked as EOG
load: control token: 128075 '<|reserved_special_token_67|>' is not marked as EOG
load: control token: 128241 '<|reserved_special_token_233|>' is not marked as EOG
load: control token: 128051 '<|reserved_special_token_43|>' is not marked as EOG
load: control token: 128068 '<|reserved_special_token_60|>' is not marked as EOG
load: control token: 128149 '<|reserved_special_token_141|>' is not marked as EOG
load: control token: 128201 '<|reserved_special_token_193|>' is not marked as EOG
load: control token: 128058 '<|reserved_special_token_50|>' is not marked as EOG
load: control token: 128146 '<|reserved_special_token_138|>' is not marked as EOG
load: control token: 128143 '<|reserved_special_token_135|>' is not marked as EOG
load: control token: 128023 '<|reserved_special_token_15|>' is not marked as EOG
load: control token: 128039 '<|reserved_special_token_31|>' is not marked as EOG
load: control token: 128132 '<|reserved_special_token_124|>' is not marked as EOG
load: control token: 128101 '<|reserved_special_token_93|>' is not marked as EOG
load: control token: 128212 '<|reserved_special_token_204|>' is not marked as EOG
load: control token: 128189 '<|reserved_special_token_181|>' is not marked as EOG
load: control token: 128225 '<|reserved_special_token_217|>' is not marked as EOG
load: control token: 128129 '<|reserved_special_token_121|>' is not marked as EOG
load: control token: 128005 '<|reserved_special_token_2|>' is not marked as EOG
load: control token: 128078 '<|reserved_special_token_70|>' is not marked as EOG
load: control token: 128163 '<|reserved_special_token_155|>' is not marked as EOG
load: control token: 128072 '<|reserved_special_token_64|>' is not marked as EOG
load: control token: 128112 '<|reserved_special_token_104|>' is not marked as EOG
load: control token: 128186 '<|reserved_special_token_178|>' is not marked as EOG
load: control token: 128095 '<|reserved_special_token_87|>' is not marked as EOG
load: control token: 128109 '<|reserved_special_token_101|>' is not marked as EOG
load: control token: 128099 '<|reserved_special_token_91|>' is not marked as EOG
load: control token: 128138 '<|reserved_special_token_130|>' is not marked as EOG
load: control token: 128193 '<|reserved_special_token_185|>' is not marked as EOG
load: control token: 128199 '<|reserved_special_token_191|>' is not marked as EOG
load: control token: 128048 '<|reserved_special_token_40|>' is not marked as EOG
load: control token: 128088 '<|reserved_special_token_80|>' is not marked as EOG
load: control token: 128192 '<|reserved_special_token_184|>' is not marked as EOG
load: control token: 128136 '<|reserved_special_token_128|>' is not marked as EOG
load: control token: 128092 '<|reserved_special_token_84|>' is not marked as EOG
load: control token: 128158 '<|reserved_special_token_150|>' is not marked as EOG
load: control token: 128049 '<|reserved_special_token_41|>' is not marked as EOG
load: control token: 128031 '<|reserved_special_token_23|>' is not marked as EOG
load: control token: 128255 '<|reserved_special_token_247|>' is not marked as EOG
load: control token: 128182 '<|reserved_special_token_174|>' is not marked as EOG
load: control token: 128066 '<|reserved_special_token_58|>' is not marked as EOG
load: control token: 128180 '<|reserved_special_token_172|>' is not marked as EOG
load: control token: 128233 '<|reserved_special_token_225|>' is not marked as EOG
load: control token: 128079 '<|reserved_special_token_71|>' is not marked as EOG
load: control token: 128081 '<|reserved_special_token_73|>' is not marked as EOG
load: control token: 128231 '<|reserved_special_token_223|>' is not marked as EOG
load: control token: 128196 '<|reserved_special_token_188|>' is not marked as EOG
load: control token: 128047 '<|reserved_special_token_39|>' is not marked as EOG
load: control token: 128083 '<|reserved_special_token_75|>' is not marked as EOG
load: control token: 128139 '<|reserved_special_token_131|>' is not marked as EOG
load: control token: 128131 '<|reserved_special_token_123|>' is not marked as EOG
load: control token: 128118 '<|reserved_special_token_110|>' is not marked as EOG
load: control token: 128053 '<|reserved_special_token_45|>' is not marked as EOG
load: control token: 128220 '<|reserved_special_token_212|>' is not marked as EOG
load: control token: 128108 '<|reserved_special_token_100|>' is not marked as EOG
load: control token: 128091 '<|reserved_special_token_83|>' is not marked as EOG
load: control token: 128203 '<|reserved_special_token_195|>' is not marked as EOG
load: control token: 128059 '<|reserved_special_token_51|>' is not marked as EOG
load: control token: 128019 '<|reserved_special_token_11|>' is not marked as EOG
load: control token: 128170 '<|reserved_special_token_162|>' is not marked as EOG
load: control token: 128205 '<|reserved_special_token_197|>' is not marked as EOG
load: control token: 128040 '<|reserved_special_token_32|>' is not marked as EOG
load: control token: 128200 '<|reserved_special_token_192|>' is not marked as EOG
load: control token: 128236 '<|reserved_special_token_228|>' is not marked as EOG
load: control token: 128145 '<|reserved_special_token_137|>' is not marked as EOG
load: control token: 128168 '<|reserved_special_token_160|>' is not marked as EOG
load: control token: 128214 '<|reserved_special_token_206|>' is not marked as EOG
load: control token: 128137 '<|reserved_special_token_129|>' is not marked as EOG
load: control token: 128232 '<|reserved_special_token_224|>' is not marked as EOG
load: control token: 128239 '<|reserved_special_token_231|>' is not marked as EOG
load: control token: 128055 '<|reserved_special_token_47|>' is not marked as EOG
load: control token: 128228 '<|reserved_special_token_220|>' is not marked as EOG
load: control token: 128206 '<|reserved_special_token_198|>' is not marked as EOG
load: control token: 128018 '<|reserved_special_token_10|>' is not marked as EOG
load: control token: 128012 '<|reserved_special_token_4|>' is not marked as EOG
load: control token: 128198 '<|reserved_special_token_190|>' is not marked as EOG
load: control token: 128021 '<|reserved_special_token_13|>' is not marked as EOG
load: control token: 128086 '<|reserved_special_token_78|>' is not marked as EOG
load: control token: 128074 '<|reserved_special_token_66|>' is not marked as EOG
load: control token: 128027 '<|reserved_special_token_19|>' is not marked as EOG
load: control token: 128242 '<|reserved_special_token_234|>' is not marked as EOG
load: control token: 128155 '<|reserved_special_token_147|>' is not marked as EOG
load: control token: 128052 '<|reserved_special_token_44|>' is not marked as EOG
load: control token: 128246 '<|reserved_special_token_238|>' is not marked as EOG
load: control token: 128117 '<|reserved_special_token_109|>' is not marked as EOG
load: control token: 128237 '<|reserved_special_token_229|>' is not marked as EOG
load: printing all EOG tokens:
load:   - 128001 ('<|end_of_text|>')
load:   - 128008 ('<|eom_id|>')
load:   - 128009 ('<|eot_id|>')
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = llama
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 3.21 B
print_info: general.name     = Llama 3.2 3B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128001 '<|end_of_text|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128001 '<|end_of_text|>'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
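(The 5.01 BPW in `print_info` checks out too: bits in the file divided by parameter count, redone here from the rounded values printed above.)

```go
package main

import "fmt"

func main() {
	fileBytes := 1.87 * float64(1<<30) // "file size = 1.87 GiB"
	params := 3.21e9                   // "model params = 3.21 B"
	// ~5.00 bits per weight; the small gap to the printed 5.01 is just
	// rounding in the two displayed inputs.
	fmt.Printf("%.2f bits per weight\n", fileBytes*8/params)
}
```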
time=2025-11-04T05:30:22.969-06:00 level=DEBUG source=server.go:331 msg="adding gpu dependency paths" paths=[/Applications/Ollama.app/Contents/Resources]
time=2025-11-04T05:30:22.969-06:00 level=INFO source=server.go:400 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --model /Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --port 52412"
time=2025-11-04T05:30:22.969-06:00 level=DEBUG source=server.go:401 msg=subprocess OLLAMA_DEBUG=1 OLLAMA_HOST=0.0.0.0 OLLAMA_KEEP_ALIVE=-1 OLLAMA_MODELS=/Users/pjv/.ollama/models PATH="/Users/pjv/.local/bin:/Users/pjv/go/bin:/opt/homebrew/sbin:/opt/homebrew/bin:/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin" OLLAMA_CONTEXT_LENGTH=16384 OLLAMA_MAX_LOADED_MODELS=3 DYLD_LIBRARY_PATH=/Applications/Ollama.app/Contents/Resources:/Applications/Ollama.app/Contents/Resources:/Applications/Ollama.app/Contents/Resources OLLAMA_LIBRARY_PATH=/Applications/Ollama.app/Contents/Resources:/Applications/Ollama.app/Contents/Resources
time=2025-11-04T05:30:22.973-06:00 level=INFO source=server.go:505 msg="system memory" total="16.0 GiB" free="3.6 GiB" free_swap="0 B"
time=2025-11-04T05:30:22.973-06:00 level=DEBUG source=memory.go:181 msg=evaluating library=Metal gpu_count=1 available="[2.7 GiB]"
time=2025-11-04T05:30:22.974-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=llama.vision.block_count default=0
time=2025-11-04T05:30:22.974-06:00 level=DEBUG source=ggml.go:611 msg="default cache size estimate" "attention MiB"=1792 "attention bytes"=1879048192 "recurrent MiB"=0 "recurrent bytes"=0
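That "default cache size estimate" of 1792 MiB is reproducible from the GGUF metadata dumped above plus my `OLLAMA_CONTEXT_LENGTH=16384`, assuming an f16 K/V cache (2 bytes per element):

```go
package main

import "fmt"

func main() {
	const (
		blocks   = 28    // llama.block_count
		headsKV  = 8     // llama.attention.head_count_kv
		headDim  = 128   // llama.attention.key_length / value_length
		ctx      = 16384 // OLLAMA_CONTEXT_LENGTH
		f16Bytes = 2     // assuming an f16 K/V cache
	)
	perTensor := ctx * headsKV * headDim * f16Bytes // one block's K (or V) cache
	total := blocks * 2 * perTensor                 // K and V for every block
	fmt.Println(total, "bytes =", total>>20, "MiB") // 1879048192 bytes = 1792 MiB
}
```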
time=2025-11-04T05:30:22.974-06:00 level=INFO source=server.go:512 msg="model requires more memory than is currently available, evicting a model to make space" estimate.library="" estimate.layers.requested=0 estimate.layers.model=0 estimate.layers.offload=0 estimate.layers.split=[] estimate.memory.available=[] estimate.memory.gpu_overhead="0 B" estimate.memory.required.full="0 B" estimate.memory.required.partial="0 B" estimate.memory.required.kv="0 B" estimate.memory.required.allocations=[] estimate.memory.weights.total="0 B" estimate.memory.weights.repeating="0 B" estimate.memory.weights.nonrepeating="0 B" estimate.memory.graph.full="0 B" estimate.memory.graph.partial="0 B"
time=2025-11-04T05:30:22.974-06:00 level=DEBUG source=sched.go:787 msg="no idle runners, picking the shortest duration" runner_count=1 runner.name=registry.ollama.ai/library/qwen3:4b-instruct-28k runner.inference="[{ID:0 Library:Metal}]" runner.size="8.3 GiB" runner.vram="8.3 GiB" runner.parallel=1 runner.pid=20963 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a runner.num_ctx=28000
time=2025-11-04T05:30:22.974-06:00 level=DEBUG source=sched.go:240 msg="resetting model to expire immediately to make room" runner.name=registry.ollama.ai/library/qwen3:4b-instruct-28k runner.inference="[{ID:0 Library:Metal}]" runner.size="8.3 GiB" runner.vram="8.3 GiB" runner.parallel=1 runner.pid=20963 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a runner.num_ctx=28000 refCount=1
time=2025-11-04T05:30:22.974-06:00 level=DEBUG source=sched.go:251 msg="waiting for pending requests to complete and unload to occur" runner.name=registry.ollama.ai/library/qwen3:4b-instruct-28k runner.inference="[{ID:0 Library:Metal}]" runner.size="8.3 GiB" runner.vram="8.3 GiB" runner.parallel=1 runner.pid=20963 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a runner.num_ctx=28000
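This is the interesting part: with `OLLAMA_KEEP_ALIVE=-1` the qwen3 runner isn't idle (refCount=1), yet the scheduler immediately marks it for eviction to make room for llama 3.2. A minimal sketch of the selection logic as I read these sched.go messages (my assumption about the implementation, not the actual code):

```go
package main

import (
	"fmt"
	"time"
)

type runner struct {
	name      string
	refCount  int           // in-flight requests
	remaining time.Duration // time left on keep-alive
}

// pickVictim mirrors "no idle runners, picking the shortest duration":
// prefer an idle runner, otherwise the one expiring soonest.
func pickVictim(loaded []*runner) *runner {
	var victim *runner
	for _, r := range loaded {
		if r.refCount == 0 {
			return r // an idle runner is always preferred
		}
		if victim == nil || r.remaining < victim.remaining {
			victim = r
		}
	}
	return victim
}

func main() {
	loaded := []*runner{{name: "qwen3:4b-instruct-28k", refCount: 1, remaining: time.Hour}}
	v := pickVictim(loaded)
	v.remaining = 0 // "resetting model to expire immediately to make room"
	fmt.Printf("evicting %s once its %d pending request(s) finish\n", v.name, v.refCount)
}
```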
time=2025-11-04T05:30:22.987-06:00 level=INFO source=runner.go:893 msg="starting go runner"
time=2025-11-04T05:30:22.987-06:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/Applications/Ollama.app/Contents/Resources
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.019 sec
ggml_metal_device_init: GPU name:   Apple M2
ggml_metal_device_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 11453.25 MB
time=2025-11-04T05:30:22.988-06:00 level=INFO source=ggml.go:104 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 compiler=cgo(clang)
time=2025-11-04T05:30:23.087-06:00 level=INFO source=runner.go:929 msg="Server listening on 127.0.0.1:52412"
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mm_q8_0_f32', name = 'kernel_mul_mm_q8_0_f32_bci=0_bco=1'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mm_q8_0_f32_bci=0_bco=1            0x142713fb0 | th_max =  896 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_get_rows_f32', name = 'kernel_get_rows_f32'
ggml_metal_library_compile_pipeline: loaded kernel_get_rows_f32                           0x142613cf0 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_add_row_c4_fuse_1', name = 'kernel_add_row_c4_fuse_1'
ggml_metal_library_compile_pipeline: loaded kernel_add_row_c4_fuse_1                      0x142614350 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mv_q8_0_f32', name = 'kernel_mul_mv_q8_0_f32_nsg=4'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mv_q8_0_f32_nsg=4                  0x142614ee0 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_flash_attn_ext_vec_f16_dk128_dv128', name = 'kernel_flash_attn_ext_vec_f16_dk128_dv128_mask=1_sink=0_bias=0_scap=0_kvpad=0_ns10=1024_ns20=1024_nsg=4_nwg=32'
ggml_metal_library_compile_pipeline: loaded kernel_flash_attn_ext_vec_f16_dk128_dv128_mask=1_sink=0_bias=0_scap=0_kvpad=0_ns10=1024_ns20=1024_nsg=4_nwg=32      0x142615140 | th_max =  448 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_flash_attn_ext_vec_reduce', name = 'kernel_flash_attn_ext_vec_reduce_dv=128_nwg=32'
ggml_metal_library_compile_pipeline: loaded kernel_flash_attn_ext_vec_reduce_dv=128_nwg=32      0x142615550 | th_max = 1024 | th_width =   32
[GIN] 2025/11/04 - 05:31:07 | 200 | 47.855582791s |  100.116.113.16 | POST     "/v1/chat/completions"
time=2025-11-04T05:31:07.179-06:00 level=DEBUG source=sched.go:502 msg="context for request finished"
time=2025-11-04T05:31:07.179-06:00 level=DEBUG source=sched.go:287 msg="runner with zero duration has gone idle, expiring to unload" runner.name=registry.ollama.ai/library/qwen3:4b-instruct-28k runner.inference="[{ID:0 Library:Metal}]" runner.size="8.3 GiB" runner.vram="8.3 GiB" runner.parallel=1 runner.pid=20963 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a runner.num_ctx=28000
time=2025-11-04T05:31:07.179-06:00 level=DEBUG source=sched.go:312 msg="after processing request finished event" runner.name=registry.ollama.ai/library/qwen3:4b-instruct-28k runner.inference="[{ID:0 Library:Metal}]" runner.size="8.3 GiB" runner.vram="8.3 GiB" runner.parallel=1 runner.pid=20963 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a runner.num_ctx=28000 refCount=0
time=2025-11-04T05:31:07.179-06:00 level=DEBUG source=sched.go:315 msg="runner expired event received" runner.name=registry.ollama.ai/library/qwen3:4b-instruct-28k runner.inference="[{ID:0 Library:Metal}]" runner.size="8.3 GiB" runner.vram="8.3 GiB" runner.parallel=1 runner.pid=20963 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a runner.num_ctx=28000
time=2025-11-04T05:31:07.179-06:00 level=DEBUG source=sched.go:330 msg="got lock to unload expired event" runner.name=registry.ollama.ai/library/qwen3:4b-instruct-28k runner.inference="[{ID:0 Library:Metal}]" runner.size="8.3 GiB" runner.vram="8.3 GiB" runner.parallel=1 runner.pid=20963 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a runner.num_ctx=28000
time=2025-11-04T05:31:07.179-06:00 level=DEBUG source=sched.go:353 msg="starting background wait for VRAM recovery" runner.name=registry.ollama.ai/library/qwen3:4b-instruct-28k runner.inference="[{ID:0 Library:Metal}]" runner.size="8.3 GiB" runner.vram="8.3 GiB" runner.parallel=1 runner.pid=20963 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a runner.num_ctx=28000
time=2025-11-04T05:31:07.179-06:00 level=DEBUG source=sched.go:637 msg="no need to wait for VRAM recovery" runner.name=registry.ollama.ai/library/qwen3:4b-instruct-28k runner.inference="[{ID:0 Library:Metal}]" runner.size="8.3 GiB" runner.vram="8.3 GiB" runner.parallel=1 runner.pid=20963 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a runner.num_ctx=28000
time=2025-11-04T05:31:07.179-06:00 level=DEBUG source=server.go:1720 msg="stopping llama server" pid=20963
time=2025-11-04T05:31:07.179-06:00 level=DEBUG source=server.go:1726 msg="waiting for llama server to exit" pid=20963
time=2025-11-04T05:31:07.229-06:00 level=DEBUG source=server.go:1730 msg="llama server stopped" pid=20963
time=2025-11-04T05:31:07.229-06:00 level=DEBUG source=sched.go:362 msg="runner terminated and removed from list, blocking for VRAM recovery" runner.size="8.3 GiB" runner.vram="8.3 GiB" runner.parallel=1 runner.pid=20963 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a
time=2025-11-04T05:31:07.229-06:00 level=DEBUG source=sched.go:365 msg="sending an unloaded event" runner.size="8.3 GiB" runner.vram="8.3 GiB" runner.parallel=1 runner.pid=20963 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a
time=2025-11-04T05:31:07.229-06:00 level=DEBUG source=sched.go:257 msg="unload completed" runner.size="8.3 GiB" runner.vram="8.3 GiB" runner.parallel=1 runner.pid=20963 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a
time=2025-11-04T05:31:07.229-06:00 level=DEBUG source=runner.go:45 msg="overall device VRAM discovery took" duration=584ns
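One detail: the scheduler skips the VRAM-recovery wait here, presumably because Metal's unified memory is released as soon as the runner process exits. On discrete GPUs I'd guess the "blocking for VRAM recovery" step is a polling loop shaped roughly like this (entirely my sketch; `freeVRAM`, the 80% threshold, and the timeout are made up):

```go
package main

import (
	"fmt"
	"time"
)

// freeVRAM would query the device; stubbed so the sketch runs.
func freeVRAM() uint64 { return 11 << 30 }

// waitForVRAMRecovery polls until most of the evicted runner's footprint
// shows up as free again, or gives up after a timeout.
func waitForVRAMRecovery(baseline, evicted uint64, timeout time.Duration) {
	target := baseline + evicted/10*8 // accept ~80% recovery (invented threshold)
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if freeVRAM() >= target {
			fmt.Println("VRAM recovered")
			return
		}
		time.Sleep(250 * time.Millisecond)
	}
	fmt.Println("timed out waiting for VRAM recovery")
}

func main() {
	// ~2.7 GiB free before unload, ~8.3 GiB runner evicted (figures from the log).
	waitForVRAMRecovery(2<<30, 8<<30, 5*time.Second)
}
```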
time=2025-11-04T05:31:07.242-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.alignment default=32
time=2025-11-04T05:31:07.242-06:00 level=DEBUG source=sched.go:215 msg="loading first model" model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff
time=2025-11-04T05:31:07.274-06:00 level=INFO source=server.go:505 msg="system memory" total="16.0 GiB" free="8.1 GiB" free_swap="0 B"
time=2025-11-04T05:31:07.274-06:00 level=DEBUG source=memory.go:181 msg=evaluating library=Metal gpu_count=1 available="[10.7 GiB]"
time=2025-11-04T05:31:07.274-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=llama.vision.block_count default=0
time=2025-11-04T05:31:07.274-06:00 level=DEBUG source=ggml.go:611 msg="default cache size estimate" "attention MiB"=1792 "attention bytes"=1879048192 "recurrent MiB"=0 "recurrent bytes"=0
time=2025-11-04T05:31:07.274-06:00 level=INFO source=memory.go:36 msg="new model will fit in available VRAM across minimum required GPUs, loading" model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff library=Metal parallel=1 required="4.6 GiB" gpus=1
time=2025-11-04T05:31:07.274-06:00 level=DEBUG source=memory.go:181 msg=evaluating library=Metal gpu_count=1 available="[10.7 GiB]"
time=2025-11-04T05:31:07.274-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=llama.vision.block_count default=0
time=2025-11-04T05:31:07.274-06:00 level=DEBUG source=ggml.go:611 msg="default cache size estimate" "attention MiB"=1792 "attention bytes"=1879048192 "recurrent MiB"=0 "recurrent bytes"=0
time=2025-11-04T05:31:07.274-06:00 level=INFO source=server.go:545 msg=offload library=Metal layers.requested=-1 layers.model=29 layers.offload=29 layers.split=[29] memory.available="[10.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="4.6 GiB" memory.required.partial="4.6 GiB" memory.required.kv="1.8 GiB" memory.required.allocations="[4.6 GiB]" memory.weights.total="1.9 GiB" memory.weights.repeating="1.6 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="824.0 MiB" memory.graph.partial="824.0 MiB"
time=2025-11-04T05:31:07.275-06:00 level=INFO source=runner.go:828 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:16384 KvCacheType: NumThreads:4 GPULayers:29[ID:0 Layers:29(0..28)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:true}"
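The second model then gets the whole GPU to itself: 28 transformer blocks plus the output layer = 29 layers, and the ~4.6 GiB estimate (≈ 1.9 GiB weights + 1.8 GiB KV + 824 MiB graph) fits in the 10.7 GiB available with room to spare. A toy version of that greedy fit (my simplification; ollama's real estimator in memory.go is more detailed):

```go
package main

import "fmt"

// layersToOffload greedily offloads layers while the running total
// stays within available VRAM (a simplification, not ollama's code).
func layersToOffload(modelLayers int, perLayer, fixed, available float64) int {
	n, used := 0, fixed
	for n < modelLayers && used+perLayer <= available {
		used += perLayer
		n++
	}
	return n
}

func main() {
	gib := float64(1 << 30)
	// From the offload line: ~1.9 GiB weights + 1.8 GiB KV across 29
	// layers, ~824 MiB graph as a fixed cost, 10.7 GiB available.
	perLayer := (1.9 + 1.8) * gib / 29
	n := layersToOffload(29, perLayer, 824*float64(1<<20), 10.7*gib)
	fmt.Println(n) // 29: everything fits, matching layers.offload=29
}
```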
time=2025-11-04T05:31:07.276-06:00 level=INFO source=server.go:1272 msg="waiting for llama runner to start responding"
llama_model_load_from_file_impl: using device Metal (Apple M2) (unknown id) - 10922 MiB free
time=2025-11-04T05:31:07.276-06:00 level=INFO source=server.go:1306 msg="waiting for server to become available" status="llm server loading model"
[trimmed: the second load prints the same llama_model_loader metadata dump, tensor counts, print_info, and init_tokenizer lines for this blob as the first load above, byte for byte; its control-token messages then repeat below]
load: control token: 128254 '<|reserved_special_token_246|>' is not marked as EOG
load: control token: 128252 '<|reserved_special_token_244|>' is not marked as EOG
load: control token: 128251 '<|reserved_special_token_243|>' is not marked as EOG
load: control token: 128250 '<|reserved_special_token_242|>' is not marked as EOG
load: control token: 128248 '<|reserved_special_token_240|>' is not marked as EOG
load: control token: 128247 '<|reserved_special_token_239|>' is not marked as EOG
load: control token: 128245 '<|reserved_special_token_237|>' is not marked as EOG
load: control token: 128244 '<|reserved_special_token_236|>' is not marked as EOG
load: control token: 128243 '<|reserved_special_token_235|>' is not marked as EOG
load: control token: 128240 '<|reserved_special_token_232|>' is not marked as EOG
load: control token: 128238 '<|reserved_special_token_230|>' is not marked as EOG
load: control token: 128235 '<|reserved_special_token_227|>' is not marked as EOG
load: control token: 128234 '<|reserved_special_token_226|>' is not marked as EOG
load: control token: 128229 '<|reserved_special_token_221|>' is not marked as EOG
load: control token: 128227 '<|reserved_special_token_219|>' is not marked as EOG
load: control token: 128226 '<|reserved_special_token_218|>' is not marked as EOG
load: control token: 128224 '<|reserved_special_token_216|>' is not marked as EOG
load: control token: 128223 '<|reserved_special_token_215|>' is not marked as EOG
load: control token: 128221 '<|reserved_special_token_213|>' is not marked as EOG
load: control token: 128219 '<|reserved_special_token_211|>' is not marked as EOG
load: control token: 128218 '<|reserved_special_token_210|>' is not marked as EOG
load: control token: 128217 '<|reserved_special_token_209|>' is not marked as EOG
load: control token: 128216 '<|reserved_special_token_208|>' is not marked as EOG
load: control token: 128215 '<|reserved_special_token_207|>' is not marked as EOG
load: control token: 128213 '<|reserved_special_token_205|>' is not marked as EOG
load: control token: 128211 '<|reserved_special_token_203|>' is not marked as EOG
load: control token: 128210 '<|reserved_special_token_202|>' is not marked as EOG
load: control token: 128209 '<|reserved_special_token_201|>' is not marked as EOG
load: control token: 128208 '<|reserved_special_token_200|>' is not marked as EOG
load: control token: 128207 '<|reserved_special_token_199|>' is not marked as EOG
load: control token: 128204 '<|reserved_special_token_196|>' is not marked as EOG
load: control token: 128202 '<|reserved_special_token_194|>' is not marked as EOG
load: control token: 128197 '<|reserved_special_token_189|>' is not marked as EOG
load: control token: 128195 '<|reserved_special_token_187|>' is not marked as EOG
load: control token: 128194 '<|reserved_special_token_186|>' is not marked as EOG
load: control token: 128191 '<|reserved_special_token_183|>' is not marked as EOG
load: control token: 128190 '<|reserved_special_token_182|>' is not marked as EOG
load: control token: 128188 '<|reserved_special_token_180|>' is not marked as EOG
load: control token: 128187 '<|reserved_special_token_179|>' is not marked as EOG
load: control token: 128185 '<|reserved_special_token_177|>' is not marked as EOG
load: control token: 128184 '<|reserved_special_token_176|>' is not marked as EOG
load: control token: 128183 '<|reserved_special_token_175|>' is not marked as EOG
load: control token: 128178 '<|reserved_special_token_170|>' is not marked as EOG
load: control token: 128177 '<|reserved_special_token_169|>' is not marked as EOG
load: control token: 128176 '<|reserved_special_token_168|>' is not marked as EOG
load: control token: 128175 '<|reserved_special_token_167|>' is not marked as EOG
load: control token: 128174 '<|reserved_special_token_166|>' is not marked as EOG
load: control token: 128173 '<|reserved_special_token_165|>' is not marked as EOG
load: control token: 128172 '<|reserved_special_token_164|>' is not marked as EOG
load: control token: 128169 '<|reserved_special_token_161|>' is not marked as EOG
load: control token: 128167 '<|reserved_special_token_159|>' is not marked as EOG
load: control token: 128166 '<|reserved_special_token_158|>' is not marked as EOG
load: control token: 128160 '<|reserved_special_token_152|>' is not marked as EOG
load: control token: 128159 '<|reserved_special_token_151|>' is not marked as EOG
load: control token: 128157 '<|reserved_special_token_149|>' is not marked as EOG
load: control token: 128156 '<|reserved_special_token_148|>' is not marked as EOG
load: control token: 128154 '<|reserved_special_token_146|>' is not marked as EOG
load: control token: 128152 '<|reserved_special_token_144|>' is not marked as EOG
load: control token: 128151 '<|reserved_special_token_143|>' is not marked as EOG
load: control token: 128150 '<|reserved_special_token_142|>' is not marked as EOG
load: control token: 128147 '<|reserved_special_token_139|>' is not marked as EOG
load: control token: 128144 '<|reserved_special_token_136|>' is not marked as EOG
load: control token: 128142 '<|reserved_special_token_134|>' is not marked as EOG
load: control token: 128141 '<|reserved_special_token_133|>' is not marked as EOG
load: control token: 128140 '<|reserved_special_token_132|>' is not marked as EOG
load: control token: 128133 '<|reserved_special_token_125|>' is not marked as EOG
load: control token: 128130 '<|reserved_special_token_122|>' is not marked as EOG
load: control token: 128128 '<|reserved_special_token_120|>' is not marked as EOG
load: control token: 128127 '<|reserved_special_token_119|>' is not marked as EOG
load: control token: 128126 '<|reserved_special_token_118|>' is not marked as EOG
load: control token: 128125 '<|reserved_special_token_117|>' is not marked as EOG
load: control token: 128124 '<|reserved_special_token_116|>' is not marked as EOG
load: control token: 128123 '<|reserved_special_token_115|>' is not marked as EOG
load: control token: 128122 '<|reserved_special_token_114|>' is not marked as EOG
load: control token: 128121 '<|reserved_special_token_113|>' is not marked as EOG
load: control token: 128120 '<|reserved_special_token_112|>' is not marked as EOG
load: control token: 128119 '<|reserved_special_token_111|>' is not marked as EOG
load: control token: 128116 '<|reserved_special_token_108|>' is not marked as EOG
load: control token: 128115 '<|reserved_special_token_107|>' is not marked as EOG
load: control token: 128114 '<|reserved_special_token_106|>' is not marked as EOG
load: control token: 128113 '<|reserved_special_token_105|>' is not marked as EOG
load: control token: 128111 '<|reserved_special_token_103|>' is not marked as EOG
load: control token: 128110 '<|reserved_special_token_102|>' is not marked as EOG
load: control token: 128107 '<|reserved_special_token_99|>' is not marked as EOG
load: control token: 128106 '<|reserved_special_token_98|>' is not marked as EOG
load: control token: 128105 '<|reserved_special_token_97|>' is not marked as EOG
load: control token: 128104 '<|reserved_special_token_96|>' is not marked as EOG
load: control token: 128103 '<|reserved_special_token_95|>' is not marked as EOG
load: control token: 128100 '<|reserved_special_token_92|>' is not marked as EOG
load: control token: 128097 '<|reserved_special_token_89|>' is not marked as EOG
load: control token: 128096 '<|reserved_special_token_88|>' is not marked as EOG
load: control token: 128094 '<|reserved_special_token_86|>' is not marked as EOG
load: control token: 128093 '<|reserved_special_token_85|>' is not marked as EOG
load: control token: 128090 '<|reserved_special_token_82|>' is not marked as EOG
load: control token: 128089 '<|reserved_special_token_81|>' is not marked as EOG
load: control token: 128087 '<|reserved_special_token_79|>' is not marked as EOG
load: control token: 128085 '<|reserved_special_token_77|>' is not marked as EOG
load: control token: 128080 '<|reserved_special_token_72|>' is not marked as EOG
load: control token: 128077 '<|reserved_special_token_69|>' is not marked as EOG
load: control token: 128076 '<|reserved_special_token_68|>' is not marked as EOG
load: control token: 128073 '<|reserved_special_token_65|>' is not marked as EOG
load: control token: 128070 '<|reserved_special_token_62|>' is not marked as EOG
load: control token: 128069 '<|reserved_special_token_61|>' is not marked as EOG
load: control token: 128067 '<|reserved_special_token_59|>' is not marked as EOG
load: control token: 128064 '<|reserved_special_token_56|>' is not marked as EOG
load: control token: 128062 '<|reserved_special_token_54|>' is not marked as EOG
load: control token: 128061 '<|reserved_special_token_53|>' is not marked as EOG
load: control token: 128060 '<|reserved_special_token_52|>' is not marked as EOG
load: control token: 128054 '<|reserved_special_token_46|>' is not marked as EOG
load: control token: 128045 '<|reserved_special_token_37|>' is not marked as EOG
load: control token: 128044 '<|reserved_special_token_36|>' is not marked as EOG
load: control token: 128043 '<|reserved_special_token_35|>' is not marked as EOG
load: control token: 128042 '<|reserved_special_token_34|>' is not marked as EOG
load: control token: 128038 '<|reserved_special_token_30|>' is not marked as EOG
load: control token: 128037 '<|reserved_special_token_29|>' is not marked as EOG
load: control token: 128035 '<|reserved_special_token_27|>' is not marked as EOG
load: control token: 128034 '<|reserved_special_token_26|>' is not marked as EOG
load: control token: 128033 '<|reserved_special_token_25|>' is not marked as EOG
load: control token: 128032 '<|reserved_special_token_24|>' is not marked as EOG
load: control token: 128030 '<|reserved_special_token_22|>' is not marked as EOG
load: control token: 128029 '<|reserved_special_token_21|>' is not marked as EOG
load: control token: 128028 '<|reserved_special_token_20|>' is not marked as EOG
load: control token: 128026 '<|reserved_special_token_18|>' is not marked as EOG
load: control token: 128025 '<|reserved_special_token_17|>' is not marked as EOG
load: control token: 128024 '<|reserved_special_token_16|>' is not marked as EOG
load: control token: 128022 '<|reserved_special_token_14|>' is not marked as EOG
load: control token: 128020 '<|reserved_special_token_12|>' is not marked as EOG
load: control token: 128017 '<|reserved_special_token_9|>' is not marked as EOG
load: control token: 128016 '<|reserved_special_token_8|>' is not marked as EOG
load: control token: 128015 '<|reserved_special_token_7|>' is not marked as EOG
load: control token: 128014 '<|reserved_special_token_6|>' is not marked as EOG
load: control token: 128013 '<|reserved_special_token_5|>' is not marked as EOG
load: control token: 128011 '<|reserved_special_token_3|>' is not marked as EOG
load: control token: 128010 '<|python_tag|>' is not marked as EOG
load: control token: 128006 '<|start_header_id|>' is not marked as EOG
load: control token: 128003 '<|reserved_special_token_1|>' is not marked as EOG
load: control token: 128002 '<|reserved_special_token_0|>' is not marked as EOG
load: control token: 128000 '<|begin_of_text|>' is not marked as EOG
load: control token: 128041 '<|reserved_special_token_33|>' is not marked as EOG
load: control token: 128063 '<|reserved_special_token_55|>' is not marked as EOG
load: control token: 128046 '<|reserved_special_token_38|>' is not marked as EOG
load: control token: 128007 '<|end_header_id|>' is not marked as EOG
load: control token: 128065 '<|reserved_special_token_57|>' is not marked as EOG
load: control token: 128171 '<|reserved_special_token_163|>' is not marked as EOG
load: control token: 128162 '<|reserved_special_token_154|>' is not marked as EOG
load: control token: 128165 '<|reserved_special_token_157|>' is not marked as EOG
load: control token: 128057 '<|reserved_special_token_49|>' is not marked as EOG
load: control token: 128050 '<|reserved_special_token_42|>' is not marked as EOG
load: control token: 128056 '<|reserved_special_token_48|>' is not marked as EOG
load: control token: 128230 '<|reserved_special_token_222|>' is not marked as EOG
load: control token: 128098 '<|reserved_special_token_90|>' is not marked as EOG
load: control token: 128153 '<|reserved_special_token_145|>' is not marked as EOG
load: control token: 128084 '<|reserved_special_token_76|>' is not marked as EOG
load: control token: 128082 '<|reserved_special_token_74|>' is not marked as EOG
load: control token: 128102 '<|reserved_special_token_94|>' is not marked as EOG
load: control token: 128253 '<|reserved_special_token_245|>' is not marked as EOG
load: control token: 128179 '<|reserved_special_token_171|>' is not marked as EOG
load: control token: 128071 '<|reserved_special_token_63|>' is not marked as EOG
load: control token: 128135 '<|reserved_special_token_127|>' is not marked as EOG
load: control token: 128161 '<|reserved_special_token_153|>' is not marked as EOG
load: control token: 128164 '<|reserved_special_token_156|>' is not marked as EOG
load: control token: 128134 '<|reserved_special_token_126|>' is not marked as EOG
load: control token: 128249 '<|reserved_special_token_241|>' is not marked as EOG
load: control token: 128004 '<|finetune_right_pad_id|>' is not marked as EOG
load: control token: 128036 '<|reserved_special_token_28|>' is not marked as EOG
load: control token: 128148 '<|reserved_special_token_140|>' is not marked as EOG
load: control token: 128181 '<|reserved_special_token_173|>' is not marked as EOG
load: control token: 128222 '<|reserved_special_token_214|>' is not marked as EOG
load: control token: 128075 '<|reserved_special_token_67|>' is not marked as EOG
load: control token: 128241 '<|reserved_special_token_233|>' is not marked as EOG
load: control token: 128051 '<|reserved_special_token_43|>' is not marked as EOG
load: control token: 128068 '<|reserved_special_token_60|>' is not marked as EOG
load: control token: 128149 '<|reserved_special_token_141|>' is not marked as EOG
load: control token: 128201 '<|reserved_special_token_193|>' is not marked as EOG
load: control token: 128058 '<|reserved_special_token_50|>' is not marked as EOG
load: control token: 128146 '<|reserved_special_token_138|>' is not marked as EOG
load: control token: 128143 '<|reserved_special_token_135|>' is not marked as EOG
load: control token: 128023 '<|reserved_special_token_15|>' is not marked as EOG
load: control token: 128039 '<|reserved_special_token_31|>' is not marked as EOG
load: control token: 128132 '<|reserved_special_token_124|>' is not marked as EOG
load: control token: 128101 '<|reserved_special_token_93|>' is not marked as EOG
load: control token: 128212 '<|reserved_special_token_204|>' is not marked as EOG
load: control token: 128189 '<|reserved_special_token_181|>' is not marked as EOG
load: control token: 128225 '<|reserved_special_token_217|>' is not marked as EOG
load: control token: 128129 '<|reserved_special_token_121|>' is not marked as EOG
load: control token: 128005 '<|reserved_special_token_2|>' is not marked as EOG
load: control token: 128078 '<|reserved_special_token_70|>' is not marked as EOG
load: control token: 128163 '<|reserved_special_token_155|>' is not marked as EOG
load: control token: 128072 '<|reserved_special_token_64|>' is not marked as EOG
load: control token: 128112 '<|reserved_special_token_104|>' is not marked as EOG
load: control token: 128186 '<|reserved_special_token_178|>' is not marked as EOG
load: control token: 128095 '<|reserved_special_token_87|>' is not marked as EOG
load: control token: 128109 '<|reserved_special_token_101|>' is not marked as EOG
load: control token: 128099 '<|reserved_special_token_91|>' is not marked as EOG
load: control token: 128138 '<|reserved_special_token_130|>' is not marked as EOG
load: control token: 128193 '<|reserved_special_token_185|>' is not marked as EOG
load: control token: 128199 '<|reserved_special_token_191|>' is not marked as EOG
load: control token: 128048 '<|reserved_special_token_40|>' is not marked as EOG
load: control token: 128088 '<|reserved_special_token_80|>' is not marked as EOG
load: control token: 128192 '<|reserved_special_token_184|>' is not marked as EOG
load: control token: 128136 '<|reserved_special_token_128|>' is not marked as EOG
load: control token: 128092 '<|reserved_special_token_84|>' is not marked as EOG
load: control token: 128158 '<|reserved_special_token_150|>' is not marked as EOG
load: control token: 128049 '<|reserved_special_token_41|>' is not marked as EOG
load: control token: 128031 '<|reserved_special_token_23|>' is not marked as EOG
load: control token: 128255 '<|reserved_special_token_247|>' is not marked as EOG
load: control token: 128182 '<|reserved_special_token_174|>' is not marked as EOG
load: control token: 128066 '<|reserved_special_token_58|>' is not marked as EOG
load: control token: 128180 '<|reserved_special_token_172|>' is not marked as EOG
load: control token: 128233 '<|reserved_special_token_225|>' is not marked as EOG
load: control token: 128079 '<|reserved_special_token_71|>' is not marked as EOG
load: control token: 128081 '<|reserved_special_token_73|>' is not marked as EOG
load: control token: 128231 '<|reserved_special_token_223|>' is not marked as EOG
load: control token: 128196 '<|reserved_special_token_188|>' is not marked as EOG
load: control token: 128047 '<|reserved_special_token_39|>' is not marked as EOG
load: control token: 128083 '<|reserved_special_token_75|>' is not marked as EOG
load: control token: 128139 '<|reserved_special_token_131|>' is not marked as EOG
load: control token: 128131 '<|reserved_special_token_123|>' is not marked as EOG
load: control token: 128118 '<|reserved_special_token_110|>' is not marked as EOG
load: control token: 128053 '<|reserved_special_token_45|>' is not marked as EOG
load: control token: 128220 '<|reserved_special_token_212|>' is not marked as EOG
load: control token: 128108 '<|reserved_special_token_100|>' is not marked as EOG
load: control token: 128091 '<|reserved_special_token_83|>' is not marked as EOG
load: control token: 128203 '<|reserved_special_token_195|>' is not marked as EOG
load: control token: 128059 '<|reserved_special_token_51|>' is not marked as EOG
load: control token: 128019 '<|reserved_special_token_11|>' is not marked as EOG
load: control token: 128170 '<|reserved_special_token_162|>' is not marked as EOG
load: control token: 128205 '<|reserved_special_token_197|>' is not marked as EOG
load: control token: 128040 '<|reserved_special_token_32|>' is not marked as EOG
load: control token: 128200 '<|reserved_special_token_192|>' is not marked as EOG
load: control token: 128236 '<|reserved_special_token_228|>' is not marked as EOG
load: control token: 128145 '<|reserved_special_token_137|>' is not marked as EOG
load: control token: 128168 '<|reserved_special_token_160|>' is not marked as EOG
load: control token: 128214 '<|reserved_special_token_206|>' is not marked as EOG
load: control token: 128137 '<|reserved_special_token_129|>' is not marked as EOG
load: control token: 128232 '<|reserved_special_token_224|>' is not marked as EOG
load: control token: 128239 '<|reserved_special_token_231|>' is not marked as EOG
load: control token: 128055 '<|reserved_special_token_47|>' is not marked as EOG
load: control token: 128228 '<|reserved_special_token_220|>' is not marked as EOG
load: control token: 128206 '<|reserved_special_token_198|>' is not marked as EOG
load: control token: 128018 '<|reserved_special_token_10|>' is not marked as EOG
load: control token: 128012 '<|reserved_special_token_4|>' is not marked as EOG
load: control token: 128198 '<|reserved_special_token_190|>' is not marked as EOG
load: control token: 128021 '<|reserved_special_token_13|>' is not marked as EOG
load: control token: 128086 '<|reserved_special_token_78|>' is not marked as EOG
load: control token: 128074 '<|reserved_special_token_66|>' is not marked as EOG
load: control token: 128027 '<|reserved_special_token_19|>' is not marked as EOG
load: control token: 128242 '<|reserved_special_token_234|>' is not marked as EOG
load: control token: 128155 '<|reserved_special_token_147|>' is not marked as EOG
load: control token: 128052 '<|reserved_special_token_44|>' is not marked as EOG
load: control token: 128246 '<|reserved_special_token_238|>' is not marked as EOG
load: control token: 128117 '<|reserved_special_token_109|>' is not marked as EOG
load: control token: 128237 '<|reserved_special_token_229|>' is not marked as EOG
load: printing all EOG tokens:
load:   - 128001 ('<|end_of_text|>')
load:   - 128008 ('<|eom_id|>')
load:   - 128009 ('<|eot_id|>')
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 3072
print_info: n_layer          = 28
print_info: n_head           = 24
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 3
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 8192
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: model type       = 3B
print_info: model params     = 3.21 B
print_info: general.name     = Llama 3.2 3B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128001 '<|end_of_text|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128001 '<|end_of_text|>'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer   0 assigned to device Metal, is_swa = 0
load_tensors: layer   1 assigned to device Metal, is_swa = 0
load_tensors: layer   2 assigned to device Metal, is_swa = 0
load_tensors: layer   3 assigned to device Metal, is_swa = 0
load_tensors: layer   4 assigned to device Metal, is_swa = 0
load_tensors: layer   5 assigned to device Metal, is_swa = 0
load_tensors: layer   6 assigned to device Metal, is_swa = 0
load_tensors: layer   7 assigned to device Metal, is_swa = 0
load_tensors: layer   8 assigned to device Metal, is_swa = 0
load_tensors: layer   9 assigned to device Metal, is_swa = 0
load_tensors: layer  10 assigned to device Metal, is_swa = 0
load_tensors: layer  11 assigned to device Metal, is_swa = 0
load_tensors: layer  12 assigned to device Metal, is_swa = 0
load_tensors: layer  13 assigned to device Metal, is_swa = 0
load_tensors: layer  14 assigned to device Metal, is_swa = 0
load_tensors: layer  15 assigned to device Metal, is_swa = 0
load_tensors: layer  16 assigned to device Metal, is_swa = 0
load_tensors: layer  17 assigned to device Metal, is_swa = 0
load_tensors: layer  18 assigned to device Metal, is_swa = 0
load_tensors: layer  19 assigned to device Metal, is_swa = 0
load_tensors: layer  20 assigned to device Metal, is_swa = 0
load_tensors: layer  21 assigned to device Metal, is_swa = 0
load_tensors: layer  22 assigned to device Metal, is_swa = 0
load_tensors: layer  23 assigned to device Metal, is_swa = 0
load_tensors: layer  24 assigned to device Metal, is_swa = 0
load_tensors: layer  25 assigned to device Metal, is_swa = 0
load_tensors: layer  26 assigned to device Metal, is_swa = 0
load_tensors: layer  27 assigned to device Metal, is_swa = 0
load_tensors: layer  28 assigned to device Metal, is_swa = 0
create_tensor: loading tensor token_embd.weight
create_tensor: loading tensor output_norm.weight
create_tensor: loading tensor token_embd.weight
create_tensor: loading tensor blk.0.attn_norm.weight
create_tensor: loading tensor blk.0.attn_q.weight
create_tensor: loading tensor blk.0.attn_k.weight
create_tensor: loading tensor blk.0.attn_v.weight
create_tensor: loading tensor blk.0.attn_output.weight
create_tensor: loading tensor blk.0.ffn_norm.weight
create_tensor: loading tensor rope_freqs.weight
create_tensor: loading tensor blk.0.ffn_gate.weight
create_tensor: loading tensor blk.0.ffn_down.weight
create_tensor: loading tensor blk.0.ffn_up.weight
create_tensor: loading tensor blk.1.attn_norm.weight
create_tensor: loading tensor blk.1.attn_q.weight
create_tensor: loading tensor blk.1.attn_k.weight
create_tensor: loading tensor blk.1.attn_v.weight
create_tensor: loading tensor blk.1.attn_output.weight
create_tensor: loading tensor blk.1.ffn_norm.weight
create_tensor: loading tensor blk.1.ffn_gate.weight
create_tensor: loading tensor blk.1.ffn_down.weight
create_tensor: loading tensor blk.1.ffn_up.weight
create_tensor: loading tensor blk.2.attn_norm.weight
create_tensor: loading tensor blk.2.attn_q.weight
create_tensor: loading tensor blk.2.attn_k.weight
create_tensor: loading tensor blk.2.attn_v.weight
create_tensor: loading tensor blk.2.attn_output.weight
create_tensor: loading tensor blk.2.ffn_norm.weight
create_tensor: loading tensor blk.2.ffn_gate.weight
create_tensor: loading tensor blk.2.ffn_down.weight
create_tensor: loading tensor blk.2.ffn_up.weight
create_tensor: loading tensor blk.3.attn_norm.weight
create_tensor: loading tensor blk.3.attn_q.weight
create_tensor: loading tensor blk.3.attn_k.weight
create_tensor: loading tensor blk.3.attn_v.weight
create_tensor: loading tensor blk.3.attn_output.weight
create_tensor: loading tensor blk.3.ffn_norm.weight
create_tensor: loading tensor blk.3.ffn_gate.weight
create_tensor: loading tensor blk.3.ffn_down.weight
create_tensor: loading tensor blk.3.ffn_up.weight
create_tensor: loading tensor blk.4.attn_norm.weight
create_tensor: loading tensor blk.4.attn_q.weight
create_tensor: loading tensor blk.4.attn_k.weight
create_tensor: loading tensor blk.4.attn_v.weight
create_tensor: loading tensor blk.4.attn_output.weight
create_tensor: loading tensor blk.4.ffn_norm.weight
create_tensor: loading tensor blk.4.ffn_gate.weight
create_tensor: loading tensor blk.4.ffn_down.weight
create_tensor: loading tensor blk.4.ffn_up.weight
create_tensor: loading tensor blk.5.attn_norm.weight
create_tensor: loading tensor blk.5.attn_q.weight
create_tensor: loading tensor blk.5.attn_k.weight
create_tensor: loading tensor blk.5.attn_v.weight
create_tensor: loading tensor blk.5.attn_output.weight
create_tensor: loading tensor blk.5.ffn_norm.weight
create_tensor: loading tensor blk.5.ffn_gate.weight
create_tensor: loading tensor blk.5.ffn_down.weight
create_tensor: loading tensor blk.5.ffn_up.weight
create_tensor: loading tensor blk.6.attn_norm.weight
create_tensor: loading tensor blk.6.attn_q.weight
create_tensor: loading tensor blk.6.attn_k.weight
create_tensor: loading tensor blk.6.attn_v.weight
create_tensor: loading tensor blk.6.attn_output.weight
create_tensor: loading tensor blk.6.ffn_norm.weight
create_tensor: loading tensor blk.6.ffn_gate.weight
create_tensor: loading tensor blk.6.ffn_down.weight
create_tensor: loading tensor blk.6.ffn_up.weight
create_tensor: loading tensor blk.7.attn_norm.weight
create_tensor: loading tensor blk.7.attn_q.weight
create_tensor: loading tensor blk.7.attn_k.weight
create_tensor: loading tensor blk.7.attn_v.weight
create_tensor: loading tensor blk.7.attn_output.weight
create_tensor: loading tensor blk.7.ffn_norm.weight
create_tensor: loading tensor blk.7.ffn_gate.weight
create_tensor: loading tensor blk.7.ffn_down.weight
create_tensor: loading tensor blk.7.ffn_up.weight
create_tensor: loading tensor blk.8.attn_norm.weight
create_tensor: loading tensor blk.8.attn_q.weight
create_tensor: loading tensor blk.8.attn_k.weight
create_tensor: loading tensor blk.8.attn_v.weight
create_tensor: loading tensor blk.8.attn_output.weight
create_tensor: loading tensor blk.8.ffn_norm.weight
create_tensor: loading tensor blk.8.ffn_gate.weight
create_tensor: loading tensor blk.8.ffn_down.weight
create_tensor: loading tensor blk.8.ffn_up.weight
create_tensor: loading tensor blk.9.attn_norm.weight
create_tensor: loading tensor blk.9.attn_q.weight
create_tensor: loading tensor blk.9.attn_k.weight
create_tensor: loading tensor blk.9.attn_v.weight
create_tensor: loading tensor blk.9.attn_output.weight
create_tensor: loading tensor blk.9.ffn_norm.weight
create_tensor: loading tensor blk.9.ffn_gate.weight
create_tensor: loading tensor blk.9.ffn_down.weight
create_tensor: loading tensor blk.9.ffn_up.weight
create_tensor: loading tensor blk.10.attn_norm.weight
create_tensor: loading tensor blk.10.attn_q.weight
create_tensor: loading tensor blk.10.attn_k.weight
create_tensor: loading tensor blk.10.attn_v.weight
create_tensor: loading tensor blk.10.attn_output.weight
create_tensor: loading tensor blk.10.ffn_norm.weight
create_tensor: loading tensor blk.10.ffn_gate.weight
create_tensor: loading tensor blk.10.ffn_down.weight
create_tensor: loading tensor blk.10.ffn_up.weight
create_tensor: loading tensor blk.11.attn_norm.weight
create_tensor: loading tensor blk.11.attn_q.weight
create_tensor: loading tensor blk.11.attn_k.weight
create_tensor: loading tensor blk.11.attn_v.weight
create_tensor: loading tensor blk.11.attn_output.weight
create_tensor: loading tensor blk.11.ffn_norm.weight
create_tensor: loading tensor blk.11.ffn_gate.weight
create_tensor: loading tensor blk.11.ffn_down.weight
create_tensor: loading tensor blk.11.ffn_up.weight
create_tensor: loading tensor blk.12.attn_norm.weight
create_tensor: loading tensor blk.12.attn_q.weight
create_tensor: loading tensor blk.12.attn_k.weight
create_tensor: loading tensor blk.12.attn_v.weight
create_tensor: loading tensor blk.12.attn_output.weight
create_tensor: loading tensor blk.12.ffn_norm.weight
create_tensor: loading tensor blk.12.ffn_gate.weight
create_tensor: loading tensor blk.12.ffn_down.weight
create_tensor: loading tensor blk.12.ffn_up.weight
create_tensor: loading tensor blk.13.attn_norm.weight
create_tensor: loading tensor blk.13.attn_q.weight
create_tensor: loading tensor blk.13.attn_k.weight
create_tensor: loading tensor blk.13.attn_v.weight
create_tensor: loading tensor blk.13.attn_output.weight
create_tensor: loading tensor blk.13.ffn_norm.weight
create_tensor: loading tensor blk.13.ffn_gate.weight
create_tensor: loading tensor blk.13.ffn_down.weight
create_tensor: loading tensor blk.13.ffn_up.weight
create_tensor: loading tensor blk.14.attn_norm.weight
create_tensor: loading tensor blk.14.attn_q.weight
create_tensor: loading tensor blk.14.attn_k.weight
create_tensor: loading tensor blk.14.attn_v.weight
create_tensor: loading tensor blk.14.attn_output.weight
create_tensor: loading tensor blk.14.ffn_norm.weight
create_tensor: loading tensor blk.14.ffn_gate.weight
create_tensor: loading tensor blk.14.ffn_down.weight
create_tensor: loading tensor blk.14.ffn_up.weight
create_tensor: loading tensor blk.15.attn_norm.weight
create_tensor: loading tensor blk.15.attn_q.weight
create_tensor: loading tensor blk.15.attn_k.weight
create_tensor: loading tensor blk.15.attn_v.weight
create_tensor: loading tensor blk.15.attn_output.weight
create_tensor: loading tensor blk.15.ffn_norm.weight
create_tensor: loading tensor blk.15.ffn_gate.weight
create_tensor: loading tensor blk.15.ffn_down.weight
create_tensor: loading tensor blk.15.ffn_up.weight
create_tensor: loading tensor blk.16.attn_norm.weight
create_tensor: loading tensor blk.16.attn_q.weight
create_tensor: loading tensor blk.16.attn_k.weight
create_tensor: loading tensor blk.16.attn_v.weight
create_tensor: loading tensor blk.16.attn_output.weight
create_tensor: loading tensor blk.16.ffn_norm.weight
create_tensor: loading tensor blk.16.ffn_gate.weight
create_tensor: loading tensor blk.16.ffn_down.weight
create_tensor: loading tensor blk.16.ffn_up.weight
create_tensor: loading tensor blk.17.attn_norm.weight
create_tensor: loading tensor blk.17.attn_q.weight
create_tensor: loading tensor blk.17.attn_k.weight
create_tensor: loading tensor blk.17.attn_v.weight
create_tensor: loading tensor blk.17.attn_output.weight
create_tensor: loading tensor blk.17.ffn_norm.weight
create_tensor: loading tensor blk.17.ffn_gate.weight
create_tensor: loading tensor blk.17.ffn_down.weight
create_tensor: loading tensor blk.17.ffn_up.weight
create_tensor: loading tensor blk.18.attn_norm.weight
create_tensor: loading tensor blk.18.attn_q.weight
create_tensor: loading tensor blk.18.attn_k.weight
create_tensor: loading tensor blk.18.attn_v.weight
create_tensor: loading tensor blk.18.attn_output.weight
create_tensor: loading tensor blk.18.ffn_norm.weight
create_tensor: loading tensor blk.18.ffn_gate.weight
create_tensor: loading tensor blk.18.ffn_down.weight
create_tensor: loading tensor blk.18.ffn_up.weight
create_tensor: loading tensor blk.19.attn_norm.weight
create_tensor: loading tensor blk.19.attn_q.weight
create_tensor: loading tensor blk.19.attn_k.weight
create_tensor: loading tensor blk.19.attn_v.weight
create_tensor: loading tensor blk.19.attn_output.weight
create_tensor: loading tensor blk.19.ffn_norm.weight
create_tensor: loading tensor blk.19.ffn_gate.weight
create_tensor: loading tensor blk.19.ffn_down.weight
create_tensor: loading tensor blk.19.ffn_up.weight
create_tensor: loading tensor blk.20.attn_norm.weight
create_tensor: loading tensor blk.20.attn_q.weight
create_tensor: loading tensor blk.20.attn_k.weight
create_tensor: loading tensor blk.20.attn_v.weight
create_tensor: loading tensor blk.20.attn_output.weight
create_tensor: loading tensor blk.20.ffn_norm.weight
create_tensor: loading tensor blk.20.ffn_gate.weight
create_tensor: loading tensor blk.20.ffn_down.weight
create_tensor: loading tensor blk.20.ffn_up.weight
create_tensor: loading tensor blk.21.attn_norm.weight
create_tensor: loading tensor blk.21.attn_q.weight
create_tensor: loading tensor blk.21.attn_k.weight
create_tensor: loading tensor blk.21.attn_v.weight
create_tensor: loading tensor blk.21.attn_output.weight
create_tensor: loading tensor blk.21.ffn_norm.weight
create_tensor: loading tensor blk.21.ffn_gate.weight
create_tensor: loading tensor blk.21.ffn_down.weight
create_tensor: loading tensor blk.21.ffn_up.weight
create_tensor: loading tensor blk.22.attn_norm.weight
create_tensor: loading tensor blk.22.attn_q.weight
create_tensor: loading tensor blk.22.attn_k.weight
create_tensor: loading tensor blk.22.attn_v.weight
create_tensor: loading tensor blk.22.attn_output.weight
create_tensor: loading tensor blk.22.ffn_norm.weight
create_tensor: loading tensor blk.22.ffn_gate.weight
create_tensor: loading tensor blk.22.ffn_down.weight
create_tensor: loading tensor blk.22.ffn_up.weight
create_tensor: loading tensor blk.23.attn_norm.weight
create_tensor: loading tensor blk.23.attn_q.weight
create_tensor: loading tensor blk.23.attn_k.weight
create_tensor: loading tensor blk.23.attn_v.weight
create_tensor: loading tensor blk.23.attn_output.weight
create_tensor: loading tensor blk.23.ffn_norm.weight
create_tensor: loading tensor blk.23.ffn_gate.weight
create_tensor: loading tensor blk.23.ffn_down.weight
create_tensor: loading tensor blk.23.ffn_up.weight
create_tensor: loading tensor blk.24.attn_norm.weight
create_tensor: loading tensor blk.24.attn_q.weight
create_tensor: loading tensor blk.24.attn_k.weight
create_tensor: loading tensor blk.24.attn_v.weight
create_tensor: loading tensor blk.24.attn_output.weight
create_tensor: loading tensor blk.24.ffn_norm.weight
create_tensor: loading tensor blk.24.ffn_gate.weight
create_tensor: loading tensor blk.24.ffn_down.weight
create_tensor: loading tensor blk.24.ffn_up.weight
create_tensor: loading tensor blk.25.attn_norm.weight
create_tensor: loading tensor blk.25.attn_q.weight
create_tensor: loading tensor blk.25.attn_k.weight
create_tensor: loading tensor blk.25.attn_v.weight
create_tensor: loading tensor blk.25.attn_output.weight
create_tensor: loading tensor blk.25.ffn_norm.weight
create_tensor: loading tensor blk.25.ffn_gate.weight
create_tensor: loading tensor blk.25.ffn_down.weight
create_tensor: loading tensor blk.25.ffn_up.weight
create_tensor: loading tensor blk.26.attn_norm.weight
create_tensor: loading tensor blk.26.attn_q.weight
create_tensor: loading tensor blk.26.attn_k.weight
create_tensor: loading tensor blk.26.attn_v.weight
create_tensor: loading tensor blk.26.attn_output.weight
create_tensor: loading tensor blk.26.ffn_norm.weight
create_tensor: loading tensor blk.26.ffn_gate.weight
create_tensor: loading tensor blk.26.ffn_down.weight
create_tensor: loading tensor blk.26.ffn_up.weight
create_tensor: loading tensor blk.27.attn_norm.weight
create_tensor: loading tensor blk.27.attn_q.weight
create_tensor: loading tensor blk.27.attn_k.weight
create_tensor: loading tensor blk.27.attn_v.weight
create_tensor: loading tensor blk.27.attn_output.weight
create_tensor: loading tensor blk.27.ffn_norm.weight
create_tensor: loading tensor blk.27.ffn_gate.weight
create_tensor: loading tensor blk.27.ffn_down.weight
create_tensor: loading tensor blk.27.ffn_up.weight
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   308.23 MiB
load_tensors: Metal_Mapped model buffer size =  1918.35 MiB
llama_init_from_model: model default pooling_type is [0], but [-1] was specified
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 16384
llama_context: n_ctx_per_seq = 16384
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = disabled
llama_context: kv_unified    = false
llama_context: freq_base     = 500000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: use bfloat         = true
ggml_metal_init: use fusion         = true
ggml_metal_init: use concurrency    = true
ggml_metal_init: use graph optimize = true
set_abort_callback: call
llama_context:        CPU  output buffer size =     0.50 MiB
create_memory: n_ctx = 16384 (padded)
llama_kv_cache: layer   0: dev = Metal
llama_kv_cache: layer   1: dev = Metal
llama_kv_cache: layer   2: dev = Metal
llama_kv_cache: layer   3: dev = Metal
llama_kv_cache: layer   4: dev = Metal
llama_kv_cache: layer   5: dev = Metal
llama_kv_cache: layer   6: dev = Metal
llama_kv_cache: layer   7: dev = Metal
llama_kv_cache: layer   8: dev = Metal
llama_kv_cache: layer   9: dev = Metal
llama_kv_cache: layer  10: dev = Metal
llama_kv_cache: layer  11: dev = Metal
llama_kv_cache: layer  12: dev = Metal
llama_kv_cache: layer  13: dev = Metal
llama_kv_cache: layer  14: dev = Metal
llama_kv_cache: layer  15: dev = Metal
llama_kv_cache: layer  16: dev = Metal
llama_kv_cache: layer  17: dev = Metal
llama_kv_cache: layer  18: dev = Metal
llama_kv_cache: layer  19: dev = Metal
llama_kv_cache: layer  20: dev = Metal
llama_kv_cache: layer  21: dev = Metal
llama_kv_cache: layer  22: dev = Metal
llama_kv_cache: layer  23: dev = Metal
llama_kv_cache: layer  24: dev = Metal
llama_kv_cache: layer  25: dev = Metal
llama_kv_cache: layer  26: dev = Metal
llama_kv_cache: layer  27: dev = Metal
llama_kv_cache:      Metal KV buffer size =  1792.00 MiB
time=2025-11-04T05:31:08.533-06:00 level=DEBUG source=server.go:1316 msg="model load progress 1.00"
llama_kv_cache: size = 1792.00 MiB ( 16384 cells,  28 layers,  1/1 seqs), K (f16):  896.00 MiB, V (f16):  896.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 3
llama_context: max_nodes = 2048
llama_context: reserving full memory module
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 1
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
llama_context:      Metal compute buffer size =   816.01 MiB
llama_context:        CPU compute buffer size =    42.01 MiB
llama_context: graph nodes  = 1014
llama_context: graph splits = 2
time=2025-11-04T05:31:08.785-06:00 level=INFO source=server.go:1310 msg="llama runner started in 45.82 seconds"
time=2025-11-04T05:31:08.785-06:00 level=INFO source=sched.go:482 msg="loaded runners" count=1
time=2025-11-04T05:31:08.785-06:00 level=DEBUG source=sched.go:587 msg="evaluating already loaded" model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff
time=2025-11-04T05:31:08.785-06:00 level=INFO source=server.go:1272 msg="waiting for llama runner to start responding"
time=2025-11-04T05:31:08.785-06:00 level=INFO source=server.go:1310 msg="llama runner started in 45.82 seconds"
time=2025-11-04T05:31:08.785-06:00 level=DEBUG source=sched.go:494 msg="finished setting up" runner.name=registry.ollama.ai/library/llama3.2:3b-instruct-q4_K_M runner.inference="[{ID:0 Library:Metal}]" runner.size="4.6 GiB" runner.vram="4.6 GiB" runner.parallel=1 runner.pid=20964 runner.model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff runner.num_ctx=16384
time=2025-11-04T05:31:08.786-06:00 level=DEBUG source=sched.go:587 msg="evaluating already loaded" model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff
time=2025-11-04T05:31:08.788-06:00 level=DEBUG source=server.go:1422 msg="completion request" images=0 prompt=1079 format=""
time=2025-11-04T05:31:08.789-06:00 level=DEBUG source=server.go:1422 msg="completion request" images=0 prompt=1079 format=""
time=2025-11-04T05:31:08.789-06:00 level=DEBUG source=server.go:1422 msg="completion request" images=0 prompt=1023 format=""
time=2025-11-04T05:31:08.793-06:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=229 used=0 remaining=229
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_rms_norm_mul_f32_4', name = 'kernel_rms_norm_mul_f32_4'
ggml_metal_library_compile_pipeline: loaded kernel_rms_norm_mul_f32_4                     0x11ce09d40 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mm_q4_K_f32', name = 'kernel_mul_mm_q4_K_f32_bci=0_bco=1'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mm_q4_K_f32_bci=0_bco=1            0x11ce0ac50 | th_max =  896 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mm_q6_K_f32', name = 'kernel_mul_mm_q6_K_f32_bci=0_bco=1'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mm_q6_K_f32_bci=0_bco=1            0x11ce0b200 | th_max =  896 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_rope_norm_f32', name = 'kernel_rope_norm_f32'
ggml_metal_library_compile_pipeline: loaded kernel_rope_norm_f32                          0x10d806d20 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_set_rows_f16_i64', name = 'kernel_set_rows_f16_i64'
ggml_metal_library_compile_pipeline: loaded kernel_set_rows_f16_i64                       0x10d806f80 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mm_f16_f32', name = 'kernel_mul_mm_f16_f32_bci=0_bco=1'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mm_f16_f32_bci=0_bco=1             0x10d807a10 | th_max =  896 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_soft_max_f32_4', name = 'kernel_soft_max_f32_4'
ggml_metal_library_compile_pipeline: loaded kernel_soft_max_f32_4                         0x10d807e80 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_cpy_f32_f32', name = 'kernel_cpy_f32_f32'
ggml_metal_library_compile_pipeline: loaded kernel_cpy_f32_f32                            0x10d808290 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_add_fuse_1', name = 'kernel_add_fuse_1'
ggml_metal_library_compile_pipeline: loaded kernel_add_fuse_1                             0x10d8088b0 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_swiglu_f32', name = 'kernel_swiglu_f32'
ggml_metal_library_compile_pipeline: loaded kernel_swiglu_f32                             0x10d809030 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_get_rows_f32', name = 'kernel_get_rows_f32'
ggml_metal_library_compile_pipeline: loaded kernel_get_rows_f32                           0x10d811900 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_add_row_c4_fuse_1', name = 'kernel_add_row_c4_fuse_1'
ggml_metal_library_compile_pipeline: loaded kernel_add_row_c4_fuse_1                      0x10d811cd0 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mv_q4_K_f32', name = 'kernel_mul_mv_q4_K_f32_nsg=2'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mv_q4_K_f32_nsg=2                  0x10d812d30 | th_max =  768 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mv_q6_K_f32', name = 'kernel_mul_mv_q6_K_f32_nsg=2'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mv_q6_K_f32_nsg=2                  0x10d813490 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mv_f16_f32_4', name = 'kernel_mul_mv_f16_f32_4_nsg=1'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mv_f16_f32_4_nsg=1                 0x10d815590 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mv_f16_f32_4', name = 'kernel_mul_mv_f16_f32_4_nsg=2'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mv_f16_f32_4_nsg=2                 0x11ce07420 | th_max = 1024 | th_width =   32
[GIN] 2025/11/04 - 05:31:09 | 200 | 50.409380041s |  100.116.113.16 | POST     "/v1/chat/completions"
time=2025-11-04T05:31:09.730-06:00 level=DEBUG source=sched.go:502 msg="context for request finished"
time=2025-11-04T05:31:09.730-06:00 level=DEBUG source=sched.go:312 msg="after processing request finished event" runner.name=registry.ollama.ai/library/llama3.2:3b-instruct-q4_K_M runner.inference="[{ID:0 Library:Metal}]" runner.size="4.6 GiB" runner.vram="4.6 GiB" runner.parallel=1 runner.pid=20964 runner.model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff runner.num_ctx=16384 refCount=2
time=2025-11-04T05:31:09.730-06:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=236 prompt=229 used=228 remaining=1
[GIN] 2025/11/04 - 05:31:09 | 200 |  2.663052667s |  100.116.113.16 | POST     "/v1/chat/completions"
time=2025-11-04T05:31:09.940-06:00 level=DEBUG source=sched.go:389 msg="context for request finished" runner.name=registry.ollama.ai/library/llama3.2:3b-instruct-q4_K_M runner.inference="[{ID:0 Library:Metal}]" runner.size="4.6 GiB" runner.vram="4.6 GiB" runner.parallel=1 runner.pid=20964 runner.model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff runner.num_ctx=16384
time=2025-11-04T05:31:09.940-06:00 level=DEBUG source=sched.go:312 msg="after processing request finished event" runner.name=registry.ollama.ai/library/llama3.2:3b-instruct-q4_K_M runner.inference="[{ID:0 Library:Metal}]" runner.size="4.6 GiB" runner.vram="4.6 GiB" runner.parallel=1 runner.pid=20964 runner.model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff runner.num_ctx=16384 refCount=1
time=2025-11-04T05:31:09.940-06:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=236 prompt=220 used=208 remaining=12
[GIN] 2025/11/04 - 05:31:10 | 200 | 50.888554209s |  100.116.113.16 | POST     "/v1/chat/completions"
time=2025-11-04T05:31:10.211-06:00 level=DEBUG source=sched.go:389 msg="context for request finished" runner.name=registry.ollama.ai/library/llama3.2:3b-instruct-q4_K_M runner.inference="[{ID:0 Library:Metal}]" runner.size="4.6 GiB" runner.vram="4.6 GiB" runner.parallel=1 runner.pid=20964 runner.model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff runner.num_ctx=16384
time=2025-11-04T05:31:10.211-06:00 level=DEBUG source=sched.go:294 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/llama3.2:3b-instruct-q4_K_M runner.inference="[{ID:0 Library:Metal}]" runner.size="4.6 GiB" runner.vram="4.6 GiB" runner.parallel=1 runner.pid=20964 runner.model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff runner.num_ctx=16384 duration=2562047h47m16.854775807s
time=2025-11-04T05:31:10.211-06:00 level=DEBUG source=sched.go:312 msg="after processing request finished event" runner.name=registry.ollama.ai/library/llama3.2:3b-instruct-q4_K_M runner.inference="[{ID:0 Library:Metal}]" runner.size="4.6 GiB" runner.vram="4.6 GiB" runner.parallel=1 runner.pid=20964 runner.model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff runner.num_ctx=16384 refCount=0
0.12.9 logs:
time=2025-11-04T05:37:14.982-06:00 level=INFO source=runner.go:76 msg="discovering available GPUs..."
time=2025-11-04T05:37:14.984-06:00 level=INFO source=server.go:400 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --ollama-engine --port 52548"
time=2025-11-04T05:37:14.984-06:00 level=DEBUG source=server.go:401 msg=subprocess OLLAMA_DEBUG=1 PATH="/Users/pjv/.local/bin:/Users/pjv/go/bin:/opt/homebrew/sbin:/opt/homebrew/bin:/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin" OLLAMA_HOST=0.0.0.0 OLLAMA_MODELS=/Users/pjv/.ollama/models OLLAMA_CONTEXT_LENGTH=16384 OLLAMA_KEEP_ALIVE=-1 DYLD_LIBRARY_PATH=/Applications/Ollama.app/Contents/Resources OLLAMA_LIBRARY_PATH=/Applications/Ollama.app/Contents/Resources
time=2025-11-04T05:37:15.101-06:00 level=DEBUG source=runner.go:471 msg="bootstrap discovery took" duration=118.973083ms OLLAMA_LIBRARY_PATH=[/Applications/Ollama.app/Contents/Resources] extra_envs=map[]
time=2025-11-04T05:37:15.101-06:00 level=DEBUG source=runner.go:120 msg="evluating which if any devices to filter out" initial_count=1
time=2025-11-04T05:37:15.101-06:00 level=DEBUG source=runner.go:41 msg="GPU bootstrap discovery took" duration=119.23475ms
time=2025-11-04T05:37:15.101-06:00 level=INFO source=types.go:42 msg="inference compute" id=0 filtered_id="" library=Metal compute=0.0 name=Metal description="Apple M2" libdirs="" driver=0.0 pci_id="" type=discrete total="10.7 GiB" available="10.7 GiB"
time=2025-11-04T05:37:15.101-06:00 level=INFO source=routes.go:1618 msg="entering low vram mode" "total vram"="10.7 GiB" threshold="20.0 GiB"
[GIN] 2025/11/04 - 05:37:15 | 200 |         110µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/11/04 - 05:37:15 | 200 |       239.5µs |       127.0.0.1 | GET      "/api/ps"
time=2025-11-04T05:37:34.851-06:00 level=DEBUG source=runner.go:41 msg="overall device VRAM discovery took" duration=8.542µs
time=2025-11-04T05:37:34.851-06:00 level=DEBUG source=sched.go:189 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=3 gpu_count=1
time=2025-11-04T05:37:34.859-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.alignment default=32
time=2025-11-04T05:37:34.860-06:00 level=DEBUG source=sched.go:204 msg="loading first model" model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a
time=2025-11-04T05:37:34.890-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.alignment default=32
time=2025-11-04T05:37:34.890-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.pooling_type default=0
time=2025-11-04T05:37:34.890-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=tokenizer.ggml.add_eos_token default=false
time=2025-11-04T05:37:34.890-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-11-04T05:37:34.890-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.rope.scaling.type default=""
time=2025-11-04T05:37:34.890-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.rope.scaling.factor default=1
time=2025-11-04T05:37:34.890-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.rope.scaling.original_context_length default=0
time=2025-11-04T05:37:34.890-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.expert_count default=0
time=2025-11-04T05:37:34.890-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.expert_used_count default=0
time=2025-11-04T05:37:34.890-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.norm_top_k_prob default=true
time=2025-11-04T05:37:34.890-06:00 level=INFO source=server.go:215 msg="enabling flash attention"
time=2025-11-04T05:37:34.891-06:00 level=INFO source=server.go:400 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --ollama-engine --model /Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a --port 52554"
time=2025-11-04T05:37:34.891-06:00 level=DEBUG source=server.go:401 msg=subprocess OLLAMA_DEBUG=1 PATH="/Users/pjv/.local/bin:/Users/pjv/go/bin:/opt/homebrew/sbin:/opt/homebrew/bin:/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin" OLLAMA_HOST=0.0.0.0 OLLAMA_MODELS=/Users/pjv/.ollama/models OLLAMA_CONTEXT_LENGTH=16384 OLLAMA_KEEP_ALIVE=-1 DYLD_LIBRARY_PATH=/Applications/Ollama.app/Contents/Resources OLLAMA_LIBRARY_PATH=/Applications/Ollama.app/Contents/Resources
time=2025-11-04T05:37:34.892-06:00 level=INFO source=server.go:653 msg="loading model" "model layers"=37 requested=-1
time=2025-11-04T05:37:34.892-06:00 level=INFO source=server.go:658 msg="system memory" total="16.0 GiB" free="9.8 GiB" free_swap="0 B"
time=2025-11-04T05:37:34.892-06:00 level=INFO source=server.go:665 msg="gpu memory" id=0 library=Metal available="10.2 GiB" free="10.7 GiB" minimum="512.0 MiB" overhead="0 B"
time=2025-11-04T05:37:34.900-06:00 level=INFO source=runner.go:1349 msg="starting ollama engine"
time=2025-11-04T05:37:34.900-06:00 level=INFO source=runner.go:1384 msg="Server listening on 127.0.0.1:52554"
time=2025-11-04T05:37:34.903-06:00 level=INFO source=runner.go:1222 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:28000 KvCacheType: NumThreads:4 GPULayers:37[ID:0 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-11-04T05:37:34.918-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.alignment default=32
time=2025-11-04T05:37:34.918-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.description default=""
time=2025-11-04T05:37:34.918-06:00 level=INFO source=ggml.go:136 msg="" architecture=qwen3 file_type=Q8_0 name="Qwen3 4B Instruct 2507" description="" num_tensors=398 num_key_values=33
time=2025-11-04T05:37:34.918-06:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/Applications/Ollama.app/Contents/Resources
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.006 sec
ggml_metal_device_init: GPU name:   Apple M2
ggml_metal_device_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 11453.25 MB
time=2025-11-04T05:37:34.919-06:00 level=INFO source=ggml.go:104 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 compiler=cgo(clang)
ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: use bfloat         = true
ggml_metal_init: use fusion         = true
ggml_metal_init: use concurrency    = true
ggml_metal_init: use graph optimize = true
time=2025-11-04T05:37:35.006-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.pooling_type default=0
time=2025-11-04T05:37:35.006-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=tokenizer.ggml.add_eos_token default=false
time=2025-11-04T05:37:35.006-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-11-04T05:37:35.006-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.rope.scaling.type default=""
time=2025-11-04T05:37:35.006-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.rope.scaling.factor default=1
time=2025-11-04T05:37:35.006-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.rope.scaling.original_context_length default=0
time=2025-11-04T05:37:35.006-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.expert_count default=0
time=2025-11-04T05:37:35.006-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.expert_used_count default=0
time=2025-11-04T05:37:35.006-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.norm_top_k_prob default=true
time=2025-11-04T05:37:35.033-06:00 level=DEBUG source=ggml.go:857 msg="compute graph" nodes=1231 splits=2
time=2025-11-04T05:37:35.035-06:00 level=DEBUG source=ggml.go:857 msg="compute graph" nodes=1231 splits=2
time=2025-11-04T05:37:35.035-06:00 level=DEBUG source=device.go:212 msg="model weights" device=Metal size="4.0 GiB"
time=2025-11-04T05:37:35.035-06:00 level=DEBUG source=device.go:217 msg="model weights" device=CPU size="394.1 MiB"
time=2025-11-04T05:37:35.035-06:00 level=DEBUG source=device.go:223 msg="kv cache" device=Metal size="3.9 GiB"
time=2025-11-04T05:37:35.035-06:00 level=DEBUG source=device.go:234 msg="compute graph" device=Metal size="131.5 MiB"
time=2025-11-04T05:37:35.035-06:00 level=DEBUG source=device.go:239 msg="compute graph" device=CPU size="5.0 MiB"
time=2025-11-04T05:37:35.035-06:00 level=DEBUG source=device.go:244 msg="total memory" size="8.4 GiB"
time=2025-11-04T05:37:35.035-06:00 level=DEBUG source=server.go:695 msg=memory success=true required.InputWeights=413265920 required.CPU.Graph=5242880 required.Metal.ID=0 required.Metal.Weights="[107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 413276160]" required.Metal.Cache="[115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 0]" required.Metal.Graph=137889824
time=2025-11-04T05:37:35.035-06:00 level=DEBUG source=server.go:892 msg="available gpu" id=0 library=Metal "available layer vram"="10.0 GiB" backoff=0.00 minimum="512.0 MiB" overhead="0 B" graph="131.5 MiB"
time=2025-11-04T05:37:35.035-06:00 level=DEBUG source=server.go:706 msg="new layout created" layers="37[ID:0 Layers:37(0..36)]"
time=2025-11-04T05:37:35.036-06:00 level=INFO source=runner.go:1222 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:28000 KvCacheType: NumThreads:4 GPULayers:37[ID:0 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-11-04T05:37:35.049-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.alignment default=32
time=2025-11-04T05:37:35.051-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.pooling_type default=0
time=2025-11-04T05:37:35.051-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=tokenizer.ggml.add_eos_token default=false
time=2025-11-04T05:37:35.051-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-11-04T05:37:35.051-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.rope.scaling.type default=""
time=2025-11-04T05:37:35.051-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.rope.scaling.factor default=1
time=2025-11-04T05:37:35.051-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.rope.scaling.original_context_length default=0
time=2025-11-04T05:37:35.051-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.expert_count default=0
time=2025-11-04T05:37:35.051-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.expert_used_count default=0
time=2025-11-04T05:37:35.051-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.norm_top_k_prob default=true
time=2025-11-04T05:37:35.326-06:00 level=DEBUG source=ggml.go:857 msg="compute graph" nodes=1231 splits=2
time=2025-11-04T05:37:35.329-06:00 level=DEBUG source=ggml.go:857 msg="compute graph" nodes=1231 splits=2
time=2025-11-04T05:37:35.329-06:00 level=DEBUG source=device.go:212 msg="model weights" device=Metal size="4.0 GiB"
time=2025-11-04T05:37:35.329-06:00 level=DEBUG source=device.go:217 msg="model weights" device=CPU size="394.1 MiB"
time=2025-11-04T05:37:35.329-06:00 level=DEBUG source=device.go:223 msg="kv cache" device=Metal size="3.9 GiB"
time=2025-11-04T05:37:35.329-06:00 level=DEBUG source=device.go:234 msg="compute graph" device=Metal size="131.5 MiB"
time=2025-11-04T05:37:35.329-06:00 level=DEBUG source=device.go:239 msg="compute graph" device=CPU size="5.0 MiB"
time=2025-11-04T05:37:35.329-06:00 level=DEBUG source=device.go:244 msg="total memory" size="8.4 GiB"
time=2025-11-04T05:37:35.329-06:00 level=DEBUG source=server.go:695 msg=memory success=true required.InputWeights=413265920 required.CPU.Graph=5242880 required.Metal.ID=0 required.Metal.Weights="[107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 413276160]" required.Metal.Cache="[115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 0]" required.Metal.Graph=137889824
time=2025-11-04T05:37:35.329-06:00 level=DEBUG source=server.go:892 msg="available gpu" id=0 library=Metal "available layer vram"="10.0 GiB" backoff=0.00 minimum="512.0 MiB" overhead="0 B" graph="131.5 MiB"
time=2025-11-04T05:37:35.329-06:00 level=DEBUG source=server.go:706 msg="new layout created" layers="37[ID:0 Layers:37(0..36)]"
time=2025-11-04T05:37:35.329-06:00 level=INFO source=runner.go:1222 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:28000 KvCacheType: NumThreads:4 GPULayers:37[ID:0 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
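
Aside: the two `msg=load request` lines above show the same layout submitted twice, first as `Operation:alloc` (a sizing pass that validates the 37-layer layout fits) and then as `Operation:commit` (the actual load). A rough Go sketch of the request shape, with field names read straight off the log line rather than from ollama's source, so treat it as illustrative only:

```go
package main

// Illustrative only: fields are transcribed from the "msg=load request"
// log lines above, not from ollama's actual source.
type loadRequest struct {
	Operation      string   // "alloc" = dry-run sizing pass, "commit" = real load
	LoraPath       []string // LoRA adapters, none here
	Parallel       int      // concurrent sequences (OLLAMA_NUM_PARALLEL)
	BatchSize      int      // prompt-processing batch size
	FlashAttention bool
	KvSize         int    // KV cache length in tokens (28000 for this model)
	KvCacheType    string // empty means the default f16 cache
	NumThreads     int
	GPULayers      string // layer-to-device assignment, e.g. "37[ID:0 Layers:37(0..36)]"
	MultiUserCache bool
	ProjectorPath  string // multimodal projector, unused here
	MainGPU        int
	UseMmap        bool
}
```
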
time=2025-11-04T05:37:35.329-06:00 level=INFO source=ggml.go:482 msg="offloading 36 repeating layers to GPU"
time=2025-11-04T05:37:35.329-06:00 level=INFO source=ggml.go:489 msg="offloading output layer to GPU"
time=2025-11-04T05:37:35.329-06:00 level=INFO source=ggml.go:494 msg="offloaded 37/37 layers to GPU"
time=2025-11-04T05:37:35.329-06:00 level=INFO source=device.go:212 msg="model weights" device=Metal size="4.0 GiB"
time=2025-11-04T05:37:35.329-06:00 level=INFO source=device.go:217 msg="model weights" device=CPU size="394.1 MiB"
time=2025-11-04T05:37:35.329-06:00 level=INFO source=device.go:223 msg="kv cache" device=Metal size="3.9 GiB"
time=2025-11-04T05:37:35.330-06:00 level=INFO source=device.go:234 msg="compute graph" device=Metal size="131.5 MiB"
time=2025-11-04T05:37:35.330-06:00 level=INFO source=device.go:239 msg="compute graph" device=CPU size="5.0 MiB"
time=2025-11-04T05:37:35.330-06:00 level=INFO source=device.go:244 msg="total memory" size="8.4 GiB"
time=2025-11-04T05:37:35.330-06:00 level=INFO source=sched.go:493 msg="loaded runners" count=1
time=2025-11-04T05:37:35.330-06:00 level=DEBUG source=runner.go:41 msg="overall device VRAM discovery took" duration=250ns
time=2025-11-04T05:37:35.330-06:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding"
time=2025-11-04T05:37:35.330-06:00 level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model"
time=2025-11-04T05:37:35.340-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.alignment default=32
time=2025-11-04T05:37:35.580-06:00 level=DEBUG source=server.go:1295 msg="model load progress 0.16"
time=2025-11-04T05:37:35.831-06:00 level=DEBUG source=server.go:1295 msg="model load progress 0.32"
time=2025-11-04T05:37:36.081-06:00 level=DEBUG source=server.go:1295 msg="model load progress 0.48"
time=2025-11-04T05:37:36.331-06:00 level=DEBUG source=server.go:1295 msg="model load progress 0.64"
time=2025-11-04T05:37:36.582-06:00 level=DEBUG source=server.go:1295 msg="model load progress 0.79"
time=2025-11-04T05:37:36.834-06:00 level=DEBUG source=server.go:1295 msg="model load progress 0.91"
time=2025-11-04T05:37:37.086-06:00 level=DEBUG source=server.go:1295 msg="model load progress 0.95"
time=2025-11-04T05:37:37.337-06:00 level=DEBUG source=server.go:1295 msg="model load progress 0.95"
time=2025-11-04T05:37:37.802-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.pooling_type default=0
time=2025-11-04T05:37:37.840-06:00 level=INFO source=server.go:1289 msg="llama runner started in 2.95 seconds"
time=2025-11-04T05:37:37.840-06:00 level=DEBUG source=sched.go:505 msg="finished setting up" runner.name=registry.ollama.ai/library/qwen3:4b-instruct-28k runner.inference="[{ID:0 Library:Metal}]" runner.size="8.4 GiB" runner.vram="8.4 GiB" runner.parallel=1 runner.pid=21138 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a runner.num_ctx=28000
time=2025-11-04T05:37:37.840-06:00 level=DEBUG source=sched.go:548 msg="gpu reported" gpu=0 library=Metal available="10.7 GiB"
time=2025-11-04T05:37:37.840-06:00 level=INFO source=sched.go:559 msg="updated VRAM based on existing loaded models" gpu=0 library=Metal total="10.7 GiB" available="2.7 GiB"
time=2025-11-04T05:37:37.885-06:00 level=DEBUG source=server.go:1401 msg="completion request" images=0 prompt=43962 format=""
time=2025-11-04T05:37:37.907-06:00 level=DEBUG source=cache.go:142 msg="loading cache slot" id=0 cache=0 prompt=10012 used=0 remaining=10012
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_rms_norm_mul_f32_4', name = 'kernel_rms_norm_mul_f32_4'
ggml_metal_library_compile_pipeline: loaded kernel_rms_norm_mul_f32_4                     0x14d70ca50 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mm_q8_0_f32', name = 'kernel_mul_mm_q8_0_f32_bci=0_bco=0'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mm_q8_0_f32_bci=0_bco=0            0x14d70d840 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_rope_neox_f32', name = 'kernel_rope_neox_f32'
ggml_metal_library_compile_pipeline: loaded kernel_rope_neox_f32                          0x14d70daa0 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_cpy_f32_f16', name = 'kernel_cpy_f32_f16'
ggml_metal_library_compile_pipeline: loaded kernel_cpy_f32_f16                            0x14d70dd00 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_flash_attn_ext_blk', name = 'kernel_flash_attn_ext_blk_nqptg=8_ncpsg=64'
ggml_metal_library_compile_pipeline: loaded kernel_flash_attn_ext_blk_nqptg=8_ncpsg=64      0x14d70e660 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_flash_attn_ext_f16_dk128_dv128', name = 'kernel_flash_attn_ext_f16_dk128_dv128_mask=1_sinks=0_bias=0_scap=0_kvpad=0_bcm=0_ns10=1024_ns20=1024_nsg=4'
ggml_metal_library_compile_pipeline: loaded kernel_flash_attn_ext_f16_dk128_dv128_mask=1_sinks=0_bias=0_scap=0_kvpad=0_bcm=0_ns10=1024_ns20=1024_nsg=4      0x153f048e0 | th_max =  768 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_add_fuse_1', name = 'kernel_add_fuse_1'
ggml_metal_library_compile_pipeline: loaded kernel_add_fuse_1                             0x153f043e0 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_swiglu_f32', name = 'kernel_swiglu_f32'
ggml_metal_library_compile_pipeline: loaded kernel_swiglu_f32                             0x155004380 | th_max = 1024 | th_width =   32
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.018 sec
ggml_metal_device_init: GPU name:   Apple M2
ggml_metal_device_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 11453.25 MB
llama_model_load_from_file_impl: using device Metal (Apple M2) (unknown id) - 10922 MiB free
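
For reference, `recommendedMaxWorkingSetSize = 11453.25 MB` is printed in SI megabytes: 11,453,250,000 bytes / 2^30 is about 10.67 GiB, which matches both the "10922 MiB free" figure above and the "10.7 GiB" total VRAM ollama reports for this M2. Metal's recommended working set, not physical RAM, appears to be what ollama treats as available VRAM here.
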
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   8:                          llama.block_count u32              = 28
llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
llama_model_loader: - kv  10:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  16:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  17:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  18:                          general.file_type u32              = 15
llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type q4_K:  168 tensors
llama_model_loader: - type q6_K:   29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 1.87 GiB (5.01 BPW)
init_tokenizer: initializing tokenizer for type 2
load: control token: 128254 '<|reserved_special_token_246|>' is not marked as EOG
load: control token: 128252 '<|reserved_special_token_244|>' is not marked as EOG
load: control token: 128251 '<|reserved_special_token_243|>' is not marked as EOG
load: control token: 128250 '<|reserved_special_token_242|>' is not marked as EOG
load: control token: 128248 '<|reserved_special_token_240|>' is not marked as EOG
load: control token: 128247 '<|reserved_special_token_239|>' is not marked as EOG
load: control token: 128245 '<|reserved_special_token_237|>' is not marked as EOG
load: control token: 128244 '<|reserved_special_token_236|>' is not marked as EOG
load: control token: 128243 '<|reserved_special_token_235|>' is not marked as EOG
load: control token: 128240 '<|reserved_special_token_232|>' is not marked as EOG
load: control token: 128238 '<|reserved_special_token_230|>' is not marked as EOG
load: control token: 128235 '<|reserved_special_token_227|>' is not marked as EOG
load: control token: 128234 '<|reserved_special_token_226|>' is not marked as EOG
load: control token: 128229 '<|reserved_special_token_221|>' is not marked as EOG
load: control token: 128227 '<|reserved_special_token_219|>' is not marked as EOG
load: control token: 128226 '<|reserved_special_token_218|>' is not marked as EOG
load: control token: 128224 '<|reserved_special_token_216|>' is not marked as EOG
load: control token: 128223 '<|reserved_special_token_215|>' is not marked as EOG
load: control token: 128221 '<|reserved_special_token_213|>' is not marked as EOG
load: control token: 128219 '<|reserved_special_token_211|>' is not marked as EOG
load: control token: 128218 '<|reserved_special_token_210|>' is not marked as EOG
load: control token: 128217 '<|reserved_special_token_209|>' is not marked as EOG
load: control token: 128216 '<|reserved_special_token_208|>' is not marked as EOG
load: control token: 128215 '<|reserved_special_token_207|>' is not marked as EOG
load: control token: 128213 '<|reserved_special_token_205|>' is not marked as EOG
load: control token: 128211 '<|reserved_special_token_203|>' is not marked as EOG
load: control token: 128210 '<|reserved_special_token_202|>' is not marked as EOG
load: control token: 128209 '<|reserved_special_token_201|>' is not marked as EOG
load: control token: 128208 '<|reserved_special_token_200|>' is not marked as EOG
load: control token: 128207 '<|reserved_special_token_199|>' is not marked as EOG
load: control token: 128204 '<|reserved_special_token_196|>' is not marked as EOG
load: control token: 128202 '<|reserved_special_token_194|>' is not marked as EOG
load: control token: 128197 '<|reserved_special_token_189|>' is not marked as EOG
load: control token: 128195 '<|reserved_special_token_187|>' is not marked as EOG
load: control token: 128194 '<|reserved_special_token_186|>' is not marked as EOG
load: control token: 128191 '<|reserved_special_token_183|>' is not marked as EOG
load: control token: 128190 '<|reserved_special_token_182|>' is not marked as EOG
load: control token: 128188 '<|reserved_special_token_180|>' is not marked as EOG
load: control token: 128187 '<|reserved_special_token_179|>' is not marked as EOG
load: control token: 128185 '<|reserved_special_token_177|>' is not marked as EOG
load: control token: 128184 '<|reserved_special_token_176|>' is not marked as EOG
load: control token: 128183 '<|reserved_special_token_175|>' is not marked as EOG
load: control token: 128178 '<|reserved_special_token_170|>' is not marked as EOG
load: control token: 128177 '<|reserved_special_token_169|>' is not marked as EOG
load: control token: 128176 '<|reserved_special_token_168|>' is not marked as EOG
load: control token: 128175 '<|reserved_special_token_167|>' is not marked as EOG
load: control token: 128174 '<|reserved_special_token_166|>' is not marked as EOG
load: control token: 128173 '<|reserved_special_token_165|>' is not marked as EOG
load: control token: 128172 '<|reserved_special_token_164|>' is not marked as EOG
load: control token: 128169 '<|reserved_special_token_161|>' is not marked as EOG
load: control token: 128167 '<|reserved_special_token_159|>' is not marked as EOG
load: control token: 128166 '<|reserved_special_token_158|>' is not marked as EOG
load: control token: 128160 '<|reserved_special_token_152|>' is not marked as EOG
load: control token: 128159 '<|reserved_special_token_151|>' is not marked as EOG
load: control token: 128157 '<|reserved_special_token_149|>' is not marked as EOG
load: control token: 128156 '<|reserved_special_token_148|>' is not marked as EOG
load: control token: 128154 '<|reserved_special_token_146|>' is not marked as EOG
load: control token: 128152 '<|reserved_special_token_144|>' is not marked as EOG
load: control token: 128151 '<|reserved_special_token_143|>' is not marked as EOG
load: control token: 128150 '<|reserved_special_token_142|>' is not marked as EOG
load: control token: 128147 '<|reserved_special_token_139|>' is not marked as EOG
load: control token: 128144 '<|reserved_special_token_136|>' is not marked as EOG
load: control token: 128142 '<|reserved_special_token_134|>' is not marked as EOG
load: control token: 128141 '<|reserved_special_token_133|>' is not marked as EOG
load: control token: 128140 '<|reserved_special_token_132|>' is not marked as EOG
load: control token: 128133 '<|reserved_special_token_125|>' is not marked as EOG
load: control token: 128130 '<|reserved_special_token_122|>' is not marked as EOG
load: control token: 128128 '<|reserved_special_token_120|>' is not marked as EOG
load: control token: 128127 '<|reserved_special_token_119|>' is not marked as EOG
load: control token: 128126 '<|reserved_special_token_118|>' is not marked as EOG
load: control token: 128125 '<|reserved_special_token_117|>' is not marked as EOG
load: control token: 128124 '<|reserved_special_token_116|>' is not marked as EOG
load: control token: 128123 '<|reserved_special_token_115|>' is not marked as EOG
load: control token: 128122 '<|reserved_special_token_114|>' is not marked as EOG
load: control token: 128121 '<|reserved_special_token_113|>' is not marked as EOG
load: control token: 128120 '<|reserved_special_token_112|>' is not marked as EOG
load: control token: 128119 '<|reserved_special_token_111|>' is not marked as EOG
load: control token: 128116 '<|reserved_special_token_108|>' is not marked as EOG
load: control token: 128115 '<|reserved_special_token_107|>' is not marked as EOG
load: control token: 128114 '<|reserved_special_token_106|>' is not marked as EOG
load: control token: 128113 '<|reserved_special_token_105|>' is not marked as EOG
load: control token: 128111 '<|reserved_special_token_103|>' is not marked as EOG
load: control token: 128110 '<|reserved_special_token_102|>' is not marked as EOG
load: control token: 128107 '<|reserved_special_token_99|>' is not marked as EOG
load: control token: 128106 '<|reserved_special_token_98|>' is not marked as EOG
load: control token: 128105 '<|reserved_special_token_97|>' is not marked as EOG
load: control token: 128104 '<|reserved_special_token_96|>' is not marked as EOG
load: control token: 128103 '<|reserved_special_token_95|>' is not marked as EOG
load: control token: 128100 '<|reserved_special_token_92|>' is not marked as EOG
load: control token: 128097 '<|reserved_special_token_89|>' is not marked as EOG
load: control token: 128096 '<|reserved_special_token_88|>' is not marked as EOG
load: control token: 128094 '<|reserved_special_token_86|>' is not marked as EOG
load: control token: 128093 '<|reserved_special_token_85|>' is not marked as EOG
load: control token: 128090 '<|reserved_special_token_82|>' is not marked as EOG
load: control token: 128089 '<|reserved_special_token_81|>' is not marked as EOG
load: control token: 128087 '<|reserved_special_token_79|>' is not marked as EOG
load: control token: 128085 '<|reserved_special_token_77|>' is not marked as EOG
load: control token: 128080 '<|reserved_special_token_72|>' is not marked as EOG
load: control token: 128077 '<|reserved_special_token_69|>' is not marked as EOG
load: control token: 128076 '<|reserved_special_token_68|>' is not marked as EOG
load: control token: 128073 '<|reserved_special_token_65|>' is not marked as EOG
load: control token: 128070 '<|reserved_special_token_62|>' is not marked as EOG
load: control token: 128069 '<|reserved_special_token_61|>' is not marked as EOG
load: control token: 128067 '<|reserved_special_token_59|>' is not marked as EOG
load: control token: 128064 '<|reserved_special_token_56|>' is not marked as EOG
load: control token: 128062 '<|reserved_special_token_54|>' is not marked as EOG
load: control token: 128061 '<|reserved_special_token_53|>' is not marked as EOG
load: control token: 128060 '<|reserved_special_token_52|>' is not marked as EOG
load: control token: 128054 '<|reserved_special_token_46|>' is not marked as EOG
load: control token: 128045 '<|reserved_special_token_37|>' is not marked as EOG
load: control token: 128044 '<|reserved_special_token_36|>' is not marked as EOG
load: control token: 128043 '<|reserved_special_token_35|>' is not marked as EOG
load: control token: 128042 '<|reserved_special_token_34|>' is not marked as EOG
load: control token: 128038 '<|reserved_special_token_30|>' is not marked as EOG
load: control token: 128037 '<|reserved_special_token_29|>' is not marked as EOG
load: control token: 128035 '<|reserved_special_token_27|>' is not marked as EOG
load: control token: 128034 '<|reserved_special_token_26|>' is not marked as EOG
load: control token: 128033 '<|reserved_special_token_25|>' is not marked as EOG
load: control token: 128032 '<|reserved_special_token_24|>' is not marked as EOG
load: control token: 128030 '<|reserved_special_token_22|>' is not marked as EOG
load: control token: 128029 '<|reserved_special_token_21|>' is not marked as EOG
load: control token: 128028 '<|reserved_special_token_20|>' is not marked as EOG
load: control token: 128026 '<|reserved_special_token_18|>' is not marked as EOG
load: control token: 128025 '<|reserved_special_token_17|>' is not marked as EOG
load: control token: 128024 '<|reserved_special_token_16|>' is not marked as EOG
load: control token: 128022 '<|reserved_special_token_14|>' is not marked as EOG
load: control token: 128020 '<|reserved_special_token_12|>' is not marked as EOG
load: control token: 128017 '<|reserved_special_token_9|>' is not marked as EOG
load: control token: 128016 '<|reserved_special_token_8|>' is not marked as EOG
load: control token: 128015 '<|reserved_special_token_7|>' is not marked as EOG
load: control token: 128014 '<|reserved_special_token_6|>' is not marked as EOG
load: control token: 128013 '<|reserved_special_token_5|>' is not marked as EOG
load: control token: 128011 '<|reserved_special_token_3|>' is not marked as EOG
load: control token: 128010 '<|python_tag|>' is not marked as EOG
load: control token: 128006 '<|start_header_id|>' is not marked as EOG
load: control token: 128003 '<|reserved_special_token_1|>' is not marked as EOG
load: control token: 128002 '<|reserved_special_token_0|>' is not marked as EOG
load: control token: 128000 '<|begin_of_text|>' is not marked as EOG
load: control token: 128041 '<|reserved_special_token_33|>' is not marked as EOG
load: control token: 128063 '<|reserved_special_token_55|>' is not marked as EOG
load: control token: 128046 '<|reserved_special_token_38|>' is not marked as EOG
load: control token: 128007 '<|end_header_id|>' is not marked as EOG
load: control token: 128065 '<|reserved_special_token_57|>' is not marked as EOG
load: control token: 128171 '<|reserved_special_token_163|>' is not marked as EOG
load: control token: 128162 '<|reserved_special_token_154|>' is not marked as EOG
load: control token: 128165 '<|reserved_special_token_157|>' is not marked as EOG
load: control token: 128057 '<|reserved_special_token_49|>' is not marked as EOG
load: control token: 128050 '<|reserved_special_token_42|>' is not marked as EOG
load: control token: 128056 '<|reserved_special_token_48|>' is not marked as EOG
load: control token: 128230 '<|reserved_special_token_222|>' is not marked as EOG
load: control token: 128098 '<|reserved_special_token_90|>' is not marked as EOG
load: control token: 128153 '<|reserved_special_token_145|>' is not marked as EOG
load: control token: 128084 '<|reserved_special_token_76|>' is not marked as EOG
load: control token: 128082 '<|reserved_special_token_74|>' is not marked as EOG
load: control token: 128102 '<|reserved_special_token_94|>' is not marked as EOG
load: control token: 128253 '<|reserved_special_token_245|>' is not marked as EOG
load: control token: 128179 '<|reserved_special_token_171|>' is not marked as EOG
load: control token: 128071 '<|reserved_special_token_63|>' is not marked as EOG
load: control token: 128135 '<|reserved_special_token_127|>' is not marked as EOG
load: control token: 128161 '<|reserved_special_token_153|>' is not marked as EOG
load: control token: 128164 '<|reserved_special_token_156|>' is not marked as EOG
load: control token: 128134 '<|reserved_special_token_126|>' is not marked as EOG
load: control token: 128249 '<|reserved_special_token_241|>' is not marked as EOG
load: control token: 128004 '<|finetune_right_pad_id|>' is not marked as EOG
load: control token: 128036 '<|reserved_special_token_28|>' is not marked as EOG
load: control token: 128148 '<|reserved_special_token_140|>' is not marked as EOG
load: control token: 128181 '<|reserved_special_token_173|>' is not marked as EOG
load: control token: 128222 '<|reserved_special_token_214|>' is not marked as EOG
load: control token: 128075 '<|reserved_special_token_67|>' is not marked as EOG
load: control token: 128241 '<|reserved_special_token_233|>' is not marked as EOG
load: control token: 128051 '<|reserved_special_token_43|>' is not marked as EOG
load: control token: 128068 '<|reserved_special_token_60|>' is not marked as EOG
load: control token: 128149 '<|reserved_special_token_141|>' is not marked as EOG
load: control token: 128201 '<|reserved_special_token_193|>' is not marked as EOG
load: control token: 128058 '<|reserved_special_token_50|>' is not marked as EOG
load: control token: 128146 '<|reserved_special_token_138|>' is not marked as EOG
load: control token: 128143 '<|reserved_special_token_135|>' is not marked as EOG
load: control token: 128023 '<|reserved_special_token_15|>' is not marked as EOG
load: control token: 128039 '<|reserved_special_token_31|>' is not marked as EOG
load: control token: 128132 '<|reserved_special_token_124|>' is not marked as EOG
load: control token: 128101 '<|reserved_special_token_93|>' is not marked as EOG
load: control token: 128212 '<|reserved_special_token_204|>' is not marked as EOG
load: control token: 128189 '<|reserved_special_token_181|>' is not marked as EOG
load: control token: 128225 '<|reserved_special_token_217|>' is not marked as EOG
load: control token: 128129 '<|reserved_special_token_121|>' is not marked as EOG
load: control token: 128005 '<|reserved_special_token_2|>' is not marked as EOG
load: control token: 128078 '<|reserved_special_token_70|>' is not marked as EOG
load: control token: 128163 '<|reserved_special_token_155|>' is not marked as EOG
load: control token: 128072 '<|reserved_special_token_64|>' is not marked as EOG
load: control token: 128112 '<|reserved_special_token_104|>' is not marked as EOG
load: control token: 128186 '<|reserved_special_token_178|>' is not marked as EOG
load: control token: 128095 '<|reserved_special_token_87|>' is not marked as EOG
load: control token: 128109 '<|reserved_special_token_101|>' is not marked as EOG
load: control token: 128099 '<|reserved_special_token_91|>' is not marked as EOG
load: control token: 128138 '<|reserved_special_token_130|>' is not marked as EOG
load: control token: 128193 '<|reserved_special_token_185|>' is not marked as EOG
load: control token: 128199 '<|reserved_special_token_191|>' is not marked as EOG
load: control token: 128048 '<|reserved_special_token_40|>' is not marked as EOG
load: control token: 128088 '<|reserved_special_token_80|>' is not marked as EOG
load: control token: 128192 '<|reserved_special_token_184|>' is not marked as EOG
load: control token: 128136 '<|reserved_special_token_128|>' is not marked as EOG
load: control token: 128092 '<|reserved_special_token_84|>' is not marked as EOG
load: control token: 128158 '<|reserved_special_token_150|>' is not marked as EOG
load: control token: 128049 '<|reserved_special_token_41|>' is not marked as EOG
load: control token: 128031 '<|reserved_special_token_23|>' is not marked as EOG
load: control token: 128255 '<|reserved_special_token_247|>' is not marked as EOG
load: control token: 128182 '<|reserved_special_token_174|>' is not marked as EOG
load: control token: 128066 '<|reserved_special_token_58|>' is not marked as EOG
load: control token: 128180 '<|reserved_special_token_172|>' is not marked as EOG
load: control token: 128233 '<|reserved_special_token_225|>' is not marked as EOG
load: control token: 128079 '<|reserved_special_token_71|>' is not marked as EOG
load: control token: 128081 '<|reserved_special_token_73|>' is not marked as EOG
load: control token: 128231 '<|reserved_special_token_223|>' is not marked as EOG
load: control token: 128196 '<|reserved_special_token_188|>' is not marked as EOG
load: control token: 128047 '<|reserved_special_token_39|>' is not marked as EOG
load: control token: 128083 '<|reserved_special_token_75|>' is not marked as EOG
load: control token: 128139 '<|reserved_special_token_131|>' is not marked as EOG
load: control token: 128131 '<|reserved_special_token_123|>' is not marked as EOG
load: control token: 128118 '<|reserved_special_token_110|>' is not marked as EOG
load: control token: 128053 '<|reserved_special_token_45|>' is not marked as EOG
load: control token: 128220 '<|reserved_special_token_212|>' is not marked as EOG
load: control token: 128108 '<|reserved_special_token_100|>' is not marked as EOG
load: control token: 128091 '<|reserved_special_token_83|>' is not marked as EOG
load: control token: 128203 '<|reserved_special_token_195|>' is not marked as EOG
load: control token: 128059 '<|reserved_special_token_51|>' is not marked as EOG
load: control token: 128019 '<|reserved_special_token_11|>' is not marked as EOG
load: control token: 128170 '<|reserved_special_token_162|>' is not marked as EOG
load: control token: 128205 '<|reserved_special_token_197|>' is not marked as EOG
load: control token: 128040 '<|reserved_special_token_32|>' is not marked as EOG
load: control token: 128200 '<|reserved_special_token_192|>' is not marked as EOG
load: control token: 128236 '<|reserved_special_token_228|>' is not marked as EOG
load: control token: 128145 '<|reserved_special_token_137|>' is not marked as EOG
load: control token: 128168 '<|reserved_special_token_160|>' is not marked as EOG
load: control token: 128214 '<|reserved_special_token_206|>' is not marked as EOG
load: control token: 128137 '<|reserved_special_token_129|>' is not marked as EOG
load: control token: 128232 '<|reserved_special_token_224|>' is not marked as EOG
load: control token: 128239 '<|reserved_special_token_231|>' is not marked as EOG
load: control token: 128055 '<|reserved_special_token_47|>' is not marked as EOG
load: control token: 128228 '<|reserved_special_token_220|>' is not marked as EOG
load: control token: 128206 '<|reserved_special_token_198|>' is not marked as EOG
load: control token: 128018 '<|reserved_special_token_10|>' is not marked as EOG
load: control token: 128012 '<|reserved_special_token_4|>' is not marked as EOG
load: control token: 128198 '<|reserved_special_token_190|>' is not marked as EOG
load: control token: 128021 '<|reserved_special_token_13|>' is not marked as EOG
load: control token: 128086 '<|reserved_special_token_78|>' is not marked as EOG
load: control token: 128074 '<|reserved_special_token_66|>' is not marked as EOG
load: control token: 128027 '<|reserved_special_token_19|>' is not marked as EOG
load: control token: 128242 '<|reserved_special_token_234|>' is not marked as EOG
load: control token: 128155 '<|reserved_special_token_147|>' is not marked as EOG
load: control token: 128052 '<|reserved_special_token_44|>' is not marked as EOG
load: control token: 128246 '<|reserved_special_token_238|>' is not marked as EOG
load: control token: 128117 '<|reserved_special_token_109|>' is not marked as EOG
load: control token: 128237 '<|reserved_special_token_229|>' is not marked as EOG
load: printing all EOG tokens:
load:   - 128001 ('<|end_of_text|>')
load:   - 128008 ('<|eom_id|>')
load:   - 128009 ('<|eot_id|>')
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = llama
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 3.21 B
print_info: general.name     = Llama 3.2 3B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128001 '<|end_of_text|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128001 '<|end_of_text|>'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-11-04T05:37:38.158-06:00 level=INFO source=server.go:400 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --model /Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --port 52557"
time=2025-11-04T05:37:38.158-06:00 level=DEBUG source=server.go:401 msg=subprocess OLLAMA_DEBUG=1 PATH="/Users/pjv/.local/bin:/Users/pjv/go/bin:/opt/homebrew/sbin:/opt/homebrew/bin:/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin" OLLAMA_HOST=0.0.0.0 OLLAMA_MODELS=/Users/pjv/.ollama/models OLLAMA_CONTEXT_LENGTH=16384 OLLAMA_KEEP_ALIVE=-1 DYLD_LIBRARY_PATH=/Applications/Ollama.app/Contents/Resources OLLAMA_LIBRARY_PATH=/Applications/Ollama.app/Contents/Resources
time=2025-11-04T05:37:38.163-06:00 level=INFO source=server.go:470 msg="system memory" total="16.0 GiB" free="6.6 GiB" free_swap="0 B"
time=2025-11-04T05:37:38.163-06:00 level=DEBUG source=memory.go:198 msg=evaluating library=Metal gpu_count=1 available="[2.7 GiB]"
time=2025-11-04T05:37:38.163-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=llama.vision.block_count default=0
time=2025-11-04T05:37:38.163-06:00 level=DEBUG source=ggml.go:611 msg="default cache size estimate" "attention MiB"=1792 "attention bytes"=1879048192 "recurrent MiB"=0 "recurrent bytes"=0
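
The "default cache size estimate" above is easy to reproduce from the GGUF metadata dumped elsewhere in this log. The standard f16 KV-cache arithmetic (my calculation, not necessarily ollama's exact code path) lands on the same 1,879,048,192 bytes:

```go
package main

import "fmt"

func main() {
	// Constants read off this log: llama.block_count, head_count_kv, and
	// key/value_length from the llama_model_loader dump, context size from
	// OLLAMA_CONTEXT_LENGTH=16384 in the subprocess environment.
	const (
		blockCount  = 28
		headCountKV = 8
		headDim     = 128
		ctx         = 16384
		f16Bytes    = 2
	)
	// K and V tensors per layer, f16 (KvCacheType is empty, i.e. default).
	kv := 2 * blockCount * headCountKV * headDim * ctx * f16Bytes
	fmt.Printf("%d bytes = %d MiB\n", kv, kv/(1024*1024))
	// Output: 1879048192 bytes = 1792 MiB, matching "attention bytes" above.
}
```
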
time=2025-11-04T05:37:38.164-06:00 level=INFO source=server.go:483 msg="model requires more memory than is currently available, evicting a model to make space" estimate.library="" estimate.layers.requested=0 estimate.layers.model=0 estimate.layers.offload=0 estimate.layers.split=[] estimate.memory.available=[] estimate.memory.gpu_overhead="0 B" estimate.memory.required.full="0 B" estimate.memory.required.partial="0 B" estimate.memory.required.kv="0 B" estimate.memory.required.allocations=[] estimate.memory.weights.total="0 B" estimate.memory.weights.repeating="0 B" estimate.memory.weights.nonrepeating="0 B" estimate.memory.graph.full="0 B" estimate.memory.graph.partial="0 B"
time=2025-11-04T05:37:38.164-06:00 level=DEBUG source=sched.go:804 msg="no idle runners, picking the shortest duration" runner_count=1 runner.name=registry.ollama.ai/library/qwen3:4b-instruct-28k runner.inference="[{ID:0 Library:Metal}]" runner.size="8.4 GiB" runner.vram="8.4 GiB" runner.parallel=1 runner.pid=21138 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a runner.num_ctx=28000
time=2025-11-04T05:37:38.164-06:00 level=DEBUG source=sched.go:229 msg="resetting model to expire immediately to make room" runner.name=registry.ollama.ai/library/qwen3:4b-instruct-28k runner.inference="[{ID:0 Library:Metal}]" runner.size="8.4 GiB" runner.vram="8.4 GiB" runner.parallel=1 runner.pid=21138 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a runner.num_ctx=28000 refCount=1
time=2025-11-04T05:37:38.164-06:00 level=DEBUG source=sched.go:240 msg="waiting for pending requests to complete and unload to occur" runner.name=registry.ollama.ai/library/qwen3:4b-instruct-28k runner.inference="[{ID:0 Library:Metal}]" runner.size="8.4 GiB" runner.vram="8.4 GiB" runner.parallel=1 runner.pid=21138 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a runner.num_ctx=28000
time=2025-11-04T05:37:38.176-06:00 level=INFO source=runner.go:910 msg="starting go runner"
time=2025-11-04T05:37:38.176-06:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/Applications/Ollama.app/Contents/Resources
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.023 sec
ggml_metal_device_init: GPU name:   Apple M2
ggml_metal_device_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 11453.25 MB
time=2025-11-04T05:37:38.177-06:00 level=INFO source=ggml.go:104 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 compiler=cgo(clang)
time=2025-11-04T05:37:38.282-06:00 level=INFO source=runner.go:946 msg="Server listening on 127.0.0.1:52557"
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mm_q8_0_f32', name = 'kernel_mul_mm_q8_0_f32_bci=0_bco=1'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mm_q8_0_f32_bci=0_bco=1            0x14d60f190 | th_max =  896 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_get_rows_f32', name = 'kernel_get_rows_f32'
ggml_metal_library_compile_pipeline: loaded kernel_get_rows_f32                           0x14d70f030 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_add_row_c4_fuse_1', name = 'kernel_add_row_c4_fuse_1'
ggml_metal_library_compile_pipeline: loaded kernel_add_row_c4_fuse_1                      0x14d70f520 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mv_q8_0_f32', name = 'kernel_mul_mv_q8_0_f32_nsg=4'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mv_q8_0_f32_nsg=4                  0x14d60f830 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_flash_attn_ext_vec_f16_dk128_dv128', name = 'kernel_flash_attn_ext_vec_f16_dk128_dv128_mask=1_sink=0_bias=0_scap=0_kvpad=0_ns10=1024_ns20=1024_nsg=4_nwg=32'
ggml_metal_library_compile_pipeline: loaded kernel_flash_attn_ext_vec_f16_dk128_dv128_mask=1_sink=0_bias=0_scap=0_kvpad=0_ns10=1024_ns20=1024_nsg=4_nwg=32      0x14d6107d0 | th_max =  448 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_flash_attn_ext_vec_reduce', name = 'kernel_flash_attn_ext_vec_reduce_dv=128_nwg=32'
ggml_metal_library_compile_pipeline: loaded kernel_flash_attn_ext_vec_reduce_dv=128_nwg=32      0x14d610be0 | th_max = 1024 | th_width =   32
[GIN] 2025/11/04 - 05:38:22 | 200 |  48.20005125s |  100.116.113.16 | POST     "/v1/chat/completions"
time=2025-11-04T05:38:22.942-06:00 level=DEBUG source=sched.go:513 msg="context for request finished"
time=2025-11-04T05:38:22.942-06:00 level=DEBUG source=sched.go:276 msg="runner with zero duration has gone idle, expiring to unload" runner.name=registry.ollama.ai/library/qwen3:4b-instruct-28k runner.inference="[{ID:0 Library:Metal}]" runner.size="8.4 GiB" runner.vram="8.4 GiB" runner.parallel=1 runner.pid=21138 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a runner.num_ctx=28000
time=2025-11-04T05:38:22.942-06:00 level=DEBUG source=sched.go:301 msg="after processing request finished event" runner.name=registry.ollama.ai/library/qwen3:4b-instruct-28k runner.inference="[{ID:0 Library:Metal}]" runner.size="8.4 GiB" runner.vram="8.4 GiB" runner.parallel=1 runner.pid=21138 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a runner.num_ctx=28000 refCount=0
time=2025-11-04T05:38:22.942-06:00 level=DEBUG source=sched.go:304 msg="runner expired event received" runner.name=registry.ollama.ai/library/qwen3:4b-instruct-28k runner.inference="[{ID:0 Library:Metal}]" runner.size="8.4 GiB" runner.vram="8.4 GiB" runner.parallel=1 runner.pid=21138 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a runner.num_ctx=28000
time=2025-11-04T05:38:22.942-06:00 level=DEBUG source=sched.go:319 msg="got lock to unload expired event" runner.name=registry.ollama.ai/library/qwen3:4b-instruct-28k runner.inference="[{ID:0 Library:Metal}]" runner.size="8.4 GiB" runner.vram="8.4 GiB" runner.parallel=1 runner.pid=21138 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a runner.num_ctx=28000
time=2025-11-04T05:38:22.942-06:00 level=DEBUG source=sched.go:342 msg="starting background wait for VRAM recovery" runner.name=registry.ollama.ai/library/qwen3:4b-instruct-28k runner.inference="[{ID:0 Library:Metal}]" runner.size="8.4 GiB" runner.vram="8.4 GiB" runner.parallel=1 runner.pid=21138 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a runner.num_ctx=28000
time=2025-11-04T05:38:22.942-06:00 level=DEBUG source=sched.go:650 msg="no need to wait for VRAM recovery" runner.name=registry.ollama.ai/library/qwen3:4b-instruct-28k runner.inference="[{ID:0 Library:Metal}]" runner.size="8.4 GiB" runner.vram="8.4 GiB" runner.parallel=1 runner.pid=21138 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a runner.num_ctx=28000
time=2025-11-04T05:38:22.942-06:00 level=DEBUG source=server.go:1699 msg="stopping llama server" pid=21138
time=2025-11-04T05:38:22.942-06:00 level=DEBUG source=server.go:1705 msg="waiting for llama server to exit" pid=21138
time=2025-11-04T05:38:22.969-06:00 level=DEBUG source=server.go:1709 msg="llama server stopped" pid=21138
time=2025-11-04T05:38:22.969-06:00 level=DEBUG source=sched.go:351 msg="runner terminated and removed from list, blocking for VRAM recovery" runner.size="8.4 GiB" runner.vram="8.4 GiB" runner.parallel=1 runner.pid=21138 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a
time=2025-11-04T05:38:22.969-06:00 level=DEBUG source=sched.go:354 msg="sending an unloaded event" runner.size="8.4 GiB" runner.vram="8.4 GiB" runner.parallel=1 runner.pid=21138 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a
time=2025-11-04T05:38:22.969-06:00 level=DEBUG source=sched.go:246 msg="unload completed" runner.size="8.4 GiB" runner.vram="8.4 GiB" runner.parallel=1 runner.pid=21138 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a
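
To summarize the eviction dance from "model requires more memory" down to "unload completed": llama3.2 cannot fit next to the 8.4 GiB qwen3 runner, so the scheduler resets qwen3's keep-alive to expire immediately, waits for the in-flight request to drain (the 48 s `/v1/chat/completions` above), then stops the runner and waits for VRAM to recover. In rough, hypothetical Go (a paraphrase of what these sched.go messages imply, not the real implementation):

```go
package main

// Hypothetical paraphrase of the scheduler flow logged above; none of
// these names are ollama's real API.
type runner struct{ vramGiB float64 }

func (r *runner) expireImmediately() {} // "resetting model to expire immediately"
func (r *runner) waitForInflight()   {} // block until refCount reaches 0
func (r *runner) unload()            {} // "stopping llama server" ... "llama server stopped"

func makeRoom(neededGiB, freeGiB float64, loaded []*runner) {
	if neededGiB <= freeGiB {
		return
	}
	victim := loaded[0] // "no idle runners, picking the shortest duration"
	victim.expireImmediately()
	victim.waitForInflight() // here: the 48 s chat completion finishing
	victim.unload()
	// then "blocking for VRAM recovery" before the new model loads
}

func main() { makeRoom(5.1, 2.7, []*runner{{vramGiB: 8.4}}) }
```
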
time=2025-11-04T05:38:22.969-06:00 level=DEBUG source=runner.go:41 msg="overall device VRAM discovery took" duration=500ns
time=2025-11-04T05:38:22.979-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.alignment default=32
time=2025-11-04T05:38:22.979-06:00 level=DEBUG source=sched.go:204 msg="loading first model" model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff
time=2025-11-04T05:38:23.007-06:00 level=INFO source=server.go:470 msg="system memory" total="16.0 GiB" free="1.9 GiB" free_swap="0 B"
time=2025-11-04T05:38:23.007-06:00 level=DEBUG source=memory.go:198 msg=evaluating library=Metal gpu_count=1 available="[2.7 GiB]"
time=2025-11-04T05:38:23.007-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=llama.vision.block_count default=0
time=2025-11-04T05:38:23.008-06:00 level=DEBUG source=ggml.go:611 msg="default cache size estimate" "attention MiB"=1792 "attention bytes"=1879048192 "recurrent MiB"=0 "recurrent bytes"=0
time=2025-11-04T05:38:23.008-06:00 level=DEBUG source=memory.go:198 msg=evaluating library=Metal gpu_count=1 available="[2.7 GiB]"
time=2025-11-04T05:38:23.008-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=llama.vision.block_count default=0
time=2025-11-04T05:38:23.008-06:00 level=DEBUG source=ggml.go:611 msg="default cache size estimate" "attention MiB"=1792 "attention bytes"=1879048192 "recurrent MiB"=0 "recurrent bytes"=0
time=2025-11-04T05:38:23.008-06:00 level=INFO source=server.go:522 msg=offload library=Metal layers.requested=-1 layers.model=29 layers.offload=10 layers.split=[10] memory.available="[2.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="5.1 GiB" memory.required.partial="2.6 GiB" memory.required.kv="1.8 GiB" memory.required.allocations="[2.6 GiB]" memory.weights.total="1.9 GiB" memory.weights.repeating="1.6 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="824.0 MiB" memory.graph.partial="824.0 MiB"
time=2025-11-04T05:38:23.013-06:00 level=INFO source=runner.go:845 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:16384 KvCacheType: NumThreads:4 GPULayers:10[ID:0 Layers:10(18..27)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
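
The offload estimate above works out to partial GPU use: 28 repeating layers plus the output layer gives layers.model=29; repeating weights (1.6 GiB) plus KV cache (1.8 GiB) spread over 28 layers is roughly 124 MiB per layer, and once the 824 MiB compute graph and some reserve are carved out of the 2.7 GiB the scheduler believes is free, only about 10 layers fit (the 2.6 GiB required.partial figure), hence `layers.offload=10` with the remaining 19 layers on CPU. Worth noting for this issue: `memory.available` is still the 2.7 GiB figure from when qwen3 was resident, even though that runner was unloaded moments earlier and free system memory was measured at 1.9 GiB after the unload, which looks like why llama3.2 ends up mostly on CPU here.
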
time=2025-11-04T05:38:23.016-06:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding"
llama_model_load_from_file_impl: using device Metal (Apple M2) (unknown id) - 10922 MiB free
time=2025-11-04T05:38:23.016-06:00 level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   8:                          llama.block_count u32              = 28
llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
llama_model_loader: - kv  10:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  16:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  17:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  18:                          general.file_type u32              = 15
llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type q4_K:  168 tensors
llama_model_loader: - type q6_K:   29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 1.87 GiB (5.01 BPW)
init_tokenizer: initializing tokenizer for type 2
load: control token: 128254 '<|reserved_special_token_246|>' is not marked as EOG
load: control token: 128252 '<|reserved_special_token_244|>' is not marked as EOG
load: control token: 128251 '<|reserved_special_token_243|>' is not marked as EOG
load: control token: 128250 '<|reserved_special_token_242|>' is not marked as EOG
load: control token: 128248 '<|reserved_special_token_240|>' is not marked as EOG
load: control token: 128247 '<|reserved_special_token_239|>' is not marked as EOG
load: control token: 128245 '<|reserved_special_token_237|>' is not marked as EOG
load: control token: 128244 '<|reserved_special_token_236|>' is not marked as EOG
load: control token: 128243 '<|reserved_special_token_235|>' is not marked as EOG
load: control token: 128240 '<|reserved_special_token_232|>' is not marked as EOG
load: control token: 128238 '<|reserved_special_token_230|>' is not marked as EOG
load: control token: 128235 '<|reserved_special_token_227|>' is not marked as EOG
load: control token: 128234 '<|reserved_special_token_226|>' is not marked as EOG
load: control token: 128229 '<|reserved_special_token_221|>' is not marked as EOG
load: control token: 128227 '<|reserved_special_token_219|>' is not marked as EOG
load: control token: 128226 '<|reserved_special_token_218|>' is not marked as EOG
load: control token: 128224 '<|reserved_special_token_216|>' is not marked as EOG
load: control token: 128223 '<|reserved_special_token_215|>' is not marked as EOG
load: control token: 128221 '<|reserved_special_token_213|>' is not marked as EOG
load: control token: 128219 '<|reserved_special_token_211|>' is not marked as EOG
load: control token: 128218 '<|reserved_special_token_210|>' is not marked as EOG
load: control token: 128217 '<|reserved_special_token_209|>' is not marked as EOG
load: control token: 128216 '<|reserved_special_token_208|>' is not marked as EOG
load: control token: 128215 '<|reserved_special_token_207|>' is not marked as EOG
load: control token: 128213 '<|reserved_special_token_205|>' is not marked as EOG
load: control token: 128211 '<|reserved_special_token_203|>' is not marked as EOG
load: control token: 128210 '<|reserved_special_token_202|>' is not marked as EOG
load: control token: 128209 '<|reserved_special_token_201|>' is not marked as EOG
load: control token: 128208 '<|reserved_special_token_200|>' is not marked as EOG
load: control token: 128207 '<|reserved_special_token_199|>' is not marked as EOG
load: control token: 128204 '<|reserved_special_token_196|>' is not marked as EOG
load: control token: 128202 '<|reserved_special_token_194|>' is not marked as EOG
load: control token: 128197 '<|reserved_special_token_189|>' is not marked as EOG
load: control token: 128195 '<|reserved_special_token_187|>' is not marked as EOG
load: control token: 128194 '<|reserved_special_token_186|>' is not marked as EOG
load: control token: 128191 '<|reserved_special_token_183|>' is not marked as EOG
load: control token: 128190 '<|reserved_special_token_182|>' is not marked as EOG
load: control token: 128188 '<|reserved_special_token_180|>' is not marked as EOG
load: control token: 128187 '<|reserved_special_token_179|>' is not marked as EOG
load: control token: 128185 '<|reserved_special_token_177|>' is not marked as EOG
load: control token: 128184 '<|reserved_special_token_176|>' is not marked as EOG
load: control token: 128183 '<|reserved_special_token_175|>' is not marked as EOG
load: control token: 128178 '<|reserved_special_token_170|>' is not marked as EOG
load: control token: 128177 '<|reserved_special_token_169|>' is not marked as EOG
load: control token: 128176 '<|reserved_special_token_168|>' is not marked as EOG
load: control token: 128175 '<|reserved_special_token_167|>' is not marked as EOG
load: control token: 128174 '<|reserved_special_token_166|>' is not marked as EOG
load: control token: 128173 '<|reserved_special_token_165|>' is not marked as EOG
load: control token: 128172 '<|reserved_special_token_164|>' is not marked as EOG
load: control token: 128169 '<|reserved_special_token_161|>' is not marked as EOG
load: control token: 128167 '<|reserved_special_token_159|>' is not marked as EOG
load: control token: 128166 '<|reserved_special_token_158|>' is not marked as EOG
load: control token: 128160 '<|reserved_special_token_152|>' is not marked as EOG
load: control token: 128159 '<|reserved_special_token_151|>' is not marked as EOG
load: control token: 128157 '<|reserved_special_token_149|>' is not marked as EOG
load: control token: 128156 '<|reserved_special_token_148|>' is not marked as EOG
load: control token: 128154 '<|reserved_special_token_146|>' is not marked as EOG
load: control token: 128152 '<|reserved_special_token_144|>' is not marked as EOG
load: control token: 128151 '<|reserved_special_token_143|>' is not marked as EOG
load: control token: 128150 '<|reserved_special_token_142|>' is not marked as EOG
load: control token: 128147 '<|reserved_special_token_139|>' is not marked as EOG
load: control token: 128144 '<|reserved_special_token_136|>' is not marked as EOG
load: control token: 128142 '<|reserved_special_token_134|>' is not marked as EOG
load: control token: 128141 '<|reserved_special_token_133|>' is not marked as EOG
load: control token: 128140 '<|reserved_special_token_132|>' is not marked as EOG
load: control token: 128133 '<|reserved_special_token_125|>' is not marked as EOG
load: control token: 128130 '<|reserved_special_token_122|>' is not marked as EOG
load: control token: 128128 '<|reserved_special_token_120|>' is not marked as EOG
load: control token: 128127 '<|reserved_special_token_119|>' is not marked as EOG
load: control token: 128126 '<|reserved_special_token_118|>' is not marked as EOG
load: control token: 128125 '<|reserved_special_token_117|>' is not marked as EOG
load: control token: 128124 '<|reserved_special_token_116|>' is not marked as EOG
load: control token: 128123 '<|reserved_special_token_115|>' is not marked as EOG
load: control token: 128122 '<|reserved_special_token_114|>' is not marked as EOG
load: control token: 128121 '<|reserved_special_token_113|>' is not marked as EOG
load: control token: 128120 '<|reserved_special_token_112|>' is not marked as EOG
load: control token: 128119 '<|reserved_special_token_111|>' is not marked as EOG
load: control token: 128116 '<|reserved_special_token_108|>' is not marked as EOG
load: control token: 128115 '<|reserved_special_token_107|>' is not marked as EOG
load: control token: 128114 '<|reserved_special_token_106|>' is not marked as EOG
load: control token: 128113 '<|reserved_special_token_105|>' is not marked as EOG
load: control token: 128111 '<|reserved_special_token_103|>' is not marked as EOG
load: control token: 128110 '<|reserved_special_token_102|>' is not marked as EOG
load: control token: 128107 '<|reserved_special_token_99|>' is not marked as EOG
load: control token: 128106 '<|reserved_special_token_98|>' is not marked as EOG
load: control token: 128105 '<|reserved_special_token_97|>' is not marked as EOG
load: control token: 128104 '<|reserved_special_token_96|>' is not marked as EOG
load: control token: 128103 '<|reserved_special_token_95|>' is not marked as EOG
load: control token: 128100 '<|reserved_special_token_92|>' is not marked as EOG
load: control token: 128097 '<|reserved_special_token_89|>' is not marked as EOG
load: control token: 128096 '<|reserved_special_token_88|>' is not marked as EOG
load: control token: 128094 '<|reserved_special_token_86|>' is not marked as EOG
load: control token: 128093 '<|reserved_special_token_85|>' is not marked as EOG
load: control token: 128090 '<|reserved_special_token_82|>' is not marked as EOG
load: control token: 128089 '<|reserved_special_token_81|>' is not marked as EOG
load: control token: 128087 '<|reserved_special_token_79|>' is not marked as EOG
load: control token: 128085 '<|reserved_special_token_77|>' is not marked as EOG
load: control token: 128080 '<|reserved_special_token_72|>' is not marked as EOG
load: control token: 128077 '<|reserved_special_token_69|>' is not marked as EOG
load: control token: 128076 '<|reserved_special_token_68|>' is not marked as EOG
load: control token: 128073 '<|reserved_special_token_65|>' is not marked as EOG
load: control token: 128070 '<|reserved_special_token_62|>' is not marked as EOG
load: control token: 128069 '<|reserved_special_token_61|>' is not marked as EOG
load: control token: 128067 '<|reserved_special_token_59|>' is not marked as EOG
load: control token: 128064 '<|reserved_special_token_56|>' is not marked as EOG
load: control token: 128062 '<|reserved_special_token_54|>' is not marked as EOG
load: control token: 128061 '<|reserved_special_token_53|>' is not marked as EOG
load: control token: 128060 '<|reserved_special_token_52|>' is not marked as EOG
load: control token: 128054 '<|reserved_special_token_46|>' is not marked as EOG
load: control token: 128045 '<|reserved_special_token_37|>' is not marked as EOG
load: control token: 128044 '<|reserved_special_token_36|>' is not marked as EOG
load: control token: 128043 '<|reserved_special_token_35|>' is not marked as EOG
load: control token: 128042 '<|reserved_special_token_34|>' is not marked as EOG
load: control token: 128038 '<|reserved_special_token_30|>' is not marked as EOG
load: control token: 128037 '<|reserved_special_token_29|>' is not marked as EOG
load: control token: 128035 '<|reserved_special_token_27|>' is not marked as EOG
load: control token: 128034 '<|reserved_special_token_26|>' is not marked as EOG
load: control token: 128033 '<|reserved_special_token_25|>' is not marked as EOG
load: control token: 128032 '<|reserved_special_token_24|>' is not marked as EOG
load: control token: 128030 '<|reserved_special_token_22|>' is not marked as EOG
load: control token: 128029 '<|reserved_special_token_21|>' is not marked as EOG
load: control token: 128028 '<|reserved_special_token_20|>' is not marked as EOG
load: control token: 128026 '<|reserved_special_token_18|>' is not marked as EOG
load: control token: 128025 '<|reserved_special_token_17|>' is not marked as EOG
load: control token: 128024 '<|reserved_special_token_16|>' is not marked as EOG
load: control token: 128022 '<|reserved_special_token_14|>' is not marked as EOG
load: control token: 128020 '<|reserved_special_token_12|>' is not marked as EOG
load: control token: 128017 '<|reserved_special_token_9|>' is not marked as EOG
load: control token: 128016 '<|reserved_special_token_8|>' is not marked as EOG
load: control token: 128015 '<|reserved_special_token_7|>' is not marked as EOG
load: control token: 128014 '<|reserved_special_token_6|>' is not marked as EOG
load: control token: 128013 '<|reserved_special_token_5|>' is not marked as EOG
load: control token: 128011 '<|reserved_special_token_3|>' is not marked as EOG
load: control token: 128010 '<|python_tag|>' is not marked as EOG
load: control token: 128006 '<|start_header_id|>' is not marked as EOG
load: control token: 128003 '<|reserved_special_token_1|>' is not marked as EOG
load: control token: 128002 '<|reserved_special_token_0|>' is not marked as EOG
load: control token: 128000 '<|begin_of_text|>' is not marked as EOG
load: control token: 128041 '<|reserved_special_token_33|>' is not marked as EOG
load: control token: 128063 '<|reserved_special_token_55|>' is not marked as EOG
load: control token: 128046 '<|reserved_special_token_38|>' is not marked as EOG
load: control token: 128007 '<|end_header_id|>' is not marked as EOG
load: control token: 128065 '<|reserved_special_token_57|>' is not marked as EOG
load: control token: 128171 '<|reserved_special_token_163|>' is not marked as EOG
load: control token: 128162 '<|reserved_special_token_154|>' is not marked as EOG
load: control token: 128165 '<|reserved_special_token_157|>' is not marked as EOG
load: control token: 128057 '<|reserved_special_token_49|>' is not marked as EOG
load: control token: 128050 '<|reserved_special_token_42|>' is not marked as EOG
load: control token: 128056 '<|reserved_special_token_48|>' is not marked as EOG
load: control token: 128230 '<|reserved_special_token_222|>' is not marked as EOG
load: control token: 128098 '<|reserved_special_token_90|>' is not marked as EOG
load: control token: 128153 '<|reserved_special_token_145|>' is not marked as EOG
load: control token: 128084 '<|reserved_special_token_76|>' is not marked as EOG
load: control token: 128082 '<|reserved_special_token_74|>' is not marked as EOG
load: control token: 128102 '<|reserved_special_token_94|>' is not marked as EOG
load: control token: 128253 '<|reserved_special_token_245|>' is not marked as EOG
load: control token: 128179 '<|reserved_special_token_171|>' is not marked as EOG
load: control token: 128071 '<|reserved_special_token_63|>' is not marked as EOG
load: control token: 128135 '<|reserved_special_token_127|>' is not marked as EOG
load: control token: 128161 '<|reserved_special_token_153|>' is not marked as EOG
load: control token: 128164 '<|reserved_special_token_156|>' is not marked as EOG
load: control token: 128134 '<|reserved_special_token_126|>' is not marked as EOG
load: control token: 128249 '<|reserved_special_token_241|>' is not marked as EOG
load: control token: 128004 '<|finetune_right_pad_id|>' is not marked as EOG
load: control token: 128036 '<|reserved_special_token_28|>' is not marked as EOG
load: control token: 128148 '<|reserved_special_token_140|>' is not marked as EOG
load: control token: 128181 '<|reserved_special_token_173|>' is not marked as EOG
load: control token: 128222 '<|reserved_special_token_214|>' is not marked as EOG
load: control token: 128075 '<|reserved_special_token_67|>' is not marked as EOG
load: control token: 128241 '<|reserved_special_token_233|>' is not marked as EOG
load: control token: 128051 '<|reserved_special_token_43|>' is not marked as EOG
load: control token: 128068 '<|reserved_special_token_60|>' is not marked as EOG
load: control token: 128149 '<|reserved_special_token_141|>' is not marked as EOG
load: control token: 128201 '<|reserved_special_token_193|>' is not marked as EOG
load: control token: 128058 '<|reserved_special_token_50|>' is not marked as EOG
load: control token: 128146 '<|reserved_special_token_138|>' is not marked as EOG
load: control token: 128143 '<|reserved_special_token_135|>' is not marked as EOG
load: control token: 128023 '<|reserved_special_token_15|>' is not marked as EOG
load: control token: 128039 '<|reserved_special_token_31|>' is not marked as EOG
load: control token: 128132 '<|reserved_special_token_124|>' is not marked as EOG
load: control token: 128101 '<|reserved_special_token_93|>' is not marked as EOG
load: control token: 128212 '<|reserved_special_token_204|>' is not marked as EOG
load: control token: 128189 '<|reserved_special_token_181|>' is not marked as EOG
load: control token: 128225 '<|reserved_special_token_217|>' is not marked as EOG
load: control token: 128129 '<|reserved_special_token_121|>' is not marked as EOG
load: control token: 128005 '<|reserved_special_token_2|>' is not marked as EOG
load: control token: 128078 '<|reserved_special_token_70|>' is not marked as EOG
load: control token: 128163 '<|reserved_special_token_155|>' is not marked as EOG
load: control token: 128072 '<|reserved_special_token_64|>' is not marked as EOG
load: control token: 128112 '<|reserved_special_token_104|>' is not marked as EOG
load: control token: 128186 '<|reserved_special_token_178|>' is not marked as EOG
load: control token: 128095 '<|reserved_special_token_87|>' is not marked as EOG
load: control token: 128109 '<|reserved_special_token_101|>' is not marked as EOG
load: control token: 128099 '<|reserved_special_token_91|>' is not marked as EOG
load: control token: 128138 '<|reserved_special_token_130|>' is not marked as EOG
load: control token: 128193 '<|reserved_special_token_185|>' is not marked as EOG
load: control token: 128199 '<|reserved_special_token_191|>' is not marked as EOG
load: control token: 128048 '<|reserved_special_token_40|>' is not marked as EOG
load: control token: 128088 '<|reserved_special_token_80|>' is not marked as EOG
load: control token: 128192 '<|reserved_special_token_184|>' is not marked as EOG
load: control token: 128136 '<|reserved_special_token_128|>' is not marked as EOG
load: control token: 128092 '<|reserved_special_token_84|>' is not marked as EOG
load: control token: 128158 '<|reserved_special_token_150|>' is not marked as EOG
load: control token: 128049 '<|reserved_special_token_41|>' is not marked as EOG
load: control token: 128031 '<|reserved_special_token_23|>' is not marked as EOG
load: control token: 128255 '<|reserved_special_token_247|>' is not marked as EOG
load: control token: 128182 '<|reserved_special_token_174|>' is not marked as EOG
load: control token: 128066 '<|reserved_special_token_58|>' is not marked as EOG
load: control token: 128180 '<|reserved_special_token_172|>' is not marked as EOG
load: control token: 128233 '<|reserved_special_token_225|>' is not marked as EOG
load: control token: 128079 '<|reserved_special_token_71|>' is not marked as EOG
load: control token: 128081 '<|reserved_special_token_73|>' is not marked as EOG
load: control token: 128231 '<|reserved_special_token_223|>' is not marked as EOG
load: control token: 128196 '<|reserved_special_token_188|>' is not marked as EOG
load: control token: 128047 '<|reserved_special_token_39|>' is not marked as EOG
load: control token: 128083 '<|reserved_special_token_75|>' is not marked as EOG
load: control token: 128139 '<|reserved_special_token_131|>' is not marked as EOG
load: control token: 128131 '<|reserved_special_token_123|>' is not marked as EOG
load: control token: 128118 '<|reserved_special_token_110|>' is not marked as EOG
load: control token: 128053 '<|reserved_special_token_45|>' is not marked as EOG
load: control token: 128220 '<|reserved_special_token_212|>' is not marked as EOG
load: control token: 128108 '<|reserved_special_token_100|>' is not marked as EOG
load: control token: 128091 '<|reserved_special_token_83|>' is not marked as EOG
load: control token: 128203 '<|reserved_special_token_195|>' is not marked as EOG
load: control token: 128059 '<|reserved_special_token_51|>' is not marked as EOG
load: control token: 128019 '<|reserved_special_token_11|>' is not marked as EOG
load: control token: 128170 '<|reserved_special_token_162|>' is not marked as EOG
load: control token: 128205 '<|reserved_special_token_197|>' is not marked as EOG
load: control token: 128040 '<|reserved_special_token_32|>' is not marked as EOG
load: control token: 128200 '<|reserved_special_token_192|>' is not marked as EOG
load: control token: 128236 '<|reserved_special_token_228|>' is not marked as EOG
load: control token: 128145 '<|reserved_special_token_137|>' is not marked as EOG
load: control token: 128168 '<|reserved_special_token_160|>' is not marked as EOG
load: control token: 128214 '<|reserved_special_token_206|>' is not marked as EOG
load: control token: 128137 '<|reserved_special_token_129|>' is not marked as EOG
load: control token: 128232 '<|reserved_special_token_224|>' is not marked as EOG
load: control token: 128239 '<|reserved_special_token_231|>' is not marked as EOG
load: control token: 128055 '<|reserved_special_token_47|>' is not marked as EOG
load: control token: 128228 '<|reserved_special_token_220|>' is not marked as EOG
load: control token: 128206 '<|reserved_special_token_198|>' is not marked as EOG
load: control token: 128018 '<|reserved_special_token_10|>' is not marked as EOG
load: control token: 128012 '<|reserved_special_token_4|>' is not marked as EOG
load: control token: 128198 '<|reserved_special_token_190|>' is not marked as EOG
load: control token: 128021 '<|reserved_special_token_13|>' is not marked as EOG
load: control token: 128086 '<|reserved_special_token_78|>' is not marked as EOG
load: control token: 128074 '<|reserved_special_token_66|>' is not marked as EOG
load: control token: 128027 '<|reserved_special_token_19|>' is not marked as EOG
load: control token: 128242 '<|reserved_special_token_234|>' is not marked as EOG
load: control token: 128155 '<|reserved_special_token_147|>' is not marked as EOG
load: control token: 128052 '<|reserved_special_token_44|>' is not marked as EOG
load: control token: 128246 '<|reserved_special_token_238|>' is not marked as EOG
load: control token: 128117 '<|reserved_special_token_109|>' is not marked as EOG
load: control token: 128237 '<|reserved_special_token_229|>' is not marked as EOG
load: printing all EOG tokens:
load:   - 128001 ('<|end_of_text|>')
load:   - 128008 ('<|eom_id|>')
load:   - 128009 ('<|eot_id|>')
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 3072
print_info: n_layer          = 28
print_info: n_head           = 24
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 3
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 8192
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: model type       = 3B
print_info: model params     = 3.21 B
print_info: general.name     = Llama 3.2 3B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128001 '<|end_of_text|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128001 '<|end_of_text|>'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: layer   0 assigned to device CPU, is_swa = 0
load_tensors: layer   1 assigned to device CPU, is_swa = 0
load_tensors: layer   2 assigned to device CPU, is_swa = 0
load_tensors: layer   3 assigned to device CPU, is_swa = 0
load_tensors: layer   4 assigned to device CPU, is_swa = 0
load_tensors: layer   5 assigned to device CPU, is_swa = 0
load_tensors: layer   6 assigned to device CPU, is_swa = 0
load_tensors: layer   7 assigned to device CPU, is_swa = 0
load_tensors: layer   8 assigned to device CPU, is_swa = 0
load_tensors: layer   9 assigned to device CPU, is_swa = 0
load_tensors: layer  10 assigned to device CPU, is_swa = 0
load_tensors: layer  11 assigned to device CPU, is_swa = 0
load_tensors: layer  12 assigned to device CPU, is_swa = 0
load_tensors: layer  13 assigned to device CPU, is_swa = 0
load_tensors: layer  14 assigned to device CPU, is_swa = 0
load_tensors: layer  15 assigned to device CPU, is_swa = 0
load_tensors: layer  16 assigned to device CPU, is_swa = 0
load_tensors: layer  17 assigned to device CPU, is_swa = 0
load_tensors: layer  18 assigned to device Metal, is_swa = 0
load_tensors: layer  19 assigned to device Metal, is_swa = 0
load_tensors: layer  20 assigned to device Metal, is_swa = 0
load_tensors: layer  21 assigned to device Metal, is_swa = 0
load_tensors: layer  22 assigned to device Metal, is_swa = 0
load_tensors: layer  23 assigned to device Metal, is_swa = 0
load_tensors: layer  24 assigned to device Metal, is_swa = 0
load_tensors: layer  25 assigned to device Metal, is_swa = 0
load_tensors: layer  26 assigned to device Metal, is_swa = 0
load_tensors: layer  27 assigned to device Metal, is_swa = 0
load_tensors: layer  28 assigned to device CPU, is_swa = 0
create_tensor: loading tensor token_embd.weight
create_tensor: loading tensor output_norm.weight
create_tensor: loading tensor blk.0.attn_norm.weight
create_tensor: loading tensor blk.0.attn_q.weight
create_tensor: loading tensor blk.0.attn_k.weight
create_tensor: loading tensor blk.0.attn_v.weight
create_tensor: loading tensor blk.0.attn_output.weight
create_tensor: loading tensor blk.0.ffn_norm.weight
create_tensor: loading tensor rope_freqs.weight
create_tensor: loading tensor blk.0.ffn_gate.weight
create_tensor: loading tensor blk.0.ffn_down.weight
create_tensor: loading tensor blk.0.ffn_up.weight
create_tensor: loading tensor blk.1.attn_norm.weight
create_tensor: loading tensor blk.1.attn_q.weight
create_tensor: loading tensor blk.1.attn_k.weight
create_tensor: loading tensor blk.1.attn_v.weight
create_tensor: loading tensor blk.1.attn_output.weight
create_tensor: loading tensor blk.1.ffn_norm.weight
create_tensor: loading tensor blk.1.ffn_gate.weight
create_tensor: loading tensor blk.1.ffn_down.weight
create_tensor: loading tensor blk.1.ffn_up.weight
create_tensor: loading tensor blk.2.attn_norm.weight
create_tensor: loading tensor blk.2.attn_q.weight
create_tensor: loading tensor blk.2.attn_k.weight
create_tensor: loading tensor blk.2.attn_v.weight
create_tensor: loading tensor blk.2.attn_output.weight
create_tensor: loading tensor blk.2.ffn_norm.weight
create_tensor: loading tensor blk.2.ffn_gate.weight
create_tensor: loading tensor blk.2.ffn_down.weight
create_tensor: loading tensor blk.2.ffn_up.weight
create_tensor: loading tensor blk.3.attn_norm.weight
create_tensor: loading tensor blk.3.attn_q.weight
create_tensor: loading tensor blk.3.attn_k.weight
create_tensor: loading tensor blk.3.attn_v.weight
create_tensor: loading tensor blk.3.attn_output.weight
create_tensor: loading tensor blk.3.ffn_norm.weight
create_tensor: loading tensor blk.3.ffn_gate.weight
create_tensor: loading tensor blk.3.ffn_down.weight
create_tensor: loading tensor blk.3.ffn_up.weight
create_tensor: loading tensor blk.4.attn_norm.weight
create_tensor: loading tensor blk.4.attn_q.weight
create_tensor: loading tensor blk.4.attn_k.weight
create_tensor: loading tensor blk.4.attn_v.weight
create_tensor: loading tensor blk.4.attn_output.weight
create_tensor: loading tensor blk.4.ffn_norm.weight
create_tensor: loading tensor blk.4.ffn_gate.weight
create_tensor: loading tensor blk.4.ffn_down.weight
create_tensor: loading tensor blk.4.ffn_up.weight
create_tensor: loading tensor blk.5.attn_norm.weight
create_tensor: loading tensor blk.5.attn_q.weight
create_tensor: loading tensor blk.5.attn_k.weight
create_tensor: loading tensor blk.5.attn_v.weight
create_tensor: loading tensor blk.5.attn_output.weight
create_tensor: loading tensor blk.5.ffn_norm.weight
create_tensor: loading tensor blk.5.ffn_gate.weight
create_tensor: loading tensor blk.5.ffn_down.weight
create_tensor: loading tensor blk.5.ffn_up.weight
create_tensor: loading tensor blk.6.attn_norm.weight
create_tensor: loading tensor blk.6.attn_q.weight
create_tensor: loading tensor blk.6.attn_k.weight
create_tensor: loading tensor blk.6.attn_v.weight
create_tensor: loading tensor blk.6.attn_output.weight
create_tensor: loading tensor blk.6.ffn_norm.weight
create_tensor: loading tensor blk.6.ffn_gate.weight
create_tensor: loading tensor blk.6.ffn_down.weight
create_tensor: loading tensor blk.6.ffn_up.weight
create_tensor: loading tensor blk.7.attn_norm.weight
create_tensor: loading tensor blk.7.attn_q.weight
create_tensor: loading tensor blk.7.attn_k.weight
create_tensor: loading tensor blk.7.attn_v.weight
create_tensor: loading tensor blk.7.attn_output.weight
create_tensor: loading tensor blk.7.ffn_norm.weight
create_tensor: loading tensor blk.7.ffn_gate.weight
create_tensor: loading tensor blk.7.ffn_down.weight
create_tensor: loading tensor blk.7.ffn_up.weight
create_tensor: loading tensor blk.8.attn_norm.weight
create_tensor: loading tensor blk.8.attn_q.weight
create_tensor: loading tensor blk.8.attn_k.weight
create_tensor: loading tensor blk.8.attn_v.weight
create_tensor: loading tensor blk.8.attn_output.weight
create_tensor: loading tensor blk.8.ffn_norm.weight
create_tensor: loading tensor blk.8.ffn_gate.weight
create_tensor: loading tensor blk.8.ffn_down.weight
create_tensor: loading tensor blk.8.ffn_up.weight
create_tensor: loading tensor blk.9.attn_norm.weight
create_tensor: loading tensor blk.9.attn_q.weight
create_tensor: loading tensor blk.9.attn_k.weight
create_tensor: loading tensor blk.9.attn_v.weight
create_tensor: loading tensor blk.9.attn_output.weight
create_tensor: loading tensor blk.9.ffn_norm.weight
create_tensor: loading tensor blk.9.ffn_gate.weight
create_tensor: loading tensor blk.9.ffn_down.weight
create_tensor: loading tensor blk.9.ffn_up.weight
create_tensor: loading tensor blk.10.attn_norm.weight
create_tensor: loading tensor blk.10.attn_q.weight
create_tensor: loading tensor blk.10.attn_k.weight
create_tensor: loading tensor blk.10.attn_v.weight
create_tensor: loading tensor blk.10.attn_output.weight
create_tensor: loading tensor blk.10.ffn_norm.weight
create_tensor: loading tensor blk.10.ffn_gate.weight
create_tensor: loading tensor blk.10.ffn_down.weight
create_tensor: loading tensor blk.10.ffn_up.weight
create_tensor: loading tensor blk.11.attn_norm.weight
create_tensor: loading tensor blk.11.attn_q.weight
create_tensor: loading tensor blk.11.attn_k.weight
create_tensor: loading tensor blk.11.attn_v.weight
create_tensor: loading tensor blk.11.attn_output.weight
create_tensor: loading tensor blk.11.ffn_norm.weight
create_tensor: loading tensor blk.11.ffn_gate.weight
create_tensor: loading tensor blk.11.ffn_down.weight
create_tensor: loading tensor blk.11.ffn_up.weight
create_tensor: loading tensor blk.12.attn_norm.weight
create_tensor: loading tensor blk.12.attn_q.weight
create_tensor: loading tensor blk.12.attn_k.weight
create_tensor: loading tensor blk.12.attn_v.weight
create_tensor: loading tensor blk.12.attn_output.weight
create_tensor: loading tensor blk.12.ffn_norm.weight
create_tensor: loading tensor blk.12.ffn_gate.weight
create_tensor: loading tensor blk.12.ffn_down.weight
create_tensor: loading tensor blk.12.ffn_up.weight
create_tensor: loading tensor blk.13.attn_norm.weight
create_tensor: loading tensor blk.13.attn_q.weight
create_tensor: loading tensor blk.13.attn_k.weight
create_tensor: loading tensor blk.13.attn_v.weight
create_tensor: loading tensor blk.13.attn_output.weight
create_tensor: loading tensor blk.13.ffn_norm.weight
create_tensor: loading tensor blk.13.ffn_gate.weight
create_tensor: loading tensor blk.13.ffn_down.weight
create_tensor: loading tensor blk.13.ffn_up.weight
create_tensor: loading tensor blk.14.attn_norm.weight
create_tensor: loading tensor blk.14.attn_q.weight
create_tensor: loading tensor blk.14.attn_k.weight
create_tensor: loading tensor blk.14.attn_v.weight
create_tensor: loading tensor blk.14.attn_output.weight
create_tensor: loading tensor blk.14.ffn_norm.weight
create_tensor: loading tensor blk.14.ffn_gate.weight
create_tensor: loading tensor blk.14.ffn_down.weight
create_tensor: loading tensor blk.14.ffn_up.weight
create_tensor: loading tensor blk.15.attn_norm.weight
create_tensor: loading tensor blk.15.attn_q.weight
create_tensor: loading tensor blk.15.attn_k.weight
create_tensor: loading tensor blk.15.attn_v.weight
create_tensor: loading tensor blk.15.attn_output.weight
create_tensor: loading tensor blk.15.ffn_norm.weight
create_tensor: loading tensor blk.15.ffn_gate.weight
create_tensor: loading tensor blk.15.ffn_down.weight
create_tensor: loading tensor blk.15.ffn_up.weight
create_tensor: loading tensor blk.16.attn_norm.weight
create_tensor: loading tensor blk.16.attn_q.weight
create_tensor: loading tensor blk.16.attn_k.weight
create_tensor: loading tensor blk.16.attn_v.weight
create_tensor: loading tensor blk.16.attn_output.weight
create_tensor: loading tensor blk.16.ffn_norm.weight
create_tensor: loading tensor blk.16.ffn_gate.weight
create_tensor: loading tensor blk.16.ffn_down.weight
create_tensor: loading tensor blk.16.ffn_up.weight
create_tensor: loading tensor blk.17.attn_norm.weight
create_tensor: loading tensor blk.17.attn_q.weight
create_tensor: loading tensor blk.17.attn_k.weight
create_tensor: loading tensor blk.17.attn_v.weight
create_tensor: loading tensor blk.17.attn_output.weight
create_tensor: loading tensor blk.17.ffn_norm.weight
create_tensor: loading tensor blk.17.ffn_gate.weight
create_tensor: loading tensor blk.17.ffn_down.weight
create_tensor: loading tensor blk.17.ffn_up.weight
create_tensor: loading tensor blk.18.attn_norm.weight
create_tensor: loading tensor blk.18.attn_q.weight
create_tensor: loading tensor blk.18.attn_k.weight
create_tensor: loading tensor blk.18.attn_v.weight
create_tensor: loading tensor blk.18.attn_output.weight
create_tensor: loading tensor blk.18.ffn_norm.weight
create_tensor: loading tensor rope_freqs.weight
create_tensor: loading tensor blk.18.ffn_gate.weight
create_tensor: loading tensor blk.18.ffn_down.weight
create_tensor: loading tensor blk.18.ffn_up.weight
create_tensor: loading tensor blk.19.attn_norm.weight
create_tensor: loading tensor blk.19.attn_q.weight
create_tensor: loading tensor blk.19.attn_k.weight
create_tensor: loading tensor blk.19.attn_v.weight
create_tensor: loading tensor blk.19.attn_output.weight
create_tensor: loading tensor blk.19.ffn_norm.weight
create_tensor: loading tensor blk.19.ffn_gate.weight
create_tensor: loading tensor blk.19.ffn_down.weight
create_tensor: loading tensor blk.19.ffn_up.weight
create_tensor: loading tensor blk.20.attn_norm.weight
create_tensor: loading tensor blk.20.attn_q.weight
create_tensor: loading tensor blk.20.attn_k.weight
create_tensor: loading tensor blk.20.attn_v.weight
create_tensor: loading tensor blk.20.attn_output.weight
create_tensor: loading tensor blk.20.ffn_norm.weight
create_tensor: loading tensor blk.20.ffn_gate.weight
create_tensor: loading tensor blk.20.ffn_down.weight
create_tensor: loading tensor blk.20.ffn_up.weight
create_tensor: loading tensor blk.21.attn_norm.weight
create_tensor: loading tensor blk.21.attn_q.weight
create_tensor: loading tensor blk.21.attn_k.weight
create_tensor: loading tensor blk.21.attn_v.weight
create_tensor: loading tensor blk.21.attn_output.weight
create_tensor: loading tensor blk.21.ffn_norm.weight
create_tensor: loading tensor blk.21.ffn_gate.weight
create_tensor: loading tensor blk.21.ffn_down.weight
create_tensor: loading tensor blk.21.ffn_up.weight
create_tensor: loading tensor blk.22.attn_norm.weight
create_tensor: loading tensor blk.22.attn_q.weight
create_tensor: loading tensor blk.22.attn_k.weight
create_tensor: loading tensor blk.22.attn_v.weight
create_tensor: loading tensor blk.22.attn_output.weight
create_tensor: loading tensor blk.22.ffn_norm.weight
create_tensor: loading tensor blk.22.ffn_gate.weight
create_tensor: loading tensor blk.22.ffn_down.weight
create_tensor: loading tensor blk.22.ffn_up.weight
create_tensor: loading tensor blk.23.attn_norm.weight
create_tensor: loading tensor blk.23.attn_q.weight
create_tensor: loading tensor blk.23.attn_k.weight
create_tensor: loading tensor blk.23.attn_v.weight
create_tensor: loading tensor blk.23.attn_output.weight
create_tensor: loading tensor blk.23.ffn_norm.weight
create_tensor: loading tensor blk.23.ffn_gate.weight
create_tensor: loading tensor blk.23.ffn_down.weight
create_tensor: loading tensor blk.23.ffn_up.weight
create_tensor: loading tensor blk.24.attn_norm.weight
create_tensor: loading tensor blk.24.attn_q.weight
create_tensor: loading tensor blk.24.attn_k.weight
create_tensor: loading tensor blk.24.attn_v.weight
create_tensor: loading tensor blk.24.attn_output.weight
create_tensor: loading tensor blk.24.ffn_norm.weight
create_tensor: loading tensor blk.24.ffn_gate.weight
create_tensor: loading tensor blk.24.ffn_down.weight
create_tensor: loading tensor blk.24.ffn_up.weight
create_tensor: loading tensor blk.25.attn_norm.weight
create_tensor: loading tensor blk.25.attn_q.weight
create_tensor: loading tensor blk.25.attn_k.weight
create_tensor: loading tensor blk.25.attn_v.weight
create_tensor: loading tensor blk.25.attn_output.weight
create_tensor: loading tensor blk.25.ffn_norm.weight
create_tensor: loading tensor blk.25.ffn_gate.weight
create_tensor: loading tensor blk.25.ffn_down.weight
create_tensor: loading tensor blk.25.ffn_up.weight
create_tensor: loading tensor blk.26.attn_norm.weight
create_tensor: loading tensor blk.26.attn_q.weight
create_tensor: loading tensor blk.26.attn_k.weight
create_tensor: loading tensor blk.26.attn_v.weight
create_tensor: loading tensor blk.26.attn_output.weight
create_tensor: loading tensor blk.26.ffn_norm.weight
create_tensor: loading tensor blk.26.ffn_gate.weight
create_tensor: loading tensor blk.26.ffn_down.weight
create_tensor: loading tensor blk.26.ffn_up.weight
create_tensor: loading tensor blk.27.attn_norm.weight
create_tensor: loading tensor blk.27.attn_q.weight
create_tensor: loading tensor blk.27.attn_k.weight
create_tensor: loading tensor blk.27.attn_v.weight
create_tensor: loading tensor blk.27.attn_output.weight
create_tensor: loading tensor blk.27.ffn_norm.weight
create_tensor: loading tensor blk.27.ffn_gate.weight
create_tensor: loading tensor blk.27.ffn_down.weight
create_tensor: loading tensor blk.27.ffn_up.weight
load_tensors: offloading 10 repeating layers to GPU
load_tensors: offloaded 10/29 layers to GPU
load_tensors:          CPU model buffer size =  1330.17 MiB
load_tensors:        Metal model buffer size =   588.19 MiB
load_all_data: no device found for buffer type CPU for async uploads
time=2025-11-04T05:38:23.519-06:00 level=DEBUG source=server.go:1295 msg="model load progress 0.50"
load_all_data: device Metal does not support async, host buffers or events
time=2025-11-04T05:38:23.771-06:00 level=DEBUG source=server.go:1295 msg="model load progress 0.83"
llama_init_from_model: model default pooling_type is [0], but [-1] was specified
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 16384
llama_context: n_ctx_per_seq = 16384
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = disabled
llama_context: kv_unified    = false
llama_context: freq_base     = 500000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: use bfloat         = true
ggml_metal_init: use fusion         = true
ggml_metal_init: use concurrency    = true
ggml_metal_init: use graph optimize = true
set_abort_callback: call
llama_context:        CPU  output buffer size =     0.50 MiB
create_memory: n_ctx = 16384 (padded)
llama_kv_cache: layer   0: dev = CPU
llama_kv_cache: layer   1: dev = CPU
llama_kv_cache: layer   2: dev = CPU
llama_kv_cache: layer   3: dev = CPU
llama_kv_cache: layer   4: dev = CPU
llama_kv_cache: layer   5: dev = CPU
llama_kv_cache: layer   6: dev = CPU
llama_kv_cache: layer   7: dev = CPU
llama_kv_cache: layer   8: dev = CPU
llama_kv_cache: layer   9: dev = CPU
llama_kv_cache: layer  10: dev = CPU
llama_kv_cache: layer  11: dev = CPU
llama_kv_cache: layer  12: dev = CPU
llama_kv_cache: layer  13: dev = CPU
llama_kv_cache: layer  14: dev = CPU
llama_kv_cache: layer  15: dev = CPU
llama_kv_cache: layer  16: dev = CPU
llama_kv_cache: layer  17: dev = CPU
llama_kv_cache: layer  18: dev = Metal
llama_kv_cache: layer  19: dev = Metal
llama_kv_cache: layer  20: dev = Metal
llama_kv_cache: layer  21: dev = Metal
llama_kv_cache: layer  22: dev = Metal
llama_kv_cache: layer  23: dev = Metal
llama_kv_cache: layer  24: dev = Metal
llama_kv_cache: layer  25: dev = Metal
llama_kv_cache: layer  26: dev = Metal
llama_kv_cache: layer  27: dev = Metal
llama_kv_cache:        CPU KV buffer size =  1152.00 MiB
llama_kv_cache:      Metal KV buffer size =   640.00 MiB
llama_kv_cache: size = 1792.00 MiB ( 16384 cells,  28 layers,  1/1 seqs), K (f16):  896.00 MiB, V (f16):  896.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 3
llama_context: max_nodes = 2048
llama_context: reserving full memory module
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 1
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
llama_context:      Metal compute buffer size =   816.01 MiB
llama_context:        CPU compute buffer size =   828.01 MiB
llama_context: graph nodes  = 1014
llama_context: graph splits = 255 (with bs=512), 3 (with bs=1)
time=2025-11-04T05:38:24.021-06:00 level=INFO source=server.go:1289 msg="llama runner started in 45.86 seconds"
time=2025-11-04T05:38:24.021-06:00 level=INFO source=sched.go:493 msg="loaded runners" count=1
time=2025-11-04T05:38:24.021-06:00 level=DEBUG source=sched.go:602 msg="evaluating already loaded" model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff
time=2025-11-04T05:38:24.021-06:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding"
time=2025-11-04T05:38:24.021-06:00 level=INFO source=server.go:1289 msg="llama runner started in 45.86 seconds"
time=2025-11-04T05:38:24.021-06:00 level=DEBUG source=sched.go:505 msg="finished setting up" runner.name=registry.ollama.ai/library/llama3.2:3b-instruct-q4_K_M runner.inference="[{ID:0 Library:Metal}]" runner.size="5.1 GiB" runner.vram="2.6 GiB" runner.parallel=1 runner.pid=21139 runner.model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff runner.num_ctx=16384
time=2025-11-04T05:38:24.021-06:00 level=DEBUG source=sched.go:602 msg="evaluating already loaded" model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff
time=2025-11-04T05:38:24.023-06:00 level=DEBUG source=server.go:1401 msg="completion request" images=0 prompt=1079 format=""
time=2025-11-04T05:38:24.024-06:00 level=DEBUG source=server.go:1401 msg="completion request" images=0 prompt=1079 format=""
time=2025-11-04T05:38:24.024-06:00 level=DEBUG source=server.go:1401 msg="completion request" images=0 prompt=1023 format=""
time=2025-11-04T05:38:24.024-06:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=229 used=0 remaining=229
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_add_fuse_1', name = 'kernel_add_fuse_1'
ggml_metal_library_compile_pipeline: loaded kernel_add_fuse_1                             0x145e07be0 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_rms_norm_mul_f32_4', name = 'kernel_rms_norm_mul_f32_4'
ggml_metal_library_compile_pipeline: loaded kernel_rms_norm_mul_f32_4                     0x145e08510 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mm_q4_K_f32', name = 'kernel_mul_mm_q4_K_f32_bci=0_bco=1'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mm_q4_K_f32_bci=0_bco=1            0x145e08f20 | th_max =  896 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_rope_norm_f32', name = 'kernel_rope_norm_f32'
ggml_metal_library_compile_pipeline: loaded kernel_rope_norm_f32                          0x145e09180 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_set_rows_f16_i64', name = 'kernel_set_rows_f16_i64'
ggml_metal_library_compile_pipeline: loaded kernel_set_rows_f16_i64                       0x145e093e0 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mm_f16_f32', name = 'kernel_mul_mm_f16_f32_bci=0_bco=1'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mm_f16_f32_bci=0_bco=1             0x145e0a200 | th_max =  896 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_soft_max_f32_4', name = 'kernel_soft_max_f32_4'
ggml_metal_library_compile_pipeline: loaded kernel_soft_max_f32_4                         0x145e098e0 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_cpy_f32_f32', name = 'kernel_cpy_f32_f32'
ggml_metal_library_compile_pipeline: loaded kernel_cpy_f32_f32                            0x145e0a610 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_swiglu_f32', name = 'kernel_swiglu_f32'
ggml_metal_library_compile_pipeline: loaded kernel_swiglu_f32                             0x145e0ac80 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mm_q6_K_f32', name = 'kernel_mul_mm_q6_K_f32_bci=0_bco=1'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mm_q6_K_f32_bci=0_bco=1            0x145e0bd30 | th_max =  896 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_get_rows_f32', name = 'kernel_get_rows_f32'
ggml_metal_library_compile_pipeline: loaded kernel_get_rows_f32                           0x129b09ae0 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_add_row_c4_fuse_1', name = 'kernel_add_row_c4_fuse_1'
ggml_metal_library_compile_pipeline: loaded kernel_add_row_c4_fuse_1                      0x129b0a110 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mv_q4_K_f32', name = 'kernel_mul_mv_q4_K_f32_nsg=2'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mv_q4_K_f32_nsg=2                  0x129b0b2d0 | th_max =  768 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mv_q6_K_f32', name = 'kernel_mul_mv_q6_K_f32_nsg=2'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mv_q6_K_f32_nsg=2                  0x129b0b880 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_rms_norm_f32_4', name = 'kernel_rms_norm_f32_4'
ggml_metal_library_compile_pipeline: loaded kernel_rms_norm_f32_4                         0x129b0a9a0 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mv_f16_f32_4', name = 'kernel_mul_mv_f16_f32_4_nsg=1'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mv_f16_f32_4_nsg=1                 0x145e0d210 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mv_f16_f32_4', name = 'kernel_mul_mv_f16_f32_4_nsg=2'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mv_f16_f32_4_nsg=2                 0x145e0ca90 | th_max = 1024 | th_width =   32
[GIN] 2025/11/04 - 05:38:25 | 200 |    51.105175s |  100.116.113.16 | POST     "/v1/chat/completions"
time=2025-11-04T05:38:25.848-06:00 level=DEBUG source=sched.go:513 msg="context for request finished"
time=2025-11-04T05:38:25.848-06:00 level=DEBUG source=sched.go:301 msg="after processing request finished event" runner.name=registry.ollama.ai/library/llama3.2:3b-instruct-q4_K_M runner.inference="[{ID:0 Library:Metal}]" runner.size="5.1 GiB" runner.vram="2.6 GiB" runner.parallel=1 runner.pid=21139 runner.model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff runner.num_ctx=16384 refCount=2
time=2025-11-04T05:38:25.849-06:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=233 prompt=229 used=228 remaining=1
[GIN] 2025/11/04 - 05:38:26 | 200 |  3.073442708s |  100.116.113.16 | POST     "/v1/chat/completions"
time=2025-11-04T05:38:26.050-06:00 level=DEBUG source=sched.go:378 msg="context for request finished" runner.name=registry.ollama.ai/library/llama3.2:3b-instruct-q4_K_M runner.inference="[{ID:0 Library:Metal}]" runner.size="5.1 GiB" runner.vram="2.6 GiB" runner.parallel=1 runner.pid=21139 runner.model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff runner.num_ctx=16384
time=2025-11-04T05:38:26.050-06:00 level=DEBUG source=sched.go:301 msg="after processing request finished event" runner.name=registry.ollama.ai/library/llama3.2:3b-instruct-q4_K_M runner.inference="[{ID:0 Library:Metal}]" runner.size="5.1 GiB" runner.vram="2.6 GiB" runner.parallel=1 runner.pid=21139 runner.model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff runner.num_ctx=16384 refCount=1
time=2025-11-04T05:38:26.051-06:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=234 prompt=220 used=208 remaining=12
[GIN] 2025/11/04 - 05:38:26 | 200 | 51.766675292s |  100.116.113.16 | POST     "/v1/chat/completions"
time=2025-11-04T05:38:26.503-06:00 level=DEBUG source=sched.go:378 msg="context for request finished" runner.name=registry.ollama.ai/library/llama3.2:3b-instruct-q4_K_M runner.inference="[{ID:0 Library:Metal}]" runner.size="5.1 GiB" runner.vram="2.6 GiB" runner.parallel=1 runner.pid=21139 runner.model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff runner.num_ctx=16384
time=2025-11-04T05:38:26.503-06:00 level=DEBUG source=sched.go:283 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/llama3.2:3b-instruct-q4_K_M runner.inference="[{ID:0 Library:Metal}]" runner.size="5.1 GiB" runner.vram="2.6 GiB" runner.parallel=1 runner.pid=21139 runner.model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff runner.num_ctx=16384 duration=2562047h47m16.854775807s
time=2025-11-04T05:38:26.503-06:00 level=DEBUG source=sched.go:301 msg="after processing request finished event" runner.name=registry.ollama.ai/library/llama3.2:3b-instruct-q4_K_M runner.inference="[{ID:0 Library:Metal}]" runner.size="5.1 GiB" runner.vram="2.6 GiB" runner.parallel=1 runner.pid=21139 runner.model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff runner.num_ctx=16384 refCount=0
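
A note on the DEBUG-level entries above and in the log below: they come from running the server with `OLLAMA_DEBUG=1`, which is what is being asked for in the next comment (the `msg=subprocess` line in that log shows the variable in the runner's environment). A minimal sketch of how to enable it on macOS, assuming the standard Ollama.app install; the `launchctl` route is the usual way to pass environment variables to the app, and running the server from a terminal also works:

```sh
# For the Ollama.app menu-bar server: set the variable for
# launchd-spawned apps, then quit and relaunch Ollama.
launchctl setenv OLLAMA_DEBUG 1

# Or run the server directly in a terminal with debug logging:
OLLAMA_DEBUG=1 ollama serve
```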

<!-- gh-comment-id:3485632216 -->
@pjv commented on GitHub (Nov 4, 2025):

@jessegross

> Can you please post the logs with this problem with OLLAMA_DEBUG=1 set?

<details>
<summary>0.12.6 log</summary>

```log
time=2025-11-04T05:29:21.064-06:00 level=DEBUG source=sched.go:123 msg="starting llm scheduler"
time=2025-11-04T05:29:21.064-06:00 level=INFO source=runner.go:80 msg="discovering available GPUs..."
time=2025-11-04T05:29:21.064-06:00 level=DEBUG source=runner.go:448 msg="spawning runner with" OLLAMA_LIBRARY_PATH=[/Applications/Ollama.app/Contents/Resources] extra_envs=[]
time=2025-11-04T05:29:21.202-06:00 level=DEBUG source=runner.go:451 msg="bootstrap discovery took" duration=137.794375ms OLLAMA_LIBRARY_PATH=[/Applications/Ollama.app/Contents/Resources] extra_envs=[]
time=2025-11-04T05:29:21.202-06:00 level=DEBUG source=runner.go:118 msg="filtering out unsupported or overlapping GPU library combinations" count=1
time=2025-11-04T05:29:21.202-06:00 level=DEBUG source=runner.go:45 msg="GPU bootstrap discovery took" duration=138.08875ms
time=2025-11-04T05:29:21.202-06:00 level=INFO source=types.go:112 msg="inference compute" id=0 library=Metal compute=0.0 name=Metal description="Apple M2" libdirs="" driver=0.0 pci_id=00:00.0 type=discrete total="10.7 GiB" available="10.7 GiB"
time=2025-11-04T05:29:21.202-06:00 level=INFO source=routes.go:1605 msg="entering low vram mode" "total vram"="10.7 GiB" threshold="20.0 GiB"
[GIN] 2025/11/04 - 05:29:21 | 200 | 152.333µs | 127.0.0.1 | HEAD "/"
[GIN] 2025/11/04 - 05:29:21 | 200 | 325.583µs | 127.0.0.1 | GET "/api/ps"
time=2025-11-04T05:30:19.430-06:00 level=DEBUG source=runner.go:45 msg="overall device VRAM discovery took" duration=333ns
time=2025-11-04T05:30:19.430-06:00 level=DEBUG source=sched.go:195 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=3 gpu_count=1
time=2025-11-04T05:30:19.440-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.alignment default=32
time=2025-11-04T05:30:19.440-06:00 level=DEBUG source=sched.go:215 msg="loading first model" model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a
time=2025-11-04T05:30:19.471-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.alignment default=32
time=2025-11-04T05:30:19.471-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.pooling_type default=0
time=2025-11-04T05:30:19.471-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=tokenizer.ggml.add_eos_token default=false
time=2025-11-04T05:30:19.471-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-11-04T05:30:19.471-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.rope.scaling.type default=""
time=2025-11-04T05:30:19.471-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.rope.scaling.factor default=1
time=2025-11-04T05:30:19.471-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.rope.scaling.original_context_length default=0
time=2025-11-04T05:30:19.471-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.expert_count default=0
time=2025-11-04T05:30:19.471-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.expert_used_count default=0
time=2025-11-04T05:30:19.471-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.norm_top_k_prob default=true
time=2025-11-04T05:30:19.471-06:00 level=INFO source=server.go:216 msg="enabling flash attention"
time=2025-11-04T05:30:19.472-06:00 level=DEBUG source=server.go:331 msg="adding gpu dependency paths" paths=[/Applications/Ollama.app/Contents/Resources]
time=2025-11-04T05:30:19.472-06:00 level=INFO source=server.go:400 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --ollama-engine --model /Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a --port 52409"
time=2025-11-04T05:30:19.472-06:00 level=DEBUG source=server.go:401 msg=subprocess OLLAMA_DEBUG=1 OLLAMA_HOST=0.0.0.0 OLLAMA_KEEP_ALIVE=-1 OLLAMA_MODELS=/Users/pjv/.ollama/models PATH="/Users/pjv/.local/bin:/Users/pjv/go/bin:/opt/homebrew/sbin:/opt/homebrew/bin:/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin" OLLAMA_CONTEXT_LENGTH=16384 OLLAMA_MAX_LOADED_MODELS=3 DYLD_LIBRARY_PATH=/Applications/Ollama.app/Contents/Resources:/Applications/Ollama.app/Contents/Resources:/Applications/Ollama.app/Contents/Resources OLLAMA_LIBRARY_PATH=/Applications/Ollama.app/Contents/Resources:/Applications/Ollama.app/Contents/Resources
time=2025-11-04T05:30:19.473-06:00 level=INFO source=server.go:676 msg="loading model" "model layers"=37 requested=-1
time=2025-11-04T05:30:19.473-06:00 level=INFO source=server.go:682 msg="system memory" total="16.0 GiB" free="6.7 GiB" free_swap="0 B"
time=2025-11-04T05:30:19.473-06:00 level=INFO source=server.go:690 msg="gpu memory" id=0 library=Metal available="10.7 GiB" free="10.7 GiB" minimum="0 B" overhead="0 B"
time=2025-11-04T05:30:19.481-06:00 level=INFO source=runner.go:1332 msg="starting ollama engine"
time=2025-11-04T05:30:19.481-06:00 level=INFO source=runner.go:1367 msg="Server listening on 127.0.0.1:52409"
time=2025-11-04T05:30:19.484-06:00 level=INFO source=runner.go:1205 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:28000 KvCacheType: NumThreads:4 GPULayers:37[ID:0 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-11-04T05:30:19.499-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.alignment default=32
time=2025-11-04T05:30:19.499-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.description default=""
time=2025-11-04T05:30:19.499-06:00 level=INFO source=ggml.go:134 msg="" architecture=qwen3 file_type=Q8_0 name="Qwen3 4B Instruct 2507" description="" num_tensors=398 num_key_values=33
time=2025-11-04T05:30:19.499-06:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/Applications/Ollama.app/Contents/Resources
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.006 sec
ggml_metal_device_init: GPU name: Apple M2
ggml_metal_device_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 11453.25 MB
time=2025-11-04T05:30:19.500-06:00 level=INFO source=ggml.go:104 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 compiler=cgo(clang)
ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: use bfloat = true
ggml_metal_init: use fusion = true
ggml_metal_init: use concurrency = true
ggml_metal_init: use graph optimize = true
time=2025-11-04T05:30:19.584-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.pooling_type default=0
time=2025-11-04T05:30:19.584-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=tokenizer.ggml.add_eos_token default=false
time=2025-11-04T05:30:19.584-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-11-04T05:30:19.584-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.rope.scaling.type default=""
time=2025-11-04T05:30:19.584-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.rope.scaling.factor default=1
time=2025-11-04T05:30:19.584-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.rope.scaling.original_context_length default=0
time=2025-11-04T05:30:19.584-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.expert_count default=0
time=2025-11-04T05:30:19.584-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.expert_used_count default=0
time=2025-11-04T05:30:19.584-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.norm_top_k_prob default=true
time=2025-11-04T05:30:19.586-06:00 level=DEBUG source=ggml.go:837 msg="compute graph" nodes=1230 splits=2
time=2025-11-04T05:30:19.586-06:00 level=DEBUG source=device.go:206 msg="model weights" device=Metal size="4.0 GiB"
time=2025-11-04T05:30:19.586-06:00 level=DEBUG source=device.go:211 msg="model weights" device=CPU size="394.1 MiB"
time=2025-11-04T05:30:19.586-06:00 level=DEBUG source=device.go:217 msg="kv cache" device=Metal size="3.9 GiB"
time=2025-11-04T05:30:19.586-06:00 level=DEBUG source=device.go:228 msg="compute graph" device=Metal size="102.5 MiB"
time=2025-11-04T05:30:19.586-06:00 level=DEBUG source=device.go:233 msg="compute graph" device=CPU size="5.0 MiB"
time=2025-11-04T05:30:19.586-06:00 level=DEBUG source=device.go:238 msg="total memory" size="8.3 GiB"
time=2025-11-04T05:30:19.586-06:00 level=DEBUG source=server.go:721 msg=memory success=true required.InputWeights=413265920 required.CPU.Graph=5242880 required.Metal.ID=0 required.Metal.Weights="[107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 413276160]" required.Metal.Cache="[115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 0]" required.Metal.Graph=107509280
time=2025-11-04T05:30:19.586-06:00 level=DEBUG source=server.go:915 msg="available gpu" id=0 library=Metal "available layer vram"="10.6 GiB" backoff=0.00 minimum="0 B" overhead="0 B" graph="102.5 MiB"
time=2025-11-04T05:30:19.586-06:00 level=DEBUG source=server.go:732 msg="new layout created" layers="37[ID:0 Layers:37(0..36)]"
time=2025-11-04T05:30:19.586-06:00 level=INFO source=runner.go:1205 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:28000 KvCacheType: NumThreads:4 GPULayers:37[ID:0 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-11-04T05:30:19.600-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.alignment default=32
time=2025-11-04T05:30:19.602-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.pooling_type default=0
time=2025-11-04T05:30:19.602-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=tokenizer.ggml.add_eos_token default=false
time=2025-11-04T05:30:19.602-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-11-04T05:30:19.602-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.rope.scaling.type default=""
time=2025-11-04T05:30:19.602-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.rope.scaling.factor default=1
time=2025-11-04T05:30:19.602-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.rope.scaling.original_context_length default=0
time=2025-11-04T05:30:19.602-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.expert_count default=0
time=2025-11-04T05:30:19.602-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.expert_used_count default=0
time=2025-11-04T05:30:19.602-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.norm_top_k_prob default=true
time=2025-11-04T05:30:19.886-06:00 level=DEBUG source=ggml.go:837 msg="compute graph" nodes=1230 splits=2
time=2025-11-04T05:30:19.886-06:00 level=DEBUG source=device.go:206 msg="model weights" device=Metal size="4.0 GiB"
time=2025-11-04T05:30:19.886-06:00 level=DEBUG source=device.go:211 msg="model weights" device=CPU size="394.1 MiB"
time=2025-11-04T05:30:19.886-06:00 level=DEBUG source=device.go:217 msg="kv cache" device=Metal size="3.9 GiB"
time=2025-11-04T05:30:19.886-06:00 level=DEBUG source=device.go:228 msg="compute graph" device=Metal size="102.5 MiB"
time=2025-11-04T05:30:19.886-06:00 level=DEBUG source=device.go:233 msg="compute graph" device=CPU size="5.0 MiB"
time=2025-11-04T05:30:19.886-06:00 level=DEBUG source=device.go:238 msg="total memory" size="8.3 GiB"
time=2025-11-04T05:30:19.886-06:00 level=DEBUG source=server.go:721 msg=memory success=true required.InputWeights=413265920 required.CPU.Graph=5242880 required.Metal.ID=0 required.Metal.Weights="[107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 413276160]" required.Metal.Cache="[115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 0]" required.Metal.Graph=107509280
time=2025-11-04T05:30:19.886-06:00 level=DEBUG source=server.go:915 msg="available gpu" id=0 library=Metal "available layer vram"="10.6 GiB" backoff=0.00 minimum="0 B" overhead="0 B" graph="102.5 MiB"
time=2025-11-04T05:30:19.886-06:00 level=DEBUG source=server.go:732 msg="new layout created" layers="37[ID:0 Layers:37(0..36)]"
time=2025-11-04T05:30:19.886-06:00 level=INFO source=runner.go:1205 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:28000 KvCacheType: NumThreads:4 GPULayers:37[ID:0 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-11-04T05:30:19.886-06:00 level=INFO source=ggml.go:480 msg="offloading 36 repeating layers to GPU"
time=2025-11-04T05:30:19.886-06:00 level=INFO source=ggml.go:487 msg="offloading output layer to GPU"
time=2025-11-04T05:30:19.886-06:00 level=INFO source=ggml.go:492 msg="offloaded 37/37 layers to GPU"
time=2025-11-04T05:30:19.886-06:00 level=INFO source=device.go:206 msg="model weights" device=Metal size="4.0 GiB"
time=2025-11-04T05:30:19.886-06:00 level=INFO source=device.go:211 msg="model weights" device=CPU size="394.1 MiB"
time=2025-11-04T05:30:19.886-06:00 level=INFO source=device.go:217 msg="kv cache" device=Metal size="3.9 GiB"
time=2025-11-04T05:30:19.886-06:00 level=INFO source=device.go:228 msg="compute graph" device=Metal size="102.5 MiB"
time=2025-11-04T05:30:19.886-06:00 level=INFO source=device.go:233 msg="compute graph" device=CPU size="5.0 MiB"
time=2025-11-04T05:30:19.886-06:00 level=INFO source=device.go:238 msg="total memory" size="8.3 GiB"
time=2025-11-04T05:30:19.886-06:00 level=INFO source=sched.go:482 msg="loaded runners" count=1
time=2025-11-04T05:30:19.886-06:00 level=DEBUG source=runner.go:45 msg="overall device VRAM discovery took" duration=250ns
time=2025-11-04T05:30:19.886-06:00 level=INFO source=server.go:1272 msg="waiting for llama runner to start responding"
time=2025-11-04T05:30:19.887-06:00 level=INFO source=server.go:1306 msg="waiting for server to become available" status="llm server loading model"
time=2025-11-04T05:30:19.897-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.alignment default=32
time=2025-11-04T05:30:20.138-06:00 level=DEBUG source=server.go:1316 msg="model load progress 0.16"
time=2025-11-04T05:30:20.388-06:00 level=DEBUG source=server.go:1316 msg="model load progress 0.31"
time=2025-11-04T05:30:20.639-06:00 level=DEBUG source=server.go:1316 msg="model load progress 0.47"
time=2025-11-04T05:30:20.890-06:00 level=DEBUG source=server.go:1316 msg="model load progress 0.61"
time=2025-11-04T05:30:21.141-06:00 level=DEBUG source=server.go:1316 msg="model load progress 0.73"
time=2025-11-04T05:30:21.393-06:00 level=DEBUG source=server.go:1316 msg="model load progress 0.86"
time=2025-11-04T05:30:21.643-06:00 level=DEBUG source=server.go:1316 msg="model load progress 0.97"
time=2025-11-04T05:30:21.894-06:00 level=DEBUG source=server.go:1316 msg="model load progress 0.98"
time=2025-11-04T05:30:21.997-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found"
key=qwen3.pooling_type default=0 time=2025-11-04T05:30:22.147-06:00 level=INFO source=server.go:1310 msg="llama runner started in 2.67 seconds" time=2025-11-04T05:30:22.147-06:00 level=DEBUG source=sched.go:494 msg="finished setting up" runner.name=registry.ollama.ai/library/qwen3:4b-instruct-28k runner.inference="[{ID:0 Library:Metal}]" runner.size="8.3 GiB" runner.vram="8.3 GiB" runner.parallel=1 runner.pid=20963 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a runner.num_ctx=28000 time=2025-11-04T05:30:22.147-06:00 level=DEBUG source=sched.go:534 msg="gpu reported" gpu=0 library=Metal available="10.7 GiB" time=2025-11-04T05:30:22.147-06:00 level=INFO source=sched.go:545 msg="updated VRAM based on existing loaded models" gpu=0 library=Metal total="10.7 GiB" available="2.7 GiB" time=2025-11-04T05:30:22.231-06:00 level=DEBUG source=server.go:1422 msg="completion request" images=0 prompt=43962 format="" time=2025-11-04T05:30:22.257-06:00 level=DEBUG source=cache.go:142 msg="loading cache slot" id=0 cache=0 prompt=10012 used=0 remaining=10012 ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_rms_norm_mul_f32_4', name = 'kernel_rms_norm_mul_f32_4' ggml_metal_library_compile_pipeline: loaded kernel_rms_norm_mul_f32_4 0x142706f80 | th_max = 1024 | th_width = 32 ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mm_q8_0_f32', name = 'kernel_mul_mm_q8_0_f32_bci=0_bco=0' ggml_metal_library_compile_pipeline: loaded kernel_mul_mm_q8_0_f32_bci=0_bco=0 0x142708040 | th_max = 1024 | th_width = 32 ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_rope_neox_f32', name = 'kernel_rope_neox_f32' ggml_metal_library_compile_pipeline: loaded kernel_rope_neox_f32 0x142707710 | th_max = 1024 | th_width = 32 ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_cpy_f32_f16', name = 'kernel_cpy_f32_f16' ggml_metal_library_compile_pipeline: loaded kernel_cpy_f32_f16 0x1427087a0 | th_max = 1024 | th_width = 32 ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_flash_attn_ext_blk', name = 'kernel_flash_attn_ext_blk_nqptg=8_ncpsg=64' ggml_metal_library_compile_pipeline: loaded kernel_flash_attn_ext_blk_nqptg=8_ncpsg=64 0x142708e60 | th_max = 1024 | th_width = 32 ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_flash_attn_ext_f16_dk128_dv128', name = 'kernel_flash_attn_ext_f16_dk128_dv128_mask=1_sinks=0_bias=0_scap=0_kvpad=0_bcm=0_ns10=1024_ns20=1024_nsg=4' ggml_metal_library_compile_pipeline: loaded kernel_flash_attn_ext_f16_dk128_dv128_mask=1_sinks=0_bias=0_scap=0_kvpad=0_bcm=0_ns10=1024_ns20=1024_nsg=4 0x142709740 | th_max = 768 | th_width = 32 ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_add_fuse_1', name = 'kernel_add_fuse_1' ggml_metal_library_compile_pipeline: loaded kernel_add_fuse_1 0x1427092e0 | th_max = 1024 | th_width = 32 ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_swiglu_f32', name = 'kernel_swiglu_f32' ggml_metal_library_compile_pipeline: loaded kernel_swiglu_f32 0x142709cb0 | th_max = 1024 | th_width = 32 ggml_metal_library_init: using embedded metal library ggml_metal_library_init: loaded in 0.028 sec ggml_metal_device_init: GPU name: Apple M2 ggml_metal_device_init: GPU family: MTLGPUFamilyApple8 (1008) ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003) ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3 (5001) ggml_metal_device_init: 
simdgroup reduction = true ggml_metal_device_init: simdgroup matrix mul. = true ggml_metal_device_init: has unified memory = true ggml_metal_device_init: has bfloat = true ggml_metal_device_init: use residency sets = true ggml_metal_device_init: use shared buffers = true ggml_metal_device_init: recommendedMaxWorkingSetSize = 11453.25 MB llama_model_load_from_file_impl: using device Metal (Apple M2) (unknown id) - 10922 MiB free llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Llama 3.2 3B Instruct llama_model_loader: - kv 3: general.finetune str = Instruct llama_model_loader: - kv 4: general.basename str = Llama-3.2 llama_model_loader: - kv 5: general.size_label str = 3B llama_model_loader: - kv 6: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam... llama_model_loader: - kv 7: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ... llama_model_loader: - kv 8: llama.block_count u32 = 28 llama_model_loader: - kv 9: llama.context_length u32 = 131072 llama_model_loader: - kv 10: llama.embedding_length u32 = 3072 llama_model_loader: - kv 11: llama.feed_forward_length u32 = 8192 llama_model_loader: - kv 12: llama.attention.head_count u32 = 24 llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000 llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 16: llama.attention.key_length u32 = 128 llama_model_loader: - kv 17: llama.attention.value_length u32 = 128 llama_model_loader: - kv 18: general.file_type u32 = 15 llama_model_loader: - kv 19: llama.vocab_size u32 = 128256 llama_model_loader: - kv 20: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 22: tokenizer.ggml.pre str = llama-bpe llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 128009 llama_model_loader: - kv 28: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... 
llama_model_loader: - kv 29: general.quantization_version u32 = 2
llama_model_loader: - type f32: 58 tensors
llama_model_loader: - type q4_K: 168 tensors
llama_model_loader: - type q6_K: 29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 1.87 GiB (5.01 BPW)
init_tokenizer: initializing tokenizer for type 2
load: control token: 128254 '<|reserved_special_token_246|>' is not marked as EOG
load: control token: 128252 '<|reserved_special_token_244|>' is not marked as EOG
load: control token: 128251 '<|reserved_special_token_243|>' is not marked as EOG
load: control token: 128250 '<|reserved_special_token_242|>' is not marked as EOG
load: control token: 128248 '<|reserved_special_token_240|>' is not marked as EOG
load: control token: 128247 '<|reserved_special_token_239|>' is not marked as EOG
load: control token: 128245 '<|reserved_special_token_237|>' is not marked as EOG
load: control token: 128244 '<|reserved_special_token_236|>' is not marked as EOG
load: control token: 128243 '<|reserved_special_token_235|>' is not marked as EOG
load: control token: 128240 '<|reserved_special_token_232|>' is not marked as EOG
load: control token: 128238 '<|reserved_special_token_230|>' is not marked as EOG
load: control token: 128235 '<|reserved_special_token_227|>' is not marked as EOG
load: control token: 128234 '<|reserved_special_token_226|>' is not marked as EOG
load: control token: 128229 '<|reserved_special_token_221|>' is not marked as EOG
load: control token: 128227 '<|reserved_special_token_219|>' is not marked as EOG
load: control token: 128226 '<|reserved_special_token_218|>' is not marked as EOG
load: control token: 128224 '<|reserved_special_token_216|>' is not marked as EOG
load: control token: 128223 '<|reserved_special_token_215|>' is not marked as EOG
load: control token: 128221 '<|reserved_special_token_213|>' is not marked as EOG
load: control token: 128219 '<|reserved_special_token_211|>' is not marked as EOG
load: control token: 128218 '<|reserved_special_token_210|>' is not marked as EOG
load: control token: 128217 '<|reserved_special_token_209|>' is not marked as EOG
load: control token: 128216 '<|reserved_special_token_208|>' is not marked as EOG
load: control token: 128215 '<|reserved_special_token_207|>' is not marked as EOG
load: control token: 128213 '<|reserved_special_token_205|>' is not marked as EOG
load: control token: 128211 '<|reserved_special_token_203|>' is not marked as EOG
load: control token: 128210 '<|reserved_special_token_202|>' is not marked as EOG
load: control token: 128209 '<|reserved_special_token_201|>' is not marked as EOG
load: control token: 128208 '<|reserved_special_token_200|>' is not marked as EOG
load: control token: 128207 '<|reserved_special_token_199|>' is not marked as EOG
load: control token: 128204 '<|reserved_special_token_196|>' is not marked as EOG
load: control token: 128202 '<|reserved_special_token_194|>' is not marked as EOG
load: control token: 128197 '<|reserved_special_token_189|>' is not marked as EOG
load: control token: 128195 '<|reserved_special_token_187|>' is not marked as EOG
load: control token: 128194 '<|reserved_special_token_186|>' is not marked as EOG
load: control token: 128191 '<|reserved_special_token_183|>' is not marked as EOG
load: control token: 128190 '<|reserved_special_token_182|>' is not marked as EOG
load: control token: 128188 '<|reserved_special_token_180|>' is not marked as EOG
load: control token: 128187 '<|reserved_special_token_179|>' is not marked as EOG
load: control token: 128185 '<|reserved_special_token_177|>' is not marked as EOG
load: control token: 128184 '<|reserved_special_token_176|>' is not marked as EOG
load: control token: 128183 '<|reserved_special_token_175|>' is not marked as EOG
load: control token: 128178 '<|reserved_special_token_170|>' is not marked as EOG
load: control token: 128177 '<|reserved_special_token_169|>' is not marked as EOG
load: control token: 128176 '<|reserved_special_token_168|>' is not marked as EOG
load: control token: 128175 '<|reserved_special_token_167|>' is not marked as EOG
load: control token: 128174 '<|reserved_special_token_166|>' is not marked as EOG
load: control token: 128173 '<|reserved_special_token_165|>' is not marked as EOG
load: control token: 128172 '<|reserved_special_token_164|>' is not marked as EOG
load: control token: 128169 '<|reserved_special_token_161|>' is not marked as EOG
load: control token: 128167 '<|reserved_special_token_159|>' is not marked as EOG
load: control token: 128166 '<|reserved_special_token_158|>' is not marked as EOG
load: control token: 128160 '<|reserved_special_token_152|>' is not marked as EOG
load: control token: 128159 '<|reserved_special_token_151|>' is not marked as EOG
load: control token: 128157 '<|reserved_special_token_149|>' is not marked as EOG
load: control token: 128156 '<|reserved_special_token_148|>' is not marked as EOG
load: control token: 128154 '<|reserved_special_token_146|>' is not marked as EOG
load: control token: 128152 '<|reserved_special_token_144|>' is not marked as EOG
load: control token: 128151 '<|reserved_special_token_143|>' is not marked as EOG
load: control token: 128150 '<|reserved_special_token_142|>' is not marked as EOG
load: control token: 128147 '<|reserved_special_token_139|>' is not marked as EOG
load: control token: 128144 '<|reserved_special_token_136|>' is not marked as EOG
load: control token: 128142 '<|reserved_special_token_134|>' is not marked as EOG
load: control token: 128141 '<|reserved_special_token_133|>' is not marked as EOG
load: control token: 128140 '<|reserved_special_token_132|>' is not marked as EOG
load: control token: 128133 '<|reserved_special_token_125|>' is not marked as EOG
load: control token: 128130 '<|reserved_special_token_122|>' is not marked as EOG
load: control token: 128128 '<|reserved_special_token_120|>' is not marked as EOG
load: control token: 128127 '<|reserved_special_token_119|>' is not marked as EOG
load: control token: 128126 '<|reserved_special_token_118|>' is not marked as EOG
load: control token: 128125 '<|reserved_special_token_117|>' is not marked as EOG
load: control token: 128124 '<|reserved_special_token_116|>' is not marked as EOG
load: control token: 128123 '<|reserved_special_token_115|>' is not marked as EOG
load: control token: 128122 '<|reserved_special_token_114|>' is not marked as EOG
load: control token: 128121 '<|reserved_special_token_113|>' is not marked as EOG
load: control token: 128120 '<|reserved_special_token_112|>' is not marked as EOG
load: control token: 128119 '<|reserved_special_token_111|>' is not marked as EOG
load: control token: 128116 '<|reserved_special_token_108|>' is not marked as EOG
load: control token: 128115 '<|reserved_special_token_107|>' is not marked as EOG
load: control token: 128114 '<|reserved_special_token_106|>' is not marked as EOG
load: control token: 128113 '<|reserved_special_token_105|>' is not marked as EOG
load: control token: 128111 '<|reserved_special_token_103|>' is not marked as EOG
load: control token: 128110 '<|reserved_special_token_102|>' is not marked as EOG
load: control token: 128107 '<|reserved_special_token_99|>' is not marked as EOG
load: control token: 128106 '<|reserved_special_token_98|>' is not marked as EOG
load: control token: 128105 '<|reserved_special_token_97|>' is not marked as EOG
load: control token: 128104 '<|reserved_special_token_96|>' is not marked as EOG
load: control token: 128103 '<|reserved_special_token_95|>' is not marked as EOG
load: control token: 128100 '<|reserved_special_token_92|>' is not marked as EOG
load: control token: 128097 '<|reserved_special_token_89|>' is not marked as EOG
load: control token: 128096 '<|reserved_special_token_88|>' is not marked as EOG
load: control token: 128094 '<|reserved_special_token_86|>' is not marked as EOG
load: control token: 128093 '<|reserved_special_token_85|>' is not marked as EOG
load: control token: 128090 '<|reserved_special_token_82|>' is not marked as EOG
load: control token: 128089 '<|reserved_special_token_81|>' is not marked as EOG
load: control token: 128087 '<|reserved_special_token_79|>' is not marked as EOG
load: control token: 128085 '<|reserved_special_token_77|>' is not marked as EOG
load: control token: 128080 '<|reserved_special_token_72|>' is not marked as EOG
load: control token: 128077 '<|reserved_special_token_69|>' is not marked as EOG
load: control token: 128076 '<|reserved_special_token_68|>' is not marked as EOG
load: control token: 128073 '<|reserved_special_token_65|>' is not marked as EOG
load: control token: 128070 '<|reserved_special_token_62|>' is not marked as EOG
load: control token: 128069 '<|reserved_special_token_61|>' is not marked as EOG
load: control token: 128067 '<|reserved_special_token_59|>' is not marked as EOG
load: control token: 128064 '<|reserved_special_token_56|>' is not marked as EOG
load: control token: 128062 '<|reserved_special_token_54|>' is not marked as EOG
load: control token: 128061 '<|reserved_special_token_53|>' is not marked as EOG
load: control token: 128060 '<|reserved_special_token_52|>' is not marked as EOG
load: control token: 128054 '<|reserved_special_token_46|>' is not marked as EOG
load: control token: 128045 '<|reserved_special_token_37|>' is not marked as EOG
load: control token: 128044 '<|reserved_special_token_36|>' is not marked as EOG
load: control token: 128043 '<|reserved_special_token_35|>' is not marked as EOG
load: control token: 128042 '<|reserved_special_token_34|>' is not marked as EOG
load: control token: 128038 '<|reserved_special_token_30|>' is not marked as EOG
load: control token: 128037 '<|reserved_special_token_29|>' is not marked as EOG
load: control token: 128035 '<|reserved_special_token_27|>' is not marked as EOG
load: control token: 128034 '<|reserved_special_token_26|>' is not marked as EOG
load: control token: 128033 '<|reserved_special_token_25|>' is not marked as EOG
load: control token: 128032 '<|reserved_special_token_24|>' is not marked as EOG
load: control token: 128030 '<|reserved_special_token_22|>' is not marked as EOG
load: control token: 128029 '<|reserved_special_token_21|>' is not marked as EOG
load: control token: 128028 '<|reserved_special_token_20|>' is not marked as EOG
load: control token: 128026 '<|reserved_special_token_18|>' is not marked as EOG
load: control token: 128025 '<|reserved_special_token_17|>' is not marked as EOG
load: control token: 128024 '<|reserved_special_token_16|>' is not marked as EOG
load: control token: 128022 '<|reserved_special_token_14|>' is not marked as EOG
load: control token: 128020 '<|reserved_special_token_12|>' is not marked as EOG
load: control token: 128017 '<|reserved_special_token_9|>' is not marked as EOG
load: control token: 128016 '<|reserved_special_token_8|>' is not marked as EOG
load: control token: 128015 '<|reserved_special_token_7|>' is not marked as EOG
load: control token: 128014 '<|reserved_special_token_6|>' is not marked as EOG
load: control token: 128013 '<|reserved_special_token_5|>' is not marked as EOG
load: control token: 128011 '<|reserved_special_token_3|>' is not marked as EOG
load: control token: 128010 '<|python_tag|>' is not marked as EOG
load: control token: 128006 '<|start_header_id|>' is not marked as EOG
load: control token: 128003 '<|reserved_special_token_1|>' is not marked as EOG
load: control token: 128002 '<|reserved_special_token_0|>' is not marked as EOG
load: control token: 128000 '<|begin_of_text|>' is not marked as EOG
load: control token: 128041 '<|reserved_special_token_33|>' is not marked as EOG
load: control token: 128063 '<|reserved_special_token_55|>' is not marked as EOG
load: control token: 128046 '<|reserved_special_token_38|>' is not marked as EOG
load: control token: 128007 '<|end_header_id|>' is not marked as EOG
load: control token: 128065 '<|reserved_special_token_57|>' is not marked as EOG
load: control token: 128171 '<|reserved_special_token_163|>' is not marked as EOG
load: control token: 128162 '<|reserved_special_token_154|>' is not marked as EOG
load: control token: 128165 '<|reserved_special_token_157|>' is not marked as EOG
load: control token: 128057 '<|reserved_special_token_49|>' is not marked as EOG
load: control token: 128050 '<|reserved_special_token_42|>' is not marked as EOG
load: control token: 128056 '<|reserved_special_token_48|>' is not marked as EOG
load: control token: 128230 '<|reserved_special_token_222|>' is not marked as EOG
load: control token: 128098 '<|reserved_special_token_90|>' is not marked as EOG
load: control token: 128153 '<|reserved_special_token_145|>' is not marked as EOG
load: control token: 128084 '<|reserved_special_token_76|>' is not marked as EOG
load: control token: 128082 '<|reserved_special_token_74|>' is not marked as EOG
load: control token: 128102 '<|reserved_special_token_94|>' is not marked as EOG
load: control token: 128253 '<|reserved_special_token_245|>' is not marked as EOG
load: control token: 128179 '<|reserved_special_token_171|>' is not marked as EOG
load: control token: 128071 '<|reserved_special_token_63|>' is not marked as EOG
load: control token: 128135 '<|reserved_special_token_127|>' is not marked as EOG
load: control token: 128161 '<|reserved_special_token_153|>' is not marked as EOG
load: control token: 128164 '<|reserved_special_token_156|>' is not marked as EOG
load: control token: 128134 '<|reserved_special_token_126|>' is not marked as EOG
load: control token: 128249 '<|reserved_special_token_241|>' is not marked as EOG
load: control token: 128004 '<|finetune_right_pad_id|>' is not marked as EOG
load: control token: 128036 '<|reserved_special_token_28|>' is not marked as EOG
load: control token: 128148 '<|reserved_special_token_140|>' is not marked as EOG
load: control token: 128181 '<|reserved_special_token_173|>' is not marked as EOG
load: control token: 128222 '<|reserved_special_token_214|>' is not marked as EOG
load: control token: 128075 '<|reserved_special_token_67|>' is not marked as EOG
load: control token: 128241 '<|reserved_special_token_233|>' is not marked as EOG
load: control token: 128051 '<|reserved_special_token_43|>' is not marked as EOG
load: control token: 128068 '<|reserved_special_token_60|>' is not marked as EOG
load: control token: 128149 '<|reserved_special_token_141|>' is not marked as EOG
load: control token: 128201 '<|reserved_special_token_193|>' is not marked as EOG
load: control token: 128058 '<|reserved_special_token_50|>' is not marked as EOG
load: control token: 128146 '<|reserved_special_token_138|>' is not marked as EOG
load: control token: 128143 '<|reserved_special_token_135|>' is not marked as EOG
load: control token: 128023 '<|reserved_special_token_15|>' is not marked as EOG
load: control token: 128039 '<|reserved_special_token_31|>' is not marked as EOG
load: control token: 128132 '<|reserved_special_token_124|>' is not marked as EOG
load: control token: 128101 '<|reserved_special_token_93|>' is not marked as EOG
load: control token: 128212 '<|reserved_special_token_204|>' is not marked as EOG
load: control token: 128189 '<|reserved_special_token_181|>' is not marked as EOG
load: control token: 128225 '<|reserved_special_token_217|>' is not marked as EOG
load: control token: 128129 '<|reserved_special_token_121|>' is not marked as EOG
load: control token: 128005 '<|reserved_special_token_2|>' is not marked as EOG
load: control token: 128078 '<|reserved_special_token_70|>' is not marked as EOG
load: control token: 128163 '<|reserved_special_token_155|>' is not marked as EOG
load: control token: 128072 '<|reserved_special_token_64|>' is not marked as EOG
load: control token: 128112 '<|reserved_special_token_104|>' is not marked as EOG
load: control token: 128186 '<|reserved_special_token_178|>' is not marked as EOG
load: control token: 128095 '<|reserved_special_token_87|>' is not marked as EOG
load: control token: 128109 '<|reserved_special_token_101|>' is not marked as EOG
load: control token: 128099 '<|reserved_special_token_91|>' is not marked as EOG
load: control token: 128138 '<|reserved_special_token_130|>' is not marked as EOG
load: control token: 128193 '<|reserved_special_token_185|>' is not marked as EOG
load: control token: 128199 '<|reserved_special_token_191|>' is not marked as EOG
load: control token: 128048 '<|reserved_special_token_40|>' is not marked as EOG
load: control token: 128088 '<|reserved_special_token_80|>' is not marked as EOG
load: control token: 128192 '<|reserved_special_token_184|>' is not marked as EOG
load: control token: 128136 '<|reserved_special_token_128|>' is not marked as EOG
load: control token: 128092 '<|reserved_special_token_84|>' is not marked as EOG
load: control token: 128158 '<|reserved_special_token_150|>' is not marked as EOG
load: control token: 128049 '<|reserved_special_token_41|>' is not marked as EOG
load: control token: 128031 '<|reserved_special_token_23|>' is not marked as EOG
load: control token: 128255 '<|reserved_special_token_247|>' is not marked as EOG
load: control token: 128182 '<|reserved_special_token_174|>' is not marked as EOG
load: control token: 128066 '<|reserved_special_token_58|>' is not marked as EOG
load: control token: 128180 '<|reserved_special_token_172|>' is not marked as EOG
load: control token: 128233 '<|reserved_special_token_225|>' is not marked as EOG
load: control token: 128079 '<|reserved_special_token_71|>' is not marked as EOG
load: control token: 128081 '<|reserved_special_token_73|>' is not marked as EOG
load: control token: 128231 '<|reserved_special_token_223|>' is not marked as EOG
load: control token: 128196 '<|reserved_special_token_188|>' is not marked as EOG
load: control token: 128047 '<|reserved_special_token_39|>' is not marked as EOG
load: control token: 128083 '<|reserved_special_token_75|>' is not marked as EOG
load: control token: 128139 '<|reserved_special_token_131|>' is not marked as EOG
load: control token: 128131 '<|reserved_special_token_123|>' is not marked as EOG
load: control token: 128118 '<|reserved_special_token_110|>' is not marked as EOG
load: control token: 128053 '<|reserved_special_token_45|>' is not marked as EOG
load: control token: 128220 '<|reserved_special_token_212|>' is not marked as EOG
load: control token: 128108 '<|reserved_special_token_100|>' is not marked as EOG
load: control token: 128091 '<|reserved_special_token_83|>' is not marked as EOG
load: control token: 128203 '<|reserved_special_token_195|>' is not marked as EOG
load: control token: 128059 '<|reserved_special_token_51|>' is not marked as EOG
load: control token: 128019 '<|reserved_special_token_11|>' is not marked as EOG
load: control token: 128170 '<|reserved_special_token_162|>' is not marked as EOG
load: control token: 128205 '<|reserved_special_token_197|>' is not marked as EOG
load: control token: 128040 '<|reserved_special_token_32|>' is not marked as EOG
load: control token: 128200 '<|reserved_special_token_192|>' is not marked as EOG
load: control token: 128236 '<|reserved_special_token_228|>' is not marked as EOG
load: control token: 128145 '<|reserved_special_token_137|>' is not marked as EOG
load: control token: 128168 '<|reserved_special_token_160|>' is not marked as EOG
load: control token: 128214 '<|reserved_special_token_206|>' is not marked as EOG
load: control token: 128137 '<|reserved_special_token_129|>' is not marked as EOG
load: control token: 128232 '<|reserved_special_token_224|>' is not marked as EOG
load: control token: 128239 '<|reserved_special_token_231|>' is not marked as EOG
load: control token: 128055 '<|reserved_special_token_47|>' is not marked as EOG
load: control token: 128228 '<|reserved_special_token_220|>' is not marked as EOG
load: control token: 128206 '<|reserved_special_token_198|>' is not marked as EOG
load: control token: 128018 '<|reserved_special_token_10|>' is not marked as EOG
load: control token: 128012 '<|reserved_special_token_4|>' is not marked as EOG
load: control token: 128198 '<|reserved_special_token_190|>' is not marked as EOG
load: control token: 128021 '<|reserved_special_token_13|>' is not marked as EOG
load: control token: 128086 '<|reserved_special_token_78|>' is not marked as EOG
load: control token: 128074 '<|reserved_special_token_66|>' is not marked as EOG
load: control token: 128027 '<|reserved_special_token_19|>' is not marked as EOG
load: control token: 128242 '<|reserved_special_token_234|>' is not marked as EOG
load: control token: 128155 '<|reserved_special_token_147|>' is not marked as EOG
load: control token: 128052 '<|reserved_special_token_44|>' is not marked as EOG
load: control token: 128246 '<|reserved_special_token_238|>' is not marked as EOG
load: control token: 128117 '<|reserved_special_token_109|>' is not marked as EOG
load: control token: 128237 '<|reserved_special_token_229|>' is not marked as EOG
load: printing all EOG tokens:
load: - 128001 ('<|end_of_text|>')
load: - 128008 ('<|eom_id|>')
load: - 128009 ('<|eot_id|>')
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch = llama
print_info: vocab_only = 1
print_info: model type = ?B
print_info: model params = 3.21 B
print_info: general.name = Llama 3.2 3B Instruct
print_info: vocab type = BPE
print_info: n_vocab = 128256
print_info: n_merges = 280147
print_info: BOS token = 128000 '<|begin_of_text|>'
print_info: EOS token = 128009 '<|eot_id|>'
print_info: EOT token = 128001 '<|end_of_text|>'
print_info: EOM token = 128008 '<|eom_id|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 128001 '<|end_of_text|>'
print_info: EOG token = 128008 '<|eom_id|>'
print_info: EOG token = 128009 '<|eot_id|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-11-04T05:30:22.969-06:00 level=DEBUG source=server.go:331 msg="adding gpu dependency paths" paths=[/Applications/Ollama.app/Contents/Resources]
time=2025-11-04T05:30:22.969-06:00 level=INFO source=server.go:400 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --model /Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --port 52412"
time=2025-11-04T05:30:22.969-06:00 level=DEBUG source=server.go:401 msg=subprocess OLLAMA_DEBUG=1 OLLAMA_HOST=0.0.0.0 OLLAMA_KEEP_ALIVE=-1 OLLAMA_MODELS=/Users/pjv/.ollama/models PATH="/Users/pjv/.local/bin:/Users/pjv/go/bin:/opt/homebrew/sbin:/opt/homebrew/bin:/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin" OLLAMA_CONTEXT_LENGTH=16384 OLLAMA_MAX_LOADED_MODELS=3 DYLD_LIBRARY_PATH=/Applications/Ollama.app/Contents/Resources:/Applications/Ollama.app/Contents/Resources:/Applications/Ollama.app/Contents/Resources OLLAMA_LIBRARY_PATH=/Applications/Ollama.app/Contents/Resources:/Applications/Ollama.app/Contents/Resources
time=2025-11-04T05:30:22.973-06:00 level=INFO source=server.go:505 msg="system memory" total="16.0 GiB" free="3.6 GiB" free_swap="0 B"
time=2025-11-04T05:30:22.973-06:00 level=DEBUG source=memory.go:181 msg=evaluating library=Metal gpu_count=1 available="[2.7 GiB]"
time=2025-11-04T05:30:22.974-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=llama.vision.block_count default=0
time=2025-11-04T05:30:22.974-06:00 level=DEBUG source=ggml.go:611 msg="default cache size estimate" "attention MiB"=1792 "attention bytes"=1879048192 "recurrent MiB"=0 "recurrent bytes"=0
time=2025-11-04T05:30:22.974-06:00 level=INFO source=server.go:512 msg="model requires more memory than is currently available, evicting a model to make space" estimate.library="" estimate.layers.requested=0 estimate.layers.model=0 estimate.layers.offload=0 estimate.layers.split=[] estimate.memory.available=[] estimate.memory.gpu_overhead="0 B" estimate.memory.required.full="0 B" estimate.memory.required.partial="0 B" estimate.memory.required.kv="0 B" estimate.memory.required.allocations=[] estimate.memory.weights.total="0 B" estimate.memory.weights.repeating="0 B" estimate.memory.weights.nonrepeating="0 B" estimate.memory.graph.full="0 B" estimate.memory.graph.partial="0 B"
time=2025-11-04T05:30:22.974-06:00 level=DEBUG source=sched.go:787 msg="no idle runners, picking the shortest duration" runner_count=1 runner.name=registry.ollama.ai/library/qwen3:4b-instruct-28k runner.inference="[{ID:0 Library:Metal}]" runner.size="8.3 GiB" runner.vram="8.3 GiB" runner.parallel=1 runner.pid=20963 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a runner.num_ctx=28000
time=2025-11-04T05:30:22.974-06:00 level=DEBUG source=sched.go:240 msg="resetting model to expire immediately to make room" runner.name=registry.ollama.ai/library/qwen3:4b-instruct-28k runner.inference="[{ID:0 Library:Metal}]" runner.size="8.3 GiB" runner.vram="8.3 GiB" runner.parallel=1 runner.pid=20963 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a runner.num_ctx=28000 refCount=1
time=2025-11-04T05:30:22.974-06:00 level=DEBUG source=sched.go:251 msg="waiting for pending requests to complete and unload to occur" runner.name=registry.ollama.ai/library/qwen3:4b-instruct-28k runner.inference="[{ID:0 Library:Metal}]" runner.size="8.3 GiB" runner.vram="8.3 GiB" runner.parallel=1 runner.pid=20963 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a runner.num_ctx=28000
time=2025-11-04T05:30:22.987-06:00 level=INFO source=runner.go:893 msg="starting go runner"
time=2025-11-04T05:30:22.987-06:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/Applications/Ollama.app/Contents/Resources
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.019 sec
ggml_metal_device_init: GPU name:   Apple M2
ggml_metal_device_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 11453.25 MB
time=2025-11-04T05:30:22.988-06:00 level=INFO source=ggml.go:104 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 compiler=cgo(clang)
time=2025-11-04T05:30:23.087-06:00 level=INFO source=runner.go:929 msg="Server listening on 127.0.0.1:52412"
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mm_q8_0_f32', name = 'kernel_mul_mm_q8_0_f32_bci=0_bco=1'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mm_q8_0_f32_bci=0_bco=1 0x142713fb0 | th_max = 896 | th_width = 32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_get_rows_f32', name = 'kernel_get_rows_f32'
ggml_metal_library_compile_pipeline: loaded kernel_get_rows_f32 0x142613cf0 | th_max = 1024 | th_width = 32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_add_row_c4_fuse_1', name = 'kernel_add_row_c4_fuse_1'
ggml_metal_library_compile_pipeline: loaded kernel_add_row_c4_fuse_1 0x142614350 | th_max = 1024 | th_width = 32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mv_q8_0_f32', name = 'kernel_mul_mv_q8_0_f32_nsg=4'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mv_q8_0_f32_nsg=4 0x142614ee0 | th_max = 1024 | th_width = 32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_flash_attn_ext_vec_f16_dk128_dv128', name = 'kernel_flash_attn_ext_vec_f16_dk128_dv128_mask=1_sink=0_bias=0_scap=0_kvpad=0_ns10=1024_ns20=1024_nsg=4_nwg=32'
ggml_metal_library_compile_pipeline: loaded kernel_flash_attn_ext_vec_f16_dk128_dv128_mask=1_sink=0_bias=0_scap=0_kvpad=0_ns10=1024_ns20=1024_nsg=4_nwg=32 0x142615140 | th_max = 448 | th_width = 32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_flash_attn_ext_vec_reduce', name = 'kernel_flash_attn_ext_vec_reduce_dv=128_nwg=32'
ggml_metal_library_compile_pipeline: loaded kernel_flash_attn_ext_vec_reduce_dv=128_nwg=32 0x142615550 | th_max = 1024 | th_width = 32
[GIN] 2025/11/04 - 05:31:07 | 200 | 47.855582791s | 100.116.113.16 | POST     "/v1/chat/completions"
time=2025-11-04T05:31:07.179-06:00 level=DEBUG source=sched.go:502 msg="context for request finished"
time=2025-11-04T05:31:07.179-06:00 level=DEBUG source=sched.go:287 msg="runner with zero duration has gone idle, expiring to unload" runner.name=registry.ollama.ai/library/qwen3:4b-instruct-28k runner.inference="[{ID:0 Library:Metal}]" runner.size="8.3 GiB" runner.vram="8.3 GiB" runner.parallel=1 runner.pid=20963 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a runner.num_ctx=28000
time=2025-11-04T05:31:07.179-06:00 level=DEBUG source=sched.go:312 msg="after processing request finished event" runner.name=registry.ollama.ai/library/qwen3:4b-instruct-28k runner.inference="[{ID:0 Library:Metal}]" runner.size="8.3 GiB" runner.vram="8.3 GiB" runner.parallel=1 runner.pid=20963 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a runner.num_ctx=28000 refCount=0
time=2025-11-04T05:31:07.179-06:00 level=DEBUG source=sched.go:315 msg="runner expired event received" runner.name=registry.ollama.ai/library/qwen3:4b-instruct-28k runner.inference="[{ID:0 Library:Metal}]" runner.size="8.3 GiB" runner.vram="8.3 GiB" runner.parallel=1 runner.pid=20963 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a runner.num_ctx=28000
time=2025-11-04T05:31:07.179-06:00 level=DEBUG source=sched.go:330 msg="got lock to unload expired event" runner.name=registry.ollama.ai/library/qwen3:4b-instruct-28k runner.inference="[{ID:0 Library:Metal}]" runner.size="8.3 GiB" runner.vram="8.3 GiB" runner.parallel=1 runner.pid=20963 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a runner.num_ctx=28000
time=2025-11-04T05:31:07.179-06:00 level=DEBUG source=sched.go:353 msg="starting background wait for VRAM recovery" runner.name=registry.ollama.ai/library/qwen3:4b-instruct-28k runner.inference="[{ID:0 Library:Metal}]" runner.size="8.3 GiB" runner.vram="8.3 GiB" runner.parallel=1 runner.pid=20963 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a runner.num_ctx=28000
time=2025-11-04T05:31:07.179-06:00 level=DEBUG source=sched.go:637 msg="no need to wait for VRAM recovery" runner.name=registry.ollama.ai/library/qwen3:4b-instruct-28k runner.inference="[{ID:0 Library:Metal}]" runner.size="8.3 GiB" runner.vram="8.3 GiB" runner.parallel=1 runner.pid=20963 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a runner.num_ctx=28000
time=2025-11-04T05:31:07.179-06:00 level=DEBUG source=server.go:1720 msg="stopping llama server" pid=20963
time=2025-11-04T05:31:07.179-06:00 level=DEBUG source=server.go:1726 msg="waiting for llama server to exit" pid=20963
time=2025-11-04T05:31:07.229-06:00 level=DEBUG source=server.go:1730 msg="llama server stopped" pid=20963
time=2025-11-04T05:31:07.229-06:00 level=DEBUG source=sched.go:362 msg="runner terminated and removed from list, blocking for VRAM recovery" runner.size="8.3 GiB" runner.vram="8.3 GiB" runner.parallel=1 runner.pid=20963 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a
time=2025-11-04T05:31:07.229-06:00 level=DEBUG source=sched.go:365 msg="sending an unloaded event" runner.size="8.3 GiB" runner.vram="8.3 GiB" runner.parallel=1 runner.pid=20963 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a
time=2025-11-04T05:31:07.229-06:00 level=DEBUG source=sched.go:257 msg="unload completed" runner.size="8.3 GiB" runner.vram="8.3 GiB" runner.parallel=1 runner.pid=20963 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a
time=2025-11-04T05:31:07.229-06:00 level=DEBUG source=runner.go:45 msg="overall device VRAM discovery took" duration=584ns
time=2025-11-04T05:31:07.242-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.alignment default=32
time=2025-11-04T05:31:07.242-06:00 level=DEBUG source=sched.go:215 msg="loading first model" model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff
time=2025-11-04T05:31:07.274-06:00 level=INFO source=server.go:505 msg="system memory" total="16.0 GiB" free="8.1 GiB" free_swap="0 B"
time=2025-11-04T05:31:07.274-06:00 level=DEBUG source=memory.go:181 msg=evaluating library=Metal gpu_count=1 available="[10.7 GiB]"
time=2025-11-04T05:31:07.274-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=llama.vision.block_count default=0
time=2025-11-04T05:31:07.274-06:00 level=DEBUG source=ggml.go:611 msg="default cache size estimate" "attention MiB"=1792 "attention bytes"=1879048192 "recurrent MiB"=0 "recurrent bytes"=0
time=2025-11-04T05:31:07.274-06:00 level=INFO source=memory.go:36 msg="new model will fit in available VRAM across minimum required GPUs, loading" model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff library=Metal parallel=1 required="4.6 GiB" gpus=1
time=2025-11-04T05:31:07.274-06:00 level=DEBUG source=memory.go:181 msg=evaluating library=Metal gpu_count=1 available="[10.7 GiB]"
time=2025-11-04T05:31:07.274-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=llama.vision.block_count default=0
time=2025-11-04T05:31:07.274-06:00 level=DEBUG source=ggml.go:611 msg="default cache size estimate" "attention MiB"=1792 "attention bytes"=1879048192 "recurrent MiB"=0 "recurrent bytes"=0
time=2025-11-04T05:31:07.274-06:00 level=INFO source=server.go:545 msg=offload library=Metal layers.requested=-1 layers.model=29 layers.offload=29 layers.split=[29] memory.available="[10.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="4.6 GiB" memory.required.partial="4.6 GiB" memory.required.kv="1.8 GiB" memory.required.allocations="[4.6 GiB]" memory.weights.total="1.9 GiB" memory.weights.repeating="1.6 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="824.0 MiB" memory.graph.partial="824.0 MiB"
time=2025-11-04T05:31:07.275-06:00 level=INFO source=runner.go:828 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:16384 KvCacheType: NumThreads:4 GPULayers:29[ID:0 Layers:29(0..28)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:true}"
time=2025-11-04T05:31:07.276-06:00 level=INFO source=server.go:1272 msg="waiting for llama runner to start responding"
llama_model_load_from_file_impl: using device Metal (Apple M2) (unknown id) - 10922 MiB free
time=2025-11-04T05:31:07.276-06:00 level=INFO source=server.go:1306 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.2 3B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Llama-3.2
llama_model_loader: - kv 5: general.size_label str = 3B
llama_model_loader: - kv 6: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 7: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 8: llama.block_count u32 = 28
llama_model_loader: - kv 9: llama.context_length u32 = 131072
llama_model_loader: - kv 10: llama.embedding_length u32 = 3072
llama_model_loader: - kv 11: llama.feed_forward_length u32 = 8192
llama_model_loader: - kv 12: llama.attention.head_count u32 = 24
llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 16: llama.attention.key_length u32 = 128
llama_model_loader: - kv 17: llama.attention.value_length u32 = 128
llama_model_loader: - kv 18: general.file_type u32 = 15
llama_model_loader: - kv 19: llama.vocab_size u32 = 128256
llama_model_loader: - kv 20: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 22: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 28: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 29: general.quantization_version u32 = 2
llama_model_loader: - type f32: 58 tensors
llama_model_loader: - type q4_K: 168 tensors
llama_model_loader: - type q6_K: 29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 1.87 GiB (5.01 BPW)
init_tokenizer: initializing tokenizer for type 2
load: control token: 128254 '<|reserved_special_token_246|>' is not marked as EOG
load: control token: 128252 '<|reserved_special_token_244|>' is not marked as EOG
load: control token: 128251 '<|reserved_special_token_243|>' is not marked as EOG
load: control token: 128250 '<|reserved_special_token_242|>' is not marked as EOG
load: control token: 128248 '<|reserved_special_token_240|>' is not marked as EOG
load: control token: 128247 '<|reserved_special_token_239|>' is not marked as EOG
load: control token: 128245 '<|reserved_special_token_237|>' is not marked as EOG
load: control token: 128244 '<|reserved_special_token_236|>' is not marked as EOG
load: control token: 128243 '<|reserved_special_token_235|>' is not marked as EOG
load: control token: 128240 '<|reserved_special_token_232|>' is not marked as EOG
load: control token: 128238 '<|reserved_special_token_230|>' is not marked as EOG
load: control token: 128235 '<|reserved_special_token_227|>' is not marked as EOG
load: control token: 128234 '<|reserved_special_token_226|>' is not marked as EOG
load: control token: 128229 '<|reserved_special_token_221|>' is not marked as EOG
load: control token: 128227 '<|reserved_special_token_219|>' is not marked as EOG
load: control token: 128226 '<|reserved_special_token_218|>' is not marked as EOG
load: control token: 128224 '<|reserved_special_token_216|>' is not marked as EOG
load: control token: 128223 '<|reserved_special_token_215|>' is not marked as EOG
load: control token: 128221 '<|reserved_special_token_213|>' is not marked as EOG
load: control token: 128219 '<|reserved_special_token_211|>' is not marked as EOG
load: control token: 128218 '<|reserved_special_token_210|>' is not marked as EOG
load: control token: 128217 '<|reserved_special_token_209|>' is not marked as EOG
load: control token: 128216 '<|reserved_special_token_208|>' is not marked as EOG
load: control token: 128215 '<|reserved_special_token_207|>' is not marked as EOG
load: control token: 128213 '<|reserved_special_token_205|>' is not marked as EOG
load: control token: 128211 '<|reserved_special_token_203|>' is not marked as EOG
load: control token: 128210 '<|reserved_special_token_202|>' is not marked as EOG
load: control token: 128209 '<|reserved_special_token_201|>' is not marked as EOG
load: control token: 128208 '<|reserved_special_token_200|>' is not marked as EOG
load: control token: 128207 '<|reserved_special_token_199|>' is not marked as EOG
load: control token: 128204 '<|reserved_special_token_196|>' is not marked as EOG
load: control token: 128202 '<|reserved_special_token_194|>' is not marked as EOG
load: control token: 128197 '<|reserved_special_token_189|>' is not marked as EOG
load: control token: 128195 '<|reserved_special_token_187|>' is not marked as EOG
load: control token: 128194 '<|reserved_special_token_186|>' is not marked as EOG
load: control token: 128191 '<|reserved_special_token_183|>' is not marked as EOG
load: control token: 128190 '<|reserved_special_token_182|>' is not marked as EOG
load: control token: 128188 '<|reserved_special_token_180|>' is not marked as EOG
load: control token: 128187 '<|reserved_special_token_179|>' is not marked as EOG
load: control token: 128185 '<|reserved_special_token_177|>' is not marked as EOG
load: control token: 128184 '<|reserved_special_token_176|>' is not marked as EOG
load: control token: 128183 '<|reserved_special_token_175|>' is not marked as EOG
load: control token: 128178 '<|reserved_special_token_170|>' is not marked as EOG
load: control token: 128177 '<|reserved_special_token_169|>' is not marked as EOG
load: control token: 128176 '<|reserved_special_token_168|>' is not marked as EOG
load: control token: 128175 '<|reserved_special_token_167|>' is not marked as EOG
load: control token: 128174 '<|reserved_special_token_166|>' is not marked as EOG
load: control token: 128173 '<|reserved_special_token_165|>' is not marked as EOG
load: control token: 128172 '<|reserved_special_token_164|>' is not marked as EOG
load: control token: 128169 '<|reserved_special_token_161|>' is not marked as EOG
load: control token: 128167 '<|reserved_special_token_159|>' is not marked as EOG
load: control token: 128166 '<|reserved_special_token_158|>' is not marked as EOG
load: control token: 128160 '<|reserved_special_token_152|>' is not marked as EOG
load: control token: 128159 '<|reserved_special_token_151|>' is not marked as EOG
load: control token: 128157 '<|reserved_special_token_149|>' is not marked as EOG
load: control token: 128156 '<|reserved_special_token_148|>' is not marked as EOG
load: control token: 128154 '<|reserved_special_token_146|>' is not marked as EOG
load: control token: 128152 '<|reserved_special_token_144|>' is not marked as EOG
load: control token: 128151 '<|reserved_special_token_143|>' is not marked as EOG
load: control token: 128150 '<|reserved_special_token_142|>' is not marked as EOG
load: control token: 128147 '<|reserved_special_token_139|>' is not marked as EOG
load: control token: 128144 '<|reserved_special_token_136|>' is not marked as EOG
load: control token: 128142 '<|reserved_special_token_134|>' is not marked as EOG
load: control token: 128141 '<|reserved_special_token_133|>' is not marked as EOG
load: control token: 128140 '<|reserved_special_token_132|>' is not marked as EOG
load: control token: 128133 '<|reserved_special_token_125|>' is not marked as EOG
load: control token: 128130 '<|reserved_special_token_122|>' is not marked as EOG
load: control token: 128128 '<|reserved_special_token_120|>' is not marked as EOG
load: control token: 128127 '<|reserved_special_token_119|>' is not marked as EOG
load: control token: 128126 '<|reserved_special_token_118|>' is not marked as EOG
load: control token: 128125 '<|reserved_special_token_117|>' is not marked as EOG
load: control token: 128124 '<|reserved_special_token_116|>' is not marked as EOG
load: control token: 128123 '<|reserved_special_token_115|>' is not marked as EOG
load: control token: 128122 '<|reserved_special_token_114|>' is not marked as EOG
load: control token: 128121 '<|reserved_special_token_113|>' is not marked as EOG
load: control token: 128120 '<|reserved_special_token_112|>' is not marked as EOG
load: control token: 128119 '<|reserved_special_token_111|>' is not marked as EOG
load: control token: 128116 '<|reserved_special_token_108|>' is not marked as EOG
load: control token: 128115 '<|reserved_special_token_107|>' is not marked as EOG
load: control token: 128114 '<|reserved_special_token_106|>' is not marked as EOG
load: control token: 128113 '<|reserved_special_token_105|>' is not marked as EOG
load: control token: 128111 '<|reserved_special_token_103|>' is not marked as EOG
load: control token: 128110 '<|reserved_special_token_102|>' is not marked as EOG
load: control token: 128107 '<|reserved_special_token_99|>' is not marked as EOG
load: control token: 128106 '<|reserved_special_token_98|>' is not marked as EOG
load: control token: 128105 '<|reserved_special_token_97|>' is not marked as EOG
load: control token: 128104 '<|reserved_special_token_96|>' is not marked as EOG
load: control token: 128103 '<|reserved_special_token_95|>' is not marked as EOG
load: control token: 128100 '<|reserved_special_token_92|>' is not marked as EOG
load: control token: 128097 '<|reserved_special_token_89|>' is not marked as EOG
load: control token: 128096 '<|reserved_special_token_88|>' is not marked as EOG
load: control token: 128094 '<|reserved_special_token_86|>' is not marked as EOG
load: control token: 128093 '<|reserved_special_token_85|>' is not marked as EOG
load: control token: 128090 '<|reserved_special_token_82|>' is not marked as EOG
load: control token: 128089 '<|reserved_special_token_81|>' is not marked as EOG
load: control token: 128087 '<|reserved_special_token_79|>' is not marked as EOG
load: control token: 128085 '<|reserved_special_token_77|>' is not marked as EOG
load: control token: 128080 '<|reserved_special_token_72|>' is not marked as EOG
load: control token: 128077 '<|reserved_special_token_69|>' is not marked as EOG
load: control token: 128076 '<|reserved_special_token_68|>' is not marked as EOG
load: control token: 128073 '<|reserved_special_token_65|>' is not marked as EOG
load: control token: 128070 '<|reserved_special_token_62|>' is not marked as EOG
load: control token: 128069 '<|reserved_special_token_61|>' is not marked as EOG
load: control token: 128067 '<|reserved_special_token_59|>' is not marked as EOG
load: control token: 128064 '<|reserved_special_token_56|>' is not marked as EOG
load: control token: 128062 '<|reserved_special_token_54|>' is not marked as EOG
load: control token: 128061 '<|reserved_special_token_53|>' is not marked as EOG
load: control token: 128060 '<|reserved_special_token_52|>' is not marked as EOG
load: control token: 128054 '<|reserved_special_token_46|>' is not marked as EOG
load: control token: 128045 '<|reserved_special_token_37|>' is not marked as EOG
load: control token: 128044 '<|reserved_special_token_36|>' is not marked as EOG
load: control token: 128043 '<|reserved_special_token_35|>' is not marked as EOG
load: control token: 128042 '<|reserved_special_token_34|>' is not marked as EOG
load: control token: 128038 '<|reserved_special_token_30|>' is not marked as EOG
load: control token: 128037 '<|reserved_special_token_29|>' is not marked as EOG
load: control token: 128035 '<|reserved_special_token_27|>' is not marked as EOG
load: control token: 128034 '<|reserved_special_token_26|>' is not marked as EOG
load: control token: 128033 '<|reserved_special_token_25|>' is not marked as EOG
load: control token: 128032 '<|reserved_special_token_24|>' is not marked as EOG
load: control token: 128030 '<|reserved_special_token_22|>' is not marked as EOG
load: control token: 128029 '<|reserved_special_token_21|>' is not marked as EOG
load: control token: 128028 '<|reserved_special_token_20|>' is not marked as EOG
load: control token: 128026 '<|reserved_special_token_18|>' is not marked as EOG
load: control token: 128025 '<|reserved_special_token_17|>' is not marked as EOG
load: control token: 128024 '<|reserved_special_token_16|>' is not marked as EOG
load: control token: 128022 '<|reserved_special_token_14|>' is not marked as
EOG load: control token: 128020 '<|reserved_special_token_12|>' is not marked as EOG load: control token: 128017 '<|reserved_special_token_9|>' is not marked as EOG load: control token: 128016 '<|reserved_special_token_8|>' is not marked as EOG load: control token: 128015 '<|reserved_special_token_7|>' is not marked as EOG load: control token: 128014 '<|reserved_special_token_6|>' is not marked as EOG load: control token: 128013 '<|reserved_special_token_5|>' is not marked as EOG load: control token: 128011 '<|reserved_special_token_3|>' is not marked as EOG load: control token: 128010 '<|python_tag|>' is not marked as EOG load: control token: 128006 '<|start_header_id|>' is not marked as EOG load: control token: 128003 '<|reserved_special_token_1|>' is not marked as EOG load: control token: 128002 '<|reserved_special_token_0|>' is not marked as EOG load: control token: 128000 '<|begin_of_text|>' is not marked as EOG load: control token: 128041 '<|reserved_special_token_33|>' is not marked as EOG load: control token: 128063 '<|reserved_special_token_55|>' is not marked as EOG load: control token: 128046 '<|reserved_special_token_38|>' is not marked as EOG load: control token: 128007 '<|end_header_id|>' is not marked as EOG load: control token: 128065 '<|reserved_special_token_57|>' is not marked as EOG load: control token: 128171 '<|reserved_special_token_163|>' is not marked as EOG load: control token: 128162 '<|reserved_special_token_154|>' is not marked as EOG load: control token: 128165 '<|reserved_special_token_157|>' is not marked as EOG load: control token: 128057 '<|reserved_special_token_49|>' is not marked as EOG load: control token: 128050 '<|reserved_special_token_42|>' is not marked as EOG load: control token: 128056 '<|reserved_special_token_48|>' is not marked as EOG load: control token: 128230 '<|reserved_special_token_222|>' is not marked as EOG load: control token: 128098 '<|reserved_special_token_90|>' is not marked as EOG load: control token: 128153 '<|reserved_special_token_145|>' is not marked as EOG load: control token: 128084 '<|reserved_special_token_76|>' is not marked as EOG load: control token: 128082 '<|reserved_special_token_74|>' is not marked as EOG load: control token: 128102 '<|reserved_special_token_94|>' is not marked as EOG load: control token: 128253 '<|reserved_special_token_245|>' is not marked as EOG load: control token: 128179 '<|reserved_special_token_171|>' is not marked as EOG load: control token: 128071 '<|reserved_special_token_63|>' is not marked as EOG load: control token: 128135 '<|reserved_special_token_127|>' is not marked as EOG load: control token: 128161 '<|reserved_special_token_153|>' is not marked as EOG load: control token: 128164 '<|reserved_special_token_156|>' is not marked as EOG load: control token: 128134 '<|reserved_special_token_126|>' is not marked as EOG load: control token: 128249 '<|reserved_special_token_241|>' is not marked as EOG load: control token: 128004 '<|finetune_right_pad_id|>' is not marked as EOG load: control token: 128036 '<|reserved_special_token_28|>' is not marked as EOG load: control token: 128148 '<|reserved_special_token_140|>' is not marked as EOG load: control token: 128181 '<|reserved_special_token_173|>' is not marked as EOG load: control token: 128222 '<|reserved_special_token_214|>' is not marked as EOG load: control token: 128075 '<|reserved_special_token_67|>' is not marked as EOG load: control token: 128241 '<|reserved_special_token_233|>' is not marked as EOG load: control token: 128051 
'<|reserved_special_token_43|>' is not marked as EOG load: control token: 128068 '<|reserved_special_token_60|>' is not marked as EOG load: control token: 128149 '<|reserved_special_token_141|>' is not marked as EOG load: control token: 128201 '<|reserved_special_token_193|>' is not marked as EOG load: control token: 128058 '<|reserved_special_token_50|>' is not marked as EOG load: control token: 128146 '<|reserved_special_token_138|>' is not marked as EOG load: control token: 128143 '<|reserved_special_token_135|>' is not marked as EOG load: control token: 128023 '<|reserved_special_token_15|>' is not marked as EOG load: control token: 128039 '<|reserved_special_token_31|>' is not marked as EOG load: control token: 128132 '<|reserved_special_token_124|>' is not marked as EOG load: control token: 128101 '<|reserved_special_token_93|>' is not marked as EOG load: control token: 128212 '<|reserved_special_token_204|>' is not marked as EOG load: control token: 128189 '<|reserved_special_token_181|>' is not marked as EOG load: control token: 128225 '<|reserved_special_token_217|>' is not marked as EOG load: control token: 128129 '<|reserved_special_token_121|>' is not marked as EOG load: control token: 128005 '<|reserved_special_token_2|>' is not marked as EOG load: control token: 128078 '<|reserved_special_token_70|>' is not marked as EOG load: control token: 128163 '<|reserved_special_token_155|>' is not marked as EOG load: control token: 128072 '<|reserved_special_token_64|>' is not marked as EOG load: control token: 128112 '<|reserved_special_token_104|>' is not marked as EOG load: control token: 128186 '<|reserved_special_token_178|>' is not marked as EOG load: control token: 128095 '<|reserved_special_token_87|>' is not marked as EOG load: control token: 128109 '<|reserved_special_token_101|>' is not marked as EOG load: control token: 128099 '<|reserved_special_token_91|>' is not marked as EOG load: control token: 128138 '<|reserved_special_token_130|>' is not marked as EOG load: control token: 128193 '<|reserved_special_token_185|>' is not marked as EOG load: control token: 128199 '<|reserved_special_token_191|>' is not marked as EOG load: control token: 128048 '<|reserved_special_token_40|>' is not marked as EOG load: control token: 128088 '<|reserved_special_token_80|>' is not marked as EOG load: control token: 128192 '<|reserved_special_token_184|>' is not marked as EOG load: control token: 128136 '<|reserved_special_token_128|>' is not marked as EOG load: control token: 128092 '<|reserved_special_token_84|>' is not marked as EOG load: control token: 128158 '<|reserved_special_token_150|>' is not marked as EOG load: control token: 128049 '<|reserved_special_token_41|>' is not marked as EOG load: control token: 128031 '<|reserved_special_token_23|>' is not marked as EOG load: control token: 128255 '<|reserved_special_token_247|>' is not marked as EOG load: control token: 128182 '<|reserved_special_token_174|>' is not marked as EOG load: control token: 128066 '<|reserved_special_token_58|>' is not marked as EOG load: control token: 128180 '<|reserved_special_token_172|>' is not marked as EOG load: control token: 128233 '<|reserved_special_token_225|>' is not marked as EOG load: control token: 128079 '<|reserved_special_token_71|>' is not marked as EOG load: control token: 128081 '<|reserved_special_token_73|>' is not marked as EOG load: control token: 128231 '<|reserved_special_token_223|>' is not marked as EOG load: control token: 128196 '<|reserved_special_token_188|>' is not marked 
as EOG load: control token: 128047 '<|reserved_special_token_39|>' is not marked as EOG load: control token: 128083 '<|reserved_special_token_75|>' is not marked as EOG load: control token: 128139 '<|reserved_special_token_131|>' is not marked as EOG load: control token: 128131 '<|reserved_special_token_123|>' is not marked as EOG load: control token: 128118 '<|reserved_special_token_110|>' is not marked as EOG load: control token: 128053 '<|reserved_special_token_45|>' is not marked as EOG load: control token: 128220 '<|reserved_special_token_212|>' is not marked as EOG load: control token: 128108 '<|reserved_special_token_100|>' is not marked as EOG load: control token: 128091 '<|reserved_special_token_83|>' is not marked as EOG load: control token: 128203 '<|reserved_special_token_195|>' is not marked as EOG load: control token: 128059 '<|reserved_special_token_51|>' is not marked as EOG load: control token: 128019 '<|reserved_special_token_11|>' is not marked as EOG load: control token: 128170 '<|reserved_special_token_162|>' is not marked as EOG load: control token: 128205 '<|reserved_special_token_197|>' is not marked as EOG load: control token: 128040 '<|reserved_special_token_32|>' is not marked as EOG load: control token: 128200 '<|reserved_special_token_192|>' is not marked as EOG load: control token: 128236 '<|reserved_special_token_228|>' is not marked as EOG load: control token: 128145 '<|reserved_special_token_137|>' is not marked as EOG load: control token: 128168 '<|reserved_special_token_160|>' is not marked as EOG load: control token: 128214 '<|reserved_special_token_206|>' is not marked as EOG load: control token: 128137 '<|reserved_special_token_129|>' is not marked as EOG load: control token: 128232 '<|reserved_special_token_224|>' is not marked as EOG load: control token: 128239 '<|reserved_special_token_231|>' is not marked as EOG load: control token: 128055 '<|reserved_special_token_47|>' is not marked as EOG load: control token: 128228 '<|reserved_special_token_220|>' is not marked as EOG load: control token: 128206 '<|reserved_special_token_198|>' is not marked as EOG load: control token: 128018 '<|reserved_special_token_10|>' is not marked as EOG load: control token: 128012 '<|reserved_special_token_4|>' is not marked as EOG load: control token: 128198 '<|reserved_special_token_190|>' is not marked as EOG load: control token: 128021 '<|reserved_special_token_13|>' is not marked as EOG load: control token: 128086 '<|reserved_special_token_78|>' is not marked as EOG load: control token: 128074 '<|reserved_special_token_66|>' is not marked as EOG load: control token: 128027 '<|reserved_special_token_19|>' is not marked as EOG load: control token: 128242 '<|reserved_special_token_234|>' is not marked as EOG load: control token: 128155 '<|reserved_special_token_147|>' is not marked as EOG load: control token: 128052 '<|reserved_special_token_44|>' is not marked as EOG load: control token: 128246 '<|reserved_special_token_238|>' is not marked as EOG load: control token: 128117 '<|reserved_special_token_109|>' is not marked as EOG load: control token: 128237 '<|reserved_special_token_229|>' is not marked as EOG load: printing all EOG tokens: load: - 128001 ('<|end_of_text|>') load: - 128008 ('<|eom_id|>') load: - 128009 ('<|eot_id|>') load: special tokens cache size = 256 load: token to piece cache size = 0.7999 MB print_info: arch = llama print_info: vocab_only = 0 print_info: n_ctx_train = 131072 print_info: n_embd = 3072 print_info: n_layer = 28 print_info: n_head = 
24 print_info: n_head_kv = 8 print_info: n_rot = 128 print_info: n_swa = 0 print_info: is_swa_any = 0 print_info: n_embd_head_k = 128 print_info: n_embd_head_v = 128 print_info: n_gqa = 3 print_info: n_embd_k_gqa = 1024 print_info: n_embd_v_gqa = 1024 print_info: f_norm_eps = 0.0e+00 print_info: f_norm_rms_eps = 1.0e-05 print_info: f_clamp_kqv = 0.0e+00 print_info: f_max_alibi_bias = 0.0e+00 print_info: f_logit_scale = 0.0e+00 print_info: f_attn_scale = 0.0e+00 print_info: n_ff = 8192 print_info: n_expert = 0 print_info: n_expert_used = 0 print_info: causal attn = 1 print_info: pooling type = 0 print_info: rope type = 0 print_info: rope scaling = linear print_info: freq_base_train = 500000.0 print_info: freq_scale_train = 1 print_info: n_ctx_orig_yarn = 131072 print_info: rope_finetuned = unknown print_info: model type = 3B print_info: model params = 3.21 B print_info: general.name = Llama 3.2 3B Instruct print_info: vocab type = BPE print_info: n_vocab = 128256 print_info: n_merges = 280147 print_info: BOS token = 128000 '<|begin_of_text|>' print_info: EOS token = 128009 '<|eot_id|>' print_info: EOT token = 128001 '<|end_of_text|>' print_info: EOM token = 128008 '<|eom_id|>' print_info: LF token = 198 'Ċ' print_info: EOG token = 128001 '<|end_of_text|>' print_info: EOG token = 128008 '<|eom_id|>' print_info: EOG token = 128009 '<|eot_id|>' print_info: max token length = 256 load_tensors: loading model tensors, this can take a while... (mmap = true) load_tensors: layer 0 assigned to device Metal, is_swa = 0 load_tensors: layer 1 assigned to device Metal, is_swa = 0 load_tensors: layer 2 assigned to device Metal, is_swa = 0 load_tensors: layer 3 assigned to device Metal, is_swa = 0 load_tensors: layer 4 assigned to device Metal, is_swa = 0 load_tensors: layer 5 assigned to device Metal, is_swa = 0 load_tensors: layer 6 assigned to device Metal, is_swa = 0 load_tensors: layer 7 assigned to device Metal, is_swa = 0 load_tensors: layer 8 assigned to device Metal, is_swa = 0 load_tensors: layer 9 assigned to device Metal, is_swa = 0 load_tensors: layer 10 assigned to device Metal, is_swa = 0 load_tensors: layer 11 assigned to device Metal, is_swa = 0 load_tensors: layer 12 assigned to device Metal, is_swa = 0 load_tensors: layer 13 assigned to device Metal, is_swa = 0 load_tensors: layer 14 assigned to device Metal, is_swa = 0 load_tensors: layer 15 assigned to device Metal, is_swa = 0 load_tensors: layer 16 assigned to device Metal, is_swa = 0 load_tensors: layer 17 assigned to device Metal, is_swa = 0 load_tensors: layer 18 assigned to device Metal, is_swa = 0 load_tensors: layer 19 assigned to device Metal, is_swa = 0 load_tensors: layer 20 assigned to device Metal, is_swa = 0 load_tensors: layer 21 assigned to device Metal, is_swa = 0 load_tensors: layer 22 assigned to device Metal, is_swa = 0 load_tensors: layer 23 assigned to device Metal, is_swa = 0 load_tensors: layer 24 assigned to device Metal, is_swa = 0 load_tensors: layer 25 assigned to device Metal, is_swa = 0 load_tensors: layer 26 assigned to device Metal, is_swa = 0 load_tensors: layer 27 assigned to device Metal, is_swa = 0 load_tensors: layer 28 assigned to device Metal, is_swa = 0 create_tensor: loading tensor token_embd.weight create_tensor: loading tensor output_norm.weight create_tensor: loading tensor token_embd.weight create_tensor: loading tensor blk.0.attn_norm.weight create_tensor: loading tensor blk.0.attn_q.weight create_tensor: loading tensor blk.0.attn_k.weight create_tensor: loading tensor blk.0.attn_v.weight 
create_tensor: loading tensor blk.0.attn_output.weight create_tensor: loading tensor blk.0.ffn_norm.weight create_tensor: loading tensor rope_freqs.weight create_tensor: loading tensor blk.0.ffn_gate.weight create_tensor: loading tensor blk.0.ffn_down.weight create_tensor: loading tensor blk.0.ffn_up.weight create_tensor: loading tensor blk.1.attn_norm.weight create_tensor: loading tensor blk.1.attn_q.weight create_tensor: loading tensor blk.1.attn_k.weight create_tensor: loading tensor blk.1.attn_v.weight create_tensor: loading tensor blk.1.attn_output.weight create_tensor: loading tensor blk.1.ffn_norm.weight create_tensor: loading tensor blk.1.ffn_gate.weight create_tensor: loading tensor blk.1.ffn_down.weight create_tensor: loading tensor blk.1.ffn_up.weight create_tensor: loading tensor blk.2.attn_norm.weight create_tensor: loading tensor blk.2.attn_q.weight create_tensor: loading tensor blk.2.attn_k.weight create_tensor: loading tensor blk.2.attn_v.weight create_tensor: loading tensor blk.2.attn_output.weight create_tensor: loading tensor blk.2.ffn_norm.weight create_tensor: loading tensor blk.2.ffn_gate.weight create_tensor: loading tensor blk.2.ffn_down.weight create_tensor: loading tensor blk.2.ffn_up.weight create_tensor: loading tensor blk.3.attn_norm.weight create_tensor: loading tensor blk.3.attn_q.weight create_tensor: loading tensor blk.3.attn_k.weight create_tensor: loading tensor blk.3.attn_v.weight create_tensor: loading tensor blk.3.attn_output.weight create_tensor: loading tensor blk.3.ffn_norm.weight create_tensor: loading tensor blk.3.ffn_gate.weight create_tensor: loading tensor blk.3.ffn_down.weight create_tensor: loading tensor blk.3.ffn_up.weight create_tensor: loading tensor blk.4.attn_norm.weight create_tensor: loading tensor blk.4.attn_q.weight create_tensor: loading tensor blk.4.attn_k.weight create_tensor: loading tensor blk.4.attn_v.weight create_tensor: loading tensor blk.4.attn_output.weight create_tensor: loading tensor blk.4.ffn_norm.weight create_tensor: loading tensor blk.4.ffn_gate.weight create_tensor: loading tensor blk.4.ffn_down.weight create_tensor: loading tensor blk.4.ffn_up.weight create_tensor: loading tensor blk.5.attn_norm.weight create_tensor: loading tensor blk.5.attn_q.weight create_tensor: loading tensor blk.5.attn_k.weight create_tensor: loading tensor blk.5.attn_v.weight create_tensor: loading tensor blk.5.attn_output.weight create_tensor: loading tensor blk.5.ffn_norm.weight create_tensor: loading tensor blk.5.ffn_gate.weight create_tensor: loading tensor blk.5.ffn_down.weight create_tensor: loading tensor blk.5.ffn_up.weight create_tensor: loading tensor blk.6.attn_norm.weight create_tensor: loading tensor blk.6.attn_q.weight create_tensor: loading tensor blk.6.attn_k.weight create_tensor: loading tensor blk.6.attn_v.weight create_tensor: loading tensor blk.6.attn_output.weight create_tensor: loading tensor blk.6.ffn_norm.weight create_tensor: loading tensor blk.6.ffn_gate.weight create_tensor: loading tensor blk.6.ffn_down.weight create_tensor: loading tensor blk.6.ffn_up.weight create_tensor: loading tensor blk.7.attn_norm.weight create_tensor: loading tensor blk.7.attn_q.weight create_tensor: loading tensor blk.7.attn_k.weight create_tensor: loading tensor blk.7.attn_v.weight create_tensor: loading tensor blk.7.attn_output.weight create_tensor: loading tensor blk.7.ffn_norm.weight create_tensor: loading tensor blk.7.ffn_gate.weight create_tensor: loading tensor blk.7.ffn_down.weight create_tensor: loading tensor 
blk.7.ffn_up.weight create_tensor: loading tensor blk.8.attn_norm.weight create_tensor: loading tensor blk.8.attn_q.weight create_tensor: loading tensor blk.8.attn_k.weight create_tensor: loading tensor blk.8.attn_v.weight create_tensor: loading tensor blk.8.attn_output.weight create_tensor: loading tensor blk.8.ffn_norm.weight create_tensor: loading tensor blk.8.ffn_gate.weight create_tensor: loading tensor blk.8.ffn_down.weight create_tensor: loading tensor blk.8.ffn_up.weight create_tensor: loading tensor blk.9.attn_norm.weight create_tensor: loading tensor blk.9.attn_q.weight create_tensor: loading tensor blk.9.attn_k.weight create_tensor: loading tensor blk.9.attn_v.weight create_tensor: loading tensor blk.9.attn_output.weight create_tensor: loading tensor blk.9.ffn_norm.weight create_tensor: loading tensor blk.9.ffn_gate.weight create_tensor: loading tensor blk.9.ffn_down.weight create_tensor: loading tensor blk.9.ffn_up.weight create_tensor: loading tensor blk.10.attn_norm.weight create_tensor: loading tensor blk.10.attn_q.weight create_tensor: loading tensor blk.10.attn_k.weight create_tensor: loading tensor blk.10.attn_v.weight create_tensor: loading tensor blk.10.attn_output.weight create_tensor: loading tensor blk.10.ffn_norm.weight create_tensor: loading tensor blk.10.ffn_gate.weight create_tensor: loading tensor blk.10.ffn_down.weight create_tensor: loading tensor blk.10.ffn_up.weight create_tensor: loading tensor blk.11.attn_norm.weight create_tensor: loading tensor blk.11.attn_q.weight create_tensor: loading tensor blk.11.attn_k.weight create_tensor: loading tensor blk.11.attn_v.weight create_tensor: loading tensor blk.11.attn_output.weight create_tensor: loading tensor blk.11.ffn_norm.weight create_tensor: loading tensor blk.11.ffn_gate.weight create_tensor: loading tensor blk.11.ffn_down.weight create_tensor: loading tensor blk.11.ffn_up.weight create_tensor: loading tensor blk.12.attn_norm.weight create_tensor: loading tensor blk.12.attn_q.weight create_tensor: loading tensor blk.12.attn_k.weight create_tensor: loading tensor blk.12.attn_v.weight create_tensor: loading tensor blk.12.attn_output.weight create_tensor: loading tensor blk.12.ffn_norm.weight create_tensor: loading tensor blk.12.ffn_gate.weight create_tensor: loading tensor blk.12.ffn_down.weight create_tensor: loading tensor blk.12.ffn_up.weight create_tensor: loading tensor blk.13.attn_norm.weight create_tensor: loading tensor blk.13.attn_q.weight create_tensor: loading tensor blk.13.attn_k.weight create_tensor: loading tensor blk.13.attn_v.weight create_tensor: loading tensor blk.13.attn_output.weight create_tensor: loading tensor blk.13.ffn_norm.weight create_tensor: loading tensor blk.13.ffn_gate.weight create_tensor: loading tensor blk.13.ffn_down.weight create_tensor: loading tensor blk.13.ffn_up.weight create_tensor: loading tensor blk.14.attn_norm.weight create_tensor: loading tensor blk.14.attn_q.weight create_tensor: loading tensor blk.14.attn_k.weight create_tensor: loading tensor blk.14.attn_v.weight create_tensor: loading tensor blk.14.attn_output.weight create_tensor: loading tensor blk.14.ffn_norm.weight create_tensor: loading tensor blk.14.ffn_gate.weight create_tensor: loading tensor blk.14.ffn_down.weight create_tensor: loading tensor blk.14.ffn_up.weight create_tensor: loading tensor blk.15.attn_norm.weight create_tensor: loading tensor blk.15.attn_q.weight create_tensor: loading tensor blk.15.attn_k.weight create_tensor: loading tensor blk.15.attn_v.weight create_tensor: loading tensor 
blk.15.attn_output.weight create_tensor: loading tensor blk.15.ffn_norm.weight create_tensor: loading tensor blk.15.ffn_gate.weight create_tensor: loading tensor blk.15.ffn_down.weight create_tensor: loading tensor blk.15.ffn_up.weight create_tensor: loading tensor blk.16.attn_norm.weight create_tensor: loading tensor blk.16.attn_q.weight create_tensor: loading tensor blk.16.attn_k.weight create_tensor: loading tensor blk.16.attn_v.weight create_tensor: loading tensor blk.16.attn_output.weight create_tensor: loading tensor blk.16.ffn_norm.weight create_tensor: loading tensor blk.16.ffn_gate.weight create_tensor: loading tensor blk.16.ffn_down.weight create_tensor: loading tensor blk.16.ffn_up.weight create_tensor: loading tensor blk.17.attn_norm.weight create_tensor: loading tensor blk.17.attn_q.weight create_tensor: loading tensor blk.17.attn_k.weight create_tensor: loading tensor blk.17.attn_v.weight create_tensor: loading tensor blk.17.attn_output.weight create_tensor: loading tensor blk.17.ffn_norm.weight create_tensor: loading tensor blk.17.ffn_gate.weight create_tensor: loading tensor blk.17.ffn_down.weight create_tensor: loading tensor blk.17.ffn_up.weight create_tensor: loading tensor blk.18.attn_norm.weight create_tensor: loading tensor blk.18.attn_q.weight create_tensor: loading tensor blk.18.attn_k.weight create_tensor: loading tensor blk.18.attn_v.weight create_tensor: loading tensor blk.18.attn_output.weight create_tensor: loading tensor blk.18.ffn_norm.weight create_tensor: loading tensor blk.18.ffn_gate.weight create_tensor: loading tensor blk.18.ffn_down.weight create_tensor: loading tensor blk.18.ffn_up.weight create_tensor: loading tensor blk.19.attn_norm.weight create_tensor: loading tensor blk.19.attn_q.weight create_tensor: loading tensor blk.19.attn_k.weight create_tensor: loading tensor blk.19.attn_v.weight create_tensor: loading tensor blk.19.attn_output.weight create_tensor: loading tensor blk.19.ffn_norm.weight create_tensor: loading tensor blk.19.ffn_gate.weight create_tensor: loading tensor blk.19.ffn_down.weight create_tensor: loading tensor blk.19.ffn_up.weight create_tensor: loading tensor blk.20.attn_norm.weight create_tensor: loading tensor blk.20.attn_q.weight create_tensor: loading tensor blk.20.attn_k.weight create_tensor: loading tensor blk.20.attn_v.weight create_tensor: loading tensor blk.20.attn_output.weight create_tensor: loading tensor blk.20.ffn_norm.weight create_tensor: loading tensor blk.20.ffn_gate.weight create_tensor: loading tensor blk.20.ffn_down.weight create_tensor: loading tensor blk.20.ffn_up.weight create_tensor: loading tensor blk.21.attn_norm.weight create_tensor: loading tensor blk.21.attn_q.weight create_tensor: loading tensor blk.21.attn_k.weight create_tensor: loading tensor blk.21.attn_v.weight create_tensor: loading tensor blk.21.attn_output.weight create_tensor: loading tensor blk.21.ffn_norm.weight create_tensor: loading tensor blk.21.ffn_gate.weight create_tensor: loading tensor blk.21.ffn_down.weight create_tensor: loading tensor blk.21.ffn_up.weight create_tensor: loading tensor blk.22.attn_norm.weight create_tensor: loading tensor blk.22.attn_q.weight create_tensor: loading tensor blk.22.attn_k.weight create_tensor: loading tensor blk.22.attn_v.weight create_tensor: loading tensor blk.22.attn_output.weight create_tensor: loading tensor blk.22.ffn_norm.weight create_tensor: loading tensor blk.22.ffn_gate.weight create_tensor: loading tensor blk.22.ffn_down.weight create_tensor: loading tensor blk.22.ffn_up.weight 
create_tensor: loading tensor blk.23.attn_norm.weight create_tensor: loading tensor blk.23.attn_q.weight create_tensor: loading tensor blk.23.attn_k.weight create_tensor: loading tensor blk.23.attn_v.weight create_tensor: loading tensor blk.23.attn_output.weight create_tensor: loading tensor blk.23.ffn_norm.weight create_tensor: loading tensor blk.23.ffn_gate.weight create_tensor: loading tensor blk.23.ffn_down.weight create_tensor: loading tensor blk.23.ffn_up.weight create_tensor: loading tensor blk.24.attn_norm.weight create_tensor: loading tensor blk.24.attn_q.weight create_tensor: loading tensor blk.24.attn_k.weight create_tensor: loading tensor blk.24.attn_v.weight create_tensor: loading tensor blk.24.attn_output.weight create_tensor: loading tensor blk.24.ffn_norm.weight create_tensor: loading tensor blk.24.ffn_gate.weight create_tensor: loading tensor blk.24.ffn_down.weight create_tensor: loading tensor blk.24.ffn_up.weight create_tensor: loading tensor blk.25.attn_norm.weight create_tensor: loading tensor blk.25.attn_q.weight create_tensor: loading tensor blk.25.attn_k.weight create_tensor: loading tensor blk.25.attn_v.weight create_tensor: loading tensor blk.25.attn_output.weight create_tensor: loading tensor blk.25.ffn_norm.weight create_tensor: loading tensor blk.25.ffn_gate.weight create_tensor: loading tensor blk.25.ffn_down.weight create_tensor: loading tensor blk.25.ffn_up.weight create_tensor: loading tensor blk.26.attn_norm.weight create_tensor: loading tensor blk.26.attn_q.weight create_tensor: loading tensor blk.26.attn_k.weight create_tensor: loading tensor blk.26.attn_v.weight create_tensor: loading tensor blk.26.attn_output.weight create_tensor: loading tensor blk.26.ffn_norm.weight create_tensor: loading tensor blk.26.ffn_gate.weight create_tensor: loading tensor blk.26.ffn_down.weight create_tensor: loading tensor blk.26.ffn_up.weight create_tensor: loading tensor blk.27.attn_norm.weight create_tensor: loading tensor blk.27.attn_q.weight create_tensor: loading tensor blk.27.attn_k.weight create_tensor: loading tensor blk.27.attn_v.weight create_tensor: loading tensor blk.27.attn_output.weight create_tensor: loading tensor blk.27.ffn_norm.weight create_tensor: loading tensor blk.27.ffn_gate.weight create_tensor: loading tensor blk.27.ffn_down.weight create_tensor: loading tensor blk.27.ffn_up.weight load_tensors: offloading 28 repeating layers to GPU load_tensors: offloading output layer to GPU load_tensors: offloaded 29/29 layers to GPU load_tensors: CPU_Mapped model buffer size = 308.23 MiB load_tensors: Metal_Mapped model buffer size = 1918.35 MiB llama_init_from_model: model default pooling_type is [0], but [-1] was specified llama_context: constructing llama_context llama_context: n_seq_max = 1 llama_context: n_ctx = 16384 llama_context: n_ctx_per_seq = 16384 llama_context: n_batch = 512 llama_context: n_ubatch = 512 llama_context: causal_attn = 1 llama_context: flash_attn = disabled llama_context: kv_unified = false llama_context: freq_base = 500000.0 llama_context: freq_scale = 1 llama_context: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized ggml_metal_init: allocating ggml_metal_init: picking default device: Apple M2 ggml_metal_init: use bfloat = true ggml_metal_init: use fusion = true ggml_metal_init: use concurrency = true ggml_metal_init: use graph optimize = true set_abort_callback: call llama_context: CPU output buffer size = 0.50 MiB create_memory: n_ctx = 16384 (padded) llama_kv_cache: layer 0: dev = 
Metal
llama_kv_cache: layer 1: dev = Metal
llama_kv_cache: layer 2: dev = Metal
llama_kv_cache: layer 3: dev = Metal
llama_kv_cache: layer 4: dev = Metal
llama_kv_cache: layer 5: dev = Metal
llama_kv_cache: layer 6: dev = Metal
llama_kv_cache: layer 7: dev = Metal
llama_kv_cache: layer 8: dev = Metal
llama_kv_cache: layer 9: dev = Metal
llama_kv_cache: layer 10: dev = Metal
llama_kv_cache: layer 11: dev = Metal
llama_kv_cache: layer 12: dev = Metal
llama_kv_cache: layer 13: dev = Metal
llama_kv_cache: layer 14: dev = Metal
llama_kv_cache: layer 15: dev = Metal
llama_kv_cache: layer 16: dev = Metal
llama_kv_cache: layer 17: dev = Metal
llama_kv_cache: layer 18: dev = Metal
llama_kv_cache: layer 19: dev = Metal
llama_kv_cache: layer 20: dev = Metal
llama_kv_cache: layer 21: dev = Metal
llama_kv_cache: layer 22: dev = Metal
llama_kv_cache: layer 23: dev = Metal
llama_kv_cache: layer 24: dev = Metal
llama_kv_cache: layer 25: dev = Metal
llama_kv_cache: layer 26: dev = Metal
llama_kv_cache: layer 27: dev = Metal
llama_kv_cache: Metal KV buffer size = 1792.00 MiB
time=2025-11-04T05:31:08.533-06:00 level=DEBUG source=server.go:1316 msg="model load progress 1.00"
llama_kv_cache: size = 1792.00 MiB ( 16384 cells, 28 layers, 1/1 seqs), K (f16): 896.00 MiB, V (f16): 896.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 3
llama_context: max_nodes = 2048
llama_context: reserving full memory module
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 1
graph_reserve: reserving a graph for ubatch with n_tokens = 512, n_seqs = 1, n_outputs = 512
graph_reserve: reserving a graph for ubatch with n_tokens = 1, n_seqs = 1, n_outputs = 1
graph_reserve: reserving a graph for ubatch with n_tokens = 512, n_seqs = 1, n_outputs = 512
llama_context: Metal compute buffer size = 816.01 MiB
llama_context: CPU compute buffer size = 42.01 MiB
llama_context: graph nodes = 1014
llama_context: graph splits = 2
time=2025-11-04T05:31:08.785-06:00 level=INFO source=server.go:1310 msg="llama runner started in 45.82 seconds"
time=2025-11-04T05:31:08.785-06:00 level=INFO source=sched.go:482 msg="loaded runners" count=1
time=2025-11-04T05:31:08.785-06:00 level=DEBUG source=sched.go:587 msg="evaluating already loaded" model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff
time=2025-11-04T05:31:08.785-06:00 level=INFO source=server.go:1272 msg="waiting for llama runner to start responding"
time=2025-11-04T05:31:08.785-06:00 level=INFO source=server.go:1310 msg="llama runner started in 45.82 seconds"
time=2025-11-04T05:31:08.785-06:00 level=DEBUG source=sched.go:494 msg="finished setting up" runner.name=registry.ollama.ai/library/llama3.2:3b-instruct-q4_K_M runner.inference="[{ID:0 Library:Metal}]" runner.size="4.6 GiB" runner.vram="4.6 GiB" runner.parallel=1 runner.pid=20964 runner.model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff runner.num_ctx=16384
time=2025-11-04T05:31:08.786-06:00 level=DEBUG source=sched.go:587 msg="evaluating already loaded" model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff
time=2025-11-04T05:31:08.788-06:00 level=DEBUG source=server.go:1422 msg="completion request" images=0 prompt=1079 format=""
time=2025-11-04T05:31:08.789-06:00 level=DEBUG source=server.go:1422 msg="completion request" images=0 prompt=1079 format=""
time=2025-11-04T05:31:08.789-06:00 level=DEBUG source=server.go:1422
msg="completion request" images=0 prompt=1023 format="" time=2025-11-04T05:31:08.793-06:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=229 used=0 remaining=229 ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_rms_norm_mul_f32_4', name = 'kernel_rms_norm_mul_f32_4' ggml_metal_library_compile_pipeline: loaded kernel_rms_norm_mul_f32_4 0x11ce09d40 | th_max = 1024 | th_width = 32 ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mm_q4_K_f32', name = 'kernel_mul_mm_q4_K_f32_bci=0_bco=1' ggml_metal_library_compile_pipeline: loaded kernel_mul_mm_q4_K_f32_bci=0_bco=1 0x11ce0ac50 | th_max = 896 | th_width = 32 ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mm_q6_K_f32', name = 'kernel_mul_mm_q6_K_f32_bci=0_bco=1' ggml_metal_library_compile_pipeline: loaded kernel_mul_mm_q6_K_f32_bci=0_bco=1 0x11ce0b200 | th_max = 896 | th_width = 32 ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_rope_norm_f32', name = 'kernel_rope_norm_f32' ggml_metal_library_compile_pipeline: loaded kernel_rope_norm_f32 0x10d806d20 | th_max = 1024 | th_width = 32 ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_set_rows_f16_i64', name = 'kernel_set_rows_f16_i64' ggml_metal_library_compile_pipeline: loaded kernel_set_rows_f16_i64 0x10d806f80 | th_max = 1024 | th_width = 32 ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mm_f16_f32', name = 'kernel_mul_mm_f16_f32_bci=0_bco=1' ggml_metal_library_compile_pipeline: loaded kernel_mul_mm_f16_f32_bci=0_bco=1 0x10d807a10 | th_max = 896 | th_width = 32 ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_soft_max_f32_4', name = 'kernel_soft_max_f32_4' ggml_metal_library_compile_pipeline: loaded kernel_soft_max_f32_4 0x10d807e80 | th_max = 1024 | th_width = 32 ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_cpy_f32_f32', name = 'kernel_cpy_f32_f32' ggml_metal_library_compile_pipeline: loaded kernel_cpy_f32_f32 0x10d808290 | th_max = 1024 | th_width = 32 ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_add_fuse_1', name = 'kernel_add_fuse_1' ggml_metal_library_compile_pipeline: loaded kernel_add_fuse_1 0x10d8088b0 | th_max = 1024 | th_width = 32 ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_swiglu_f32', name = 'kernel_swiglu_f32' ggml_metal_library_compile_pipeline: loaded kernel_swiglu_f32 0x10d809030 | th_max = 1024 | th_width = 32 ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_get_rows_f32', name = 'kernel_get_rows_f32' ggml_metal_library_compile_pipeline: loaded kernel_get_rows_f32 0x10d811900 | th_max = 1024 | th_width = 32 ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_add_row_c4_fuse_1', name = 'kernel_add_row_c4_fuse_1' ggml_metal_library_compile_pipeline: loaded kernel_add_row_c4_fuse_1 0x10d811cd0 | th_max = 1024 | th_width = 32 ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mv_q4_K_f32', name = 'kernel_mul_mv_q4_K_f32_nsg=2' ggml_metal_library_compile_pipeline: loaded kernel_mul_mv_q4_K_f32_nsg=2 0x10d812d30 | th_max = 768 | th_width = 32 ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mv_q6_K_f32', name = 'kernel_mul_mv_q6_K_f32_nsg=2' ggml_metal_library_compile_pipeline: loaded kernel_mul_mv_q6_K_f32_nsg=2 0x10d813490 | th_max = 1024 | th_width = 32 ggml_metal_library_compile_pipeline: compiling 
pipeline: base = 'kernel_mul_mv_f16_f32_4', name = 'kernel_mul_mv_f16_f32_4_nsg=1'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mv_f16_f32_4_nsg=1 0x10d815590 | th_max = 1024 | th_width = 32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mv_f16_f32_4', name = 'kernel_mul_mv_f16_f32_4_nsg=2'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mv_f16_f32_4_nsg=2 0x11ce07420 | th_max = 1024 | th_width = 32
[GIN] 2025/11/04 - 05:31:09 | 200 | 50.409380041s | 100.116.113.16 | POST "/v1/chat/completions"
time=2025-11-04T05:31:09.730-06:00 level=DEBUG source=sched.go:502 msg="context for request finished"
time=2025-11-04T05:31:09.730-06:00 level=DEBUG source=sched.go:312 msg="after processing request finished event" runner.name=registry.ollama.ai/library/llama3.2:3b-instruct-q4_K_M runner.inference="[{ID:0 Library:Metal}]" runner.size="4.6 GiB" runner.vram="4.6 GiB" runner.parallel=1 runner.pid=20964 runner.model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff runner.num_ctx=16384 refCount=2
time=2025-11-04T05:31:09.730-06:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=236 prompt=229 used=228 remaining=1
[GIN] 2025/11/04 - 05:31:09 | 200 | 2.663052667s | 100.116.113.16 | POST "/v1/chat/completions"
time=2025-11-04T05:31:09.940-06:00 level=DEBUG source=sched.go:389 msg="context for request finished" runner.name=registry.ollama.ai/library/llama3.2:3b-instruct-q4_K_M runner.inference="[{ID:0 Library:Metal}]" runner.size="4.6 GiB" runner.vram="4.6 GiB" runner.parallel=1 runner.pid=20964 runner.model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff runner.num_ctx=16384
time=2025-11-04T05:31:09.940-06:00 level=DEBUG source=sched.go:312 msg="after processing request finished event" runner.name=registry.ollama.ai/library/llama3.2:3b-instruct-q4_K_M runner.inference="[{ID:0 Library:Metal}]" runner.size="4.6 GiB" runner.vram="4.6 GiB" runner.parallel=1 runner.pid=20964 runner.model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff runner.num_ctx=16384 refCount=1
time=2025-11-04T05:31:09.940-06:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=236 prompt=220 used=208 remaining=12
[GIN] 2025/11/04 - 05:31:10 | 200 | 50.888554209s | 100.116.113.16 | POST "/v1/chat/completions"
time=2025-11-04T05:31:10.211-06:00 level=DEBUG source=sched.go:389 msg="context for request finished" runner.name=registry.ollama.ai/library/llama3.2:3b-instruct-q4_K_M runner.inference="[{ID:0 Library:Metal}]" runner.size="4.6 GiB" runner.vram="4.6 GiB" runner.parallel=1 runner.pid=20964 runner.model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff runner.num_ctx=16384
time=2025-11-04T05:31:10.211-06:00 level=DEBUG source=sched.go:294 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/llama3.2:3b-instruct-q4_K_M runner.inference="[{ID:0 Library:Metal}]" runner.size="4.6 GiB" runner.vram="4.6 GiB" runner.parallel=1 runner.pid=20964 runner.model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff runner.num_ctx=16384 duration=2562047h47m16.854775807s
time=2025-11-04T05:31:10.211-06:00 level=DEBUG source=sched.go:312 msg="after processing request finished event" runner.name=registry.ollama.ai/library/llama3.2:3b-instruct-q4_K_M runner.inference="[{ID:0 Library:Metal}]" runner.size="4.6 GiB" runner.vram="4.6 GiB" runner.parallel=1 runner.pid=20964 runner.model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff runner.num_ctx=16384 refCount=0
```

</details>

<details>
<summary>0.12.9 log</summary>

```log
time=2025-11-04T05:37:14.982-06:00 level=INFO source=runner.go:76 msg="discovering available GPUs..."
time=2025-11-04T05:37:14.984-06:00 level=INFO source=server.go:400 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --ollama-engine --port 52548"
time=2025-11-04T05:37:14.984-06:00 level=DEBUG source=server.go:401 msg=subprocess OLLAMA_DEBUG=1 PATH="/Users/pjv/.local/bin:/Users/pjv/go/bin:/opt/homebrew/sbin:/opt/homebrew/bin:/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin" OLLAMA_HOST=0.0.0.0 OLLAMA_MODELS=/Users/pjv/.ollama/models OLLAMA_CONTEXT_LENGTH=16384 OLLAMA_KEEP_ALIVE=-1 DYLD_LIBRARY_PATH=/Applications/Ollama.app/Contents/Resources OLLAMA_LIBRARY_PATH=/Applications/Ollama.app/Contents/Resources
time=2025-11-04T05:37:15.101-06:00 level=DEBUG source=runner.go:471 msg="bootstrap discovery took" duration=118.973083ms OLLAMA_LIBRARY_PATH=[/Applications/Ollama.app/Contents/Resources] extra_envs=map[]
time=2025-11-04T05:37:15.101-06:00 level=DEBUG source=runner.go:120 msg="evluating which if any devices to filter out" initial_count=1
time=2025-11-04T05:37:15.101-06:00 level=DEBUG source=runner.go:41 msg="GPU bootstrap discovery took" duration=119.23475ms
time=2025-11-04T05:37:15.101-06:00 level=INFO source=types.go:42 msg="inference compute" id=0 filtered_id="" library=Metal compute=0.0 name=Metal description="Apple M2" libdirs="" driver=0.0 pci_id="" type=discrete total="10.7 GiB" available="10.7 GiB"
time=2025-11-04T05:37:15.101-06:00 level=INFO source=routes.go:1618 msg="entering low vram mode" "total vram"="10.7 GiB" threshold="20.0 GiB"
[GIN] 2025/11/04 - 05:37:15 | 200 | 110µs | 127.0.0.1 | HEAD "/"
[GIN] 2025/11/04 - 05:37:15 | 200 | 239.5µs | 127.0.0.1 | GET "/api/ps"
time=2025-11-04T05:37:34.851-06:00 level=DEBUG source=runner.go:41 msg="overall device VRAM discovery took" duration=8.542µs
time=2025-11-04T05:37:34.851-06:00 level=DEBUG source=sched.go:189 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=3 gpu_count=1
time=2025-11-04T05:37:34.859-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.alignment default=32
time=2025-11-04T05:37:34.860-06:00 level=DEBUG source=sched.go:204 msg="loading first model" model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a
time=2025-11-04T05:37:34.890-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.alignment default=32
time=2025-11-04T05:37:34.890-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.pooling_type default=0
time=2025-11-04T05:37:34.890-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=tokenizer.ggml.add_eos_token default=false
time=2025-11-04T05:37:34.890-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-11-04T05:37:34.890-06:00 level=DEBUG source=ggml.go:276
msg="key with type not found" key=qwen3.rope.scaling.type default="" time=2025-11-04T05:37:34.890-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.rope.scaling.factor default=1 time=2025-11-04T05:37:34.890-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.rope.scaling.original_context_length default=0 time=2025-11-04T05:37:34.890-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.expert_count default=0 time=2025-11-04T05:37:34.890-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.expert_used_count default=0 time=2025-11-04T05:37:34.890-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.norm_top_k_prob default=true time=2025-11-04T05:37:34.890-06:00 level=INFO source=server.go:215 msg="enabling flash attention" time=2025-11-04T05:37:34.891-06:00 level=INFO source=server.go:400 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --ollama-engine --model /Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a --port 52554" time=2025-11-04T05:37:34.891-06:00 level=DEBUG source=server.go:401 msg=subprocess OLLAMA_DEBUG=1 PATH="/Users/pjv/.local/bin:/Users/pjv/go/bin:/opt/homebrew/sbin:/opt/homebrew/bin:/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin" OLLAMA_HOST=0.0.0.0 OLLAMA_MODELS=/Users/pjv/.ollama/models OLLAMA_CONTEXT_LENGTH=16384 OLLAMA_KEEP_ALIVE=-1 DYLD_LIBRARY_PATH=/Applications/Ollama.app/Contents/Resources OLLAMA_LIBRARY_PATH=/Applications/Ollama.app/Contents/Resources time=2025-11-04T05:37:34.892-06:00 level=INFO source=server.go:653 msg="loading model" "model layers"=37 requested=-1 time=2025-11-04T05:37:34.892-06:00 level=INFO source=server.go:658 msg="system memory" total="16.0 GiB" free="9.8 GiB" free_swap="0 B" time=2025-11-04T05:37:34.892-06:00 level=INFO source=server.go:665 msg="gpu memory" id=0 library=Metal available="10.2 GiB" free="10.7 GiB" minimum="512.0 MiB" overhead="0 B" time=2025-11-04T05:37:34.900-06:00 level=INFO source=runner.go:1349 msg="starting ollama engine" time=2025-11-04T05:37:34.900-06:00 level=INFO source=runner.go:1384 msg="Server listening on 127.0.0.1:52554" time=2025-11-04T05:37:34.903-06:00 level=INFO source=runner.go:1222 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:28000 KvCacheType: NumThreads:4 GPULayers:37[ID:0 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2025-11-04T05:37:34.918-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.alignment default=32 time=2025-11-04T05:37:34.918-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.description default="" time=2025-11-04T05:37:34.918-06:00 level=INFO source=ggml.go:136 msg="" architecture=qwen3 file_type=Q8_0 name="Qwen3 4B Instruct 2507" description="" num_tensors=398 num_key_values=33 time=2025-11-04T05:37:34.918-06:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/Applications/Ollama.app/Contents/Resources ggml_metal_library_init: using embedded metal library ggml_metal_library_init: loaded in 0.006 sec ggml_metal_device_init: GPU name: Apple M2 
ggml_metal_device_init: GPU family: MTLGPUFamilyApple8 (1008) ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003) ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3 (5001) ggml_metal_device_init: simdgroup reduction = true ggml_metal_device_init: simdgroup matrix mul. = true ggml_metal_device_init: has unified memory = true ggml_metal_device_init: has bfloat = true ggml_metal_device_init: use residency sets = true ggml_metal_device_init: use shared buffers = true ggml_metal_device_init: recommendedMaxWorkingSetSize = 11453.25 MB time=2025-11-04T05:37:34.919-06:00 level=INFO source=ggml.go:104 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 compiler=cgo(clang) ggml_metal_init: allocating ggml_metal_init: picking default device: Apple M2 ggml_metal_init: use bfloat = true ggml_metal_init: use fusion = true ggml_metal_init: use concurrency = true ggml_metal_init: use graph optimize = true time=2025-11-04T05:37:35.006-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.pooling_type default=0 time=2025-11-04T05:37:35.006-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=tokenizer.ggml.add_eos_token default=false time=2025-11-04T05:37:35.006-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}" time=2025-11-04T05:37:35.006-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.rope.scaling.type default="" time=2025-11-04T05:37:35.006-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.rope.scaling.factor default=1 time=2025-11-04T05:37:35.006-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.rope.scaling.original_context_length default=0 time=2025-11-04T05:37:35.006-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.expert_count default=0 time=2025-11-04T05:37:35.006-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.expert_used_count default=0 time=2025-11-04T05:37:35.006-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.norm_top_k_prob default=true time=2025-11-04T05:37:35.033-06:00 level=DEBUG source=ggml.go:857 msg="compute graph" nodes=1231 splits=2 time=2025-11-04T05:37:35.035-06:00 level=DEBUG source=ggml.go:857 msg="compute graph" nodes=1231 splits=2 time=2025-11-04T05:37:35.035-06:00 level=DEBUG source=device.go:212 msg="model weights" device=Metal size="4.0 GiB" time=2025-11-04T05:37:35.035-06:00 level=DEBUG source=device.go:217 msg="model weights" device=CPU size="394.1 MiB" time=2025-11-04T05:37:35.035-06:00 level=DEBUG source=device.go:223 msg="kv cache" device=Metal size="3.9 GiB" time=2025-11-04T05:37:35.035-06:00 level=DEBUG source=device.go:234 msg="compute graph" device=Metal size="131.5 MiB" time=2025-11-04T05:37:35.035-06:00 level=DEBUG source=device.go:239 msg="compute graph" device=CPU size="5.0 MiB" time=2025-11-04T05:37:35.035-06:00 level=DEBUG source=device.go:244 msg="total memory" size="8.4 GiB" time=2025-11-04T05:37:35.035-06:00 level=DEBUG source=server.go:695 msg=memory success=true required.InputWeights=413265920 required.CPU.Graph=5242880 required.Metal.ID=0 required.Metal.Weights="[107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 
107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 413276160]" required.Metal.Cache="[115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 0]" required.Metal.Graph=137889824 time=2025-11-04T05:37:35.035-06:00 level=DEBUG source=server.go:892 msg="available gpu" id=0 library=Metal "available layer vram"="10.0 GiB" backoff=0.00 minimum="512.0 MiB" overhead="0 B" graph="131.5 MiB" time=2025-11-04T05:37:35.035-06:00 level=DEBUG source=server.go:706 msg="new layout created" layers="37[ID:0 Layers:37(0..36)]" time=2025-11-04T05:37:35.036-06:00 level=INFO source=runner.go:1222 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:28000 KvCacheType: NumThreads:4 GPULayers:37[ID:0 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2025-11-04T05:37:35.049-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.alignment default=32 time=2025-11-04T05:37:35.051-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.pooling_type default=0 time=2025-11-04T05:37:35.051-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=tokenizer.ggml.add_eos_token default=false time=2025-11-04T05:37:35.051-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}" time=2025-11-04T05:37:35.051-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.rope.scaling.type default="" time=2025-11-04T05:37:35.051-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.rope.scaling.factor default=1 time=2025-11-04T05:37:35.051-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.rope.scaling.original_context_length default=0 time=2025-11-04T05:37:35.051-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.expert_count default=0 time=2025-11-04T05:37:35.051-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.expert_used_count default=0 time=2025-11-04T05:37:35.051-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.norm_top_k_prob default=true time=2025-11-04T05:37:35.326-06:00 level=DEBUG source=ggml.go:857 msg="compute graph" nodes=1231 splits=2 time=2025-11-04T05:37:35.329-06:00 level=DEBUG source=ggml.go:857 msg="compute graph" nodes=1231 splits=2 time=2025-11-04T05:37:35.329-06:00 level=DEBUG source=device.go:212 msg="model weights" device=Metal size="4.0 GiB" time=2025-11-04T05:37:35.329-06:00 level=DEBUG source=device.go:217 msg="model weights" device=CPU size="394.1 MiB" time=2025-11-04T05:37:35.329-06:00 level=DEBUG source=device.go:223 msg="kv cache" device=Metal size="3.9 GiB" time=2025-11-04T05:37:35.329-06:00 level=DEBUG source=device.go:234 msg="compute graph" device=Metal size="131.5 MiB" time=2025-11-04T05:37:35.329-06:00 level=DEBUG source=device.go:239 msg="compute graph" device=CPU size="5.0 MiB" time=2025-11-04T05:37:35.329-06:00 level=DEBUG source=device.go:244 msg="total memory" size="8.4 GiB" time=2025-11-04T05:37:35.329-06:00 level=DEBUG source=server.go:695 
msg=memory success=true required.InputWeights=413265920 required.CPU.Graph=5242880 required.Metal.ID=0 required.Metal.Weights="[107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 107254784 413276160]" required.Metal.Cache="[115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 115343360 0]" required.Metal.Graph=137889824 time=2025-11-04T05:37:35.329-06:00 level=DEBUG source=server.go:892 msg="available gpu" id=0 library=Metal "available layer vram"="10.0 GiB" backoff=0.00 minimum="512.0 MiB" overhead="0 B" graph="131.5 MiB" time=2025-11-04T05:37:35.329-06:00 level=DEBUG source=server.go:706 msg="new layout created" layers="37[ID:0 Layers:37(0..36)]" time=2025-11-04T05:37:35.329-06:00 level=INFO source=runner.go:1222 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:28000 KvCacheType: NumThreads:4 GPULayers:37[ID:0 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2025-11-04T05:37:35.329-06:00 level=INFO source=ggml.go:482 msg="offloading 36 repeating layers to GPU" time=2025-11-04T05:37:35.329-06:00 level=INFO source=ggml.go:489 msg="offloading output layer to GPU" time=2025-11-04T05:37:35.329-06:00 level=INFO source=ggml.go:494 msg="offloaded 37/37 layers to GPU" time=2025-11-04T05:37:35.329-06:00 level=INFO source=device.go:212 msg="model weights" device=Metal size="4.0 GiB" time=2025-11-04T05:37:35.329-06:00 level=INFO source=device.go:217 msg="model weights" device=CPU size="394.1 MiB" time=2025-11-04T05:37:35.329-06:00 level=INFO source=device.go:223 msg="kv cache" device=Metal size="3.9 GiB" time=2025-11-04T05:37:35.330-06:00 level=INFO source=device.go:234 msg="compute graph" device=Metal size="131.5 MiB" time=2025-11-04T05:37:35.330-06:00 level=INFO source=device.go:239 msg="compute graph" device=CPU size="5.0 MiB" time=2025-11-04T05:37:35.330-06:00 level=INFO source=device.go:244 msg="total memory" size="8.4 GiB" time=2025-11-04T05:37:35.330-06:00 level=INFO source=sched.go:493 msg="loaded runners" count=1 time=2025-11-04T05:37:35.330-06:00 level=DEBUG source=runner.go:41 msg="overall device VRAM discovery took" duration=250ns time=2025-11-04T05:37:35.330-06:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding" time=2025-11-04T05:37:35.330-06:00 level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model" time=2025-11-04T05:37:35.340-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.alignment default=32 time=2025-11-04T05:37:35.580-06:00 level=DEBUG source=server.go:1295 msg="model load progress 0.16" time=2025-11-04T05:37:35.831-06:00 level=DEBUG source=server.go:1295 msg="model load progress 0.32" time=2025-11-04T05:37:36.081-06:00 level=DEBUG source=server.go:1295 msg="model load progress 0.48" time=2025-11-04T05:37:36.331-06:00 level=DEBUG source=server.go:1295 msg="model load progress 0.64" 
time=2025-11-04T05:37:35.330-06:00 level=INFO source=sched.go:493 msg="loaded runners" count=1
time=2025-11-04T05:37:35.330-06:00 level=DEBUG source=runner.go:41 msg="overall device VRAM discovery took" duration=250ns
time=2025-11-04T05:37:35.330-06:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding"
time=2025-11-04T05:37:35.330-06:00 level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model"
time=2025-11-04T05:37:35.340-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.alignment default=32
time=2025-11-04T05:37:35.580-06:00 level=DEBUG source=server.go:1295 msg="model load progress 0.16"
time=2025-11-04T05:37:35.831-06:00 level=DEBUG source=server.go:1295 msg="model load progress 0.32"
time=2025-11-04T05:37:36.081-06:00 level=DEBUG source=server.go:1295 msg="model load progress 0.48"
time=2025-11-04T05:37:36.331-06:00 level=DEBUG source=server.go:1295 msg="model load progress 0.64"
time=2025-11-04T05:37:36.582-06:00 level=DEBUG source=server.go:1295 msg="model load progress 0.79"
time=2025-11-04T05:37:36.834-06:00 level=DEBUG source=server.go:1295 msg="model load progress 0.91"
time=2025-11-04T05:37:37.086-06:00 level=DEBUG source=server.go:1295 msg="model load progress 0.95"
time=2025-11-04T05:37:37.337-06:00 level=DEBUG source=server.go:1295 msg="model load progress 0.95"
time=2025-11-04T05:37:37.802-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=qwen3.pooling_type default=0
time=2025-11-04T05:37:37.840-06:00 level=INFO source=server.go:1289 msg="llama runner started in 2.95 seconds"
time=2025-11-04T05:37:37.840-06:00 level=DEBUG source=sched.go:505 msg="finished setting up" runner.name=registry.ollama.ai/library/qwen3:4b-instruct-28k runner.inference="[{ID:0 Library:Metal}]" runner.size="8.4 GiB" runner.vram="8.4 GiB" runner.parallel=1 runner.pid=21138 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a runner.num_ctx=28000
time=2025-11-04T05:37:37.840-06:00 level=DEBUG source=sched.go:548 msg="gpu reported" gpu=0 library=Metal available="10.7 GiB"
time=2025-11-04T05:37:37.840-06:00 level=INFO source=sched.go:559 msg="updated VRAM based on existing loaded models" gpu=0 library=Metal total="10.7 GiB" available="2.7 GiB"
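The "available=2.7 GiB" here looks like total VRAM minus only the Metal-resident share of the runner, i.e. the CPU-resident pieces (394.1 MiB of weights, 5.0 MiB of graph) are not charged against the GPU. A rough check under that assumption (my reading of the log, not ollama's code):

```go
package main

import "fmt"

func main() {
	total := 10.7                     // GiB of Metal VRAM reported at startup
	metalResident := 4.0 + 3.9 + 0.13 // weights + KV cache + compute graph on Metal, in GiB
	// The CPU-resident weights and graph are excluded from the VRAM accounting.
	fmt.Printf("available ≈ %.1f GiB\n", total-metalResident) // 2.7 GiB, matching the log
}
```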
time=2025-11-04T05:37:37.885-06:00 level=DEBUG source=server.go:1401 msg="completion request" images=0 prompt=43962 format=""
time=2025-11-04T05:37:37.907-06:00 level=DEBUG source=cache.go:142 msg="loading cache slot" id=0 cache=0 prompt=10012 used=0 remaining=10012
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_rms_norm_mul_f32_4', name = 'kernel_rms_norm_mul_f32_4'
ggml_metal_library_compile_pipeline: loaded kernel_rms_norm_mul_f32_4 0x14d70ca50 | th_max = 1024 | th_width = 32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mm_q8_0_f32', name = 'kernel_mul_mm_q8_0_f32_bci=0_bco=0'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mm_q8_0_f32_bci=0_bco=0 0x14d70d840 | th_max = 1024 | th_width = 32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_rope_neox_f32', name = 'kernel_rope_neox_f32'
ggml_metal_library_compile_pipeline: loaded kernel_rope_neox_f32 0x14d70daa0 | th_max = 1024 | th_width = 32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_cpy_f32_f16', name = 'kernel_cpy_f32_f16'
ggml_metal_library_compile_pipeline: loaded kernel_cpy_f32_f16 0x14d70dd00 | th_max = 1024 | th_width = 32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_flash_attn_ext_blk', name = 'kernel_flash_attn_ext_blk_nqptg=8_ncpsg=64'
ggml_metal_library_compile_pipeline: loaded kernel_flash_attn_ext_blk_nqptg=8_ncpsg=64 0x14d70e660 | th_max = 1024 | th_width = 32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_flash_attn_ext_f16_dk128_dv128', name = 'kernel_flash_attn_ext_f16_dk128_dv128_mask=1_sinks=0_bias=0_scap=0_kvpad=0_bcm=0_ns10=1024_ns20=1024_nsg=4'
ggml_metal_library_compile_pipeline: loaded kernel_flash_attn_ext_f16_dk128_dv128_mask=1_sinks=0_bias=0_scap=0_kvpad=0_bcm=0_ns10=1024_ns20=1024_nsg=4 0x153f048e0 | th_max = 768 | th_width = 32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_add_fuse_1', name = 'kernel_add_fuse_1'
ggml_metal_library_compile_pipeline: loaded kernel_add_fuse_1 0x153f043e0 | th_max = 1024 | th_width = 32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_swiglu_f32', name = 'kernel_swiglu_f32'
ggml_metal_library_compile_pipeline: loaded kernel_swiglu_f32 0x155004380 | th_max = 1024 | th_width = 32
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.018 sec
ggml_metal_device_init: GPU name:   Apple M2
ggml_metal_device_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 11453.25 MB
llama_model_load_from_file_impl: using device Metal (Apple M2) (unknown id) - 10922 MiB free
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.2 3B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Llama-3.2
llama_model_loader: - kv 5: general.size_label str = 3B
llama_model_loader: - kv 6: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 7: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 8: llama.block_count u32 = 28
llama_model_loader: - kv 9: llama.context_length u32 = 131072
llama_model_loader: - kv 10: llama.embedding_length u32 = 3072
llama_model_loader: - kv 11: llama.feed_forward_length u32 = 8192
llama_model_loader: - kv 12: llama.attention.head_count u32 = 24
llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 16: llama.attention.key_length u32 = 128
llama_model_loader: - kv 17: llama.attention.value_length u32 = 128
llama_model_loader: - kv 18: general.file_type u32 = 15
llama_model_loader: - kv 19: llama.vocab_size u32 = 128256
llama_model_loader: - kv 20: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 22: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 28: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 29: general.quantization_version u32 = 2
llama_model_loader: - type f32: 58 tensors
llama_model_loader: - type q4_K: 168 tensors
llama_model_loader: - type q6_K: 29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 1.87 GiB (5.01 BPW)
init_tokenizer: initializing tokenizer for type 2
load: control token: 128254 '<|reserved_special_token_246|>' is not marked as EOG
load: control token: 128252 '<|reserved_special_token_244|>' is not marked as EOG
[... ~250 more identical "load: control token: ... is not marked as EOG" lines, one per reserved/control token, trimmed for readability ...]
load: printing all EOG tokens:
load: - 128001 ('<|end_of_text|>')
load: - 128008 ('<|eom_id|>')
load: - 128009 ('<|eot_id|>')
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch = llama
print_info: vocab_only = 1
print_info: model type = ?B
print_info: model params = 3.21 B
print_info: general.name = Llama 3.2 3B Instruct
print_info: vocab type = BPE
print_info: n_vocab = 128256
print_info: n_merges = 280147
print_info: BOS token = 128000 '<|begin_of_text|>'
print_info: EOS token = 128009 '<|eot_id|>'
print_info: EOT token = 128001 '<|end_of_text|>'
print_info: EOM token = 128008 '<|eom_id|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 128001 '<|end_of_text|>'
print_info: EOG token = 128008 '<|eom_id|>'
print_info: EOG token = 128009 '<|eot_id|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-11-04T05:37:38.158-06:00 level=INFO source=server.go:400 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --model /Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --port 52557"
time=2025-11-04T05:37:38.158-06:00 level=DEBUG source=server.go:401 msg=subprocess OLLAMA_DEBUG=1 PATH="/Users/pjv/.local/bin:/Users/pjv/go/bin:/opt/homebrew/sbin:/opt/homebrew/bin:/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin" OLLAMA_HOST=0.0.0.0 OLLAMA_MODELS=/Users/pjv/.ollama/models OLLAMA_CONTEXT_LENGTH=16384 OLLAMA_KEEP_ALIVE=-1 DYLD_LIBRARY_PATH=/Applications/Ollama.app/Contents/Resources OLLAMA_LIBRARY_PATH=/Applications/Ollama.app/Contents/Resources
time=2025-11-04T05:37:38.163-06:00 level=INFO source=server.go:470 msg="system memory" total="16.0 GiB" free="6.6 GiB" free_swap="0 B"
time=2025-11-04T05:37:38.163-06:00 level=DEBUG source=memory.go:198 msg=evaluating library=Metal gpu_count=1 available="[2.7 GiB]"
time=2025-11-04T05:37:38.163-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=llama.vision.block_count default=0
time=2025-11-04T05:37:38.163-06:00 level=DEBUG source=ggml.go:611 msg="default cache size estimate" "attention MiB"=1792 "attention bytes"=1879048192 "recurrent MiB"=0 "recurrent bytes"=0
time=2025-11-04T05:37:38.164-06:00 level=INFO source=server.go:483 msg="model requires more memory than is currently available, evicting a model to make space" estimate.library="" estimate.layers.requested=0 estimate.layers.model=0 estimate.layers.offload=0 estimate.layers.split=[] estimate.memory.available=[] estimate.memory.gpu_overhead="0 B" estimate.memory.required.full="0 B" estimate.memory.required.partial="0 B" estimate.memory.required.kv="0 B" estimate.memory.required.allocations=[] estimate.memory.weights.total="0 B" estimate.memory.weights.repeating="0 B" estimate.memory.weights.nonrepeating="0 B" estimate.memory.graph.full="0 B" estimate.memory.graph.partial="0 B"
time=2025-11-04T05:37:38.164-06:00 level=DEBUG source=sched.go:804 msg="no idle runners, picking the shortest duration" runner_count=1 runner.name=registry.ollama.ai/library/qwen3:4b-instruct-28k runner.inference="[{ID:0 Library:Metal}]" runner.size="8.4 GiB" runner.vram="8.4 GiB" runner.parallel=1 runner.pid=21138 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a runner.num_ctx=28000
time=2025-11-04T05:37:38.164-06:00 level=DEBUG source=sched.go:229 msg="resetting model to expire immediately to make room" runner.name=registry.ollama.ai/library/qwen3:4b-instruct-28k runner.inference="[{ID:0 Library:Metal}]" runner.size="8.4 GiB" runner.vram="8.4 GiB" runner.parallel=1 runner.pid=21138 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a runner.num_ctx=28000 refCount=1
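For context on the eviction above: since I run with OLLAMA_KEEP_ALIVE=-1, a runner is never idle, so the scheduler falls back to picking the loaded runner whose session expires soonest and forcing it to expire now. A hypothetical sketch of that "shortest duration" pick (the types and fields are illustrative, not ollama's actual scheduler code):

```go
package main

import (
	"fmt"
	"time"
)

// runner is illustrative only; ollama's real scheduler types differ.
type runner struct {
	name      string
	expiresAt time.Time
	refCount  int // in-flight requests that must drain before unload
}

// pickVictim returns the loaded runner whose session ends soonest.
func pickVictim(runners []*runner) *runner {
	var victim *runner
	for _, r := range runners {
		if victim == nil || r.expiresAt.Before(victim.expiresAt) {
			victim = r
		}
	}
	return victim
}

func main() {
	loaded := []*runner{{name: "qwen3:4b-instruct-28k", expiresAt: time.Now().Add(time.Hour), refCount: 1}}
	v := pickVictim(loaded)
	v.expiresAt = time.Now() // "resetting model to expire immediately to make room"
	fmt.Println("evicting", v.name, "once its", v.refCount, "pending request(s) finish")
}
```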
time=2025-11-04T05:37:38.164-06:00 level=DEBUG source=sched.go:240 msg="waiting for pending requests to complete and unload to occur" runner.name=registry.ollama.ai/library/qwen3:4b-instruct-28k runner.inference="[{ID:0 Library:Metal}]" runner.size="8.4 GiB" runner.vram="8.4 GiB" runner.parallel=1 runner.pid=21138 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a runner.num_ctx=28000
time=2025-11-04T05:37:38.176-06:00 level=INFO source=runner.go:910 msg="starting go runner"
time=2025-11-04T05:37:38.176-06:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/Applications/Ollama.app/Contents/Resources
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.023 sec
ggml_metal_device_init: GPU name:   Apple M2
ggml_metal_device_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 11453.25 MB
time=2025-11-04T05:37:38.177-06:00 level=INFO source=ggml.go:104 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 compiler=cgo(clang)
time=2025-11-04T05:37:38.282-06:00 level=INFO source=runner.go:946 msg="Server listening on 127.0.0.1:52557"
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mm_q8_0_f32', name = 'kernel_mul_mm_q8_0_f32_bci=0_bco=1'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mm_q8_0_f32_bci=0_bco=1 0x14d60f190 | th_max = 896 | th_width = 32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_get_rows_f32', name = 'kernel_get_rows_f32'
ggml_metal_library_compile_pipeline: loaded kernel_get_rows_f32 0x14d70f030 | th_max = 1024 | th_width = 32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_add_row_c4_fuse_1', name = 'kernel_add_row_c4_fuse_1'
ggml_metal_library_compile_pipeline: loaded kernel_add_row_c4_fuse_1 0x14d70f520 | th_max = 1024 | th_width = 32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mv_q8_0_f32', name = 'kernel_mul_mv_q8_0_f32_nsg=4'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mv_q8_0_f32_nsg=4 0x14d60f830 | th_max = 1024 | th_width = 32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_flash_attn_ext_vec_f16_dk128_dv128', name = 'kernel_flash_attn_ext_vec_f16_dk128_dv128_mask=1_sink=0_bias=0_scap=0_kvpad=0_ns10=1024_ns20=1024_nsg=4_nwg=32'
ggml_metal_library_compile_pipeline: loaded kernel_flash_attn_ext_vec_f16_dk128_dv128_mask=1_sink=0_bias=0_scap=0_kvpad=0_ns10=1024_ns20=1024_nsg=4_nwg=32 0x14d6107d0 | th_max = 448 | th_width = 32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_flash_attn_ext_vec_reduce', name = 'kernel_flash_attn_ext_vec_reduce_dv=128_nwg=32'
ggml_metal_library_compile_pipeline: loaded kernel_flash_attn_ext_vec_reduce_dv=128_nwg=32 0x14d610be0 | th_max = 1024 | th_width = 32
[GIN] 2025/11/04 - 05:38:22 | 200 | 48.20005125s | 100.116.113.16 | POST     "/v1/chat/completions"
time=2025-11-04T05:38:22.942-06:00 level=DEBUG source=sched.go:513 msg="context for request finished"
time=2025-11-04T05:38:22.942-06:00 level=DEBUG source=sched.go:276 msg="runner with zero duration has gone idle, expiring to unload" runner.name=registry.ollama.ai/library/qwen3:4b-instruct-28k runner.inference="[{ID:0 Library:Metal}]" runner.size="8.4 GiB" runner.vram="8.4 GiB" runner.parallel=1 runner.pid=21138 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a runner.num_ctx=28000
time=2025-11-04T05:38:22.942-06:00 level=DEBUG source=sched.go:301 msg="after processing request finished event" runner.name=registry.ollama.ai/library/qwen3:4b-instruct-28k runner.inference="[{ID:0 Library:Metal}]" runner.size="8.4 GiB" runner.vram="8.4 GiB" runner.parallel=1 runner.pid=21138 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a runner.num_ctx=28000 refCount=0
time=2025-11-04T05:38:22.942-06:00 level=DEBUG source=sched.go:304 msg="runner expired event received" runner.name=registry.ollama.ai/library/qwen3:4b-instruct-28k runner.inference="[{ID:0 Library:Metal}]" runner.size="8.4 GiB" runner.vram="8.4 GiB" runner.parallel=1 runner.pid=21138 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a runner.num_ctx=28000
time=2025-11-04T05:38:22.942-06:00 level=DEBUG source=sched.go:319 msg="got lock to unload expired event" runner.name=registry.ollama.ai/library/qwen3:4b-instruct-28k runner.inference="[{ID:0 Library:Metal}]" runner.size="8.4 GiB" runner.vram="8.4 GiB" runner.parallel=1 runner.pid=21138 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a runner.num_ctx=28000
time=2025-11-04T05:38:22.942-06:00 level=DEBUG source=sched.go:342 msg="starting background wait for VRAM recovery" runner.name=registry.ollama.ai/library/qwen3:4b-instruct-28k runner.inference="[{ID:0 Library:Metal}]" runner.size="8.4 GiB" runner.vram="8.4 GiB" runner.parallel=1 runner.pid=21138 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a runner.num_ctx=28000
time=2025-11-04T05:38:22.942-06:00 level=DEBUG source=sched.go:650 msg="no need to wait for VRAM recovery" runner.name=registry.ollama.ai/library/qwen3:4b-instruct-28k runner.inference="[{ID:0 Library:Metal}]" runner.size="8.4 GiB" runner.vram="8.4 GiB" runner.parallel=1 runner.pid=21138 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a runner.num_ctx=28000
time=2025-11-04T05:38:22.942-06:00 level=DEBUG source=server.go:1699 msg="stopping llama server" pid=21138
time=2025-11-04T05:38:22.942-06:00 level=DEBUG source=server.go:1705 msg="waiting for llama server to exit" pid=21138
time=2025-11-04T05:38:22.969-06:00 level=DEBUG source=server.go:1709 msg="llama server stopped" pid=21138
time=2025-11-04T05:38:22.969-06:00 level=DEBUG source=sched.go:351 msg="runner terminated and removed from list, blocking for VRAM recovery" runner.size="8.4 GiB" runner.vram="8.4 GiB" runner.parallel=1 runner.pid=21138 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a
time=2025-11-04T05:38:22.969-06:00 level=DEBUG source=sched.go:354 msg="sending an unloaded event" runner.size="8.4 GiB" runner.vram="8.4 GiB" runner.parallel=1 runner.pid=21138 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a
time=2025-11-04T05:38:22.969-06:00 level=DEBUG source=sched.go:246 msg="unload completed" runner.size="8.4 GiB" runner.vram="8.4 GiB" runner.parallel=1 runner.pid=21138 runner.model=/Users/pjv/.ollama/models/blobs/sha256-af6e43ab13611311226e6f809f6a39b1a87b6df613bbf79468059f92ce819c4a
time=2025-11-04T05:38:22.969-06:00 level=DEBUG source=runner.go:41 msg="overall device VRAM discovery took" duration=500ns
time=2025-11-04T05:38:22.979-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.alignment default=32
time=2025-11-04T05:38:22.979-06:00 level=DEBUG source=sched.go:204 msg="loading first model" model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff
time=2025-11-04T05:38:23.007-06:00 level=INFO source=server.go:470 msg="system memory" total="16.0 GiB" free="1.9 GiB" free_swap="0 B"
time=2025-11-04T05:38:23.007-06:00 level=DEBUG source=memory.go:198 msg=evaluating library=Metal gpu_count=1 available="[2.7 GiB]"
time=2025-11-04T05:38:23.007-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=llama.vision.block_count default=0
time=2025-11-04T05:38:23.008-06:00 level=DEBUG source=ggml.go:611 msg="default cache size estimate" "attention MiB"=1792 "attention bytes"=1879048192 "recurrent MiB"=0 "recurrent bytes"=0
time=2025-11-04T05:38:23.008-06:00 level=DEBUG source=memory.go:198 msg=evaluating library=Metal gpu_count=1 available="[2.7 GiB]"
time=2025-11-04T05:38:23.008-06:00 level=DEBUG source=ggml.go:276 msg="key with type not found" key=llama.vision.block_count default=0
time=2025-11-04T05:38:23.008-06:00 level=DEBUG source=ggml.go:611 msg="default cache size estimate" "attention MiB"=1792 "attention bytes"=1879048192 "recurrent MiB"=0 "recurrent bytes"=0
time=2025-11-04T05:38:23.008-06:00 level=INFO source=server.go:522 msg=offload library=Metal layers.requested=-1 layers.model=29 layers.offload=10 layers.split=[10] memory.available="[2.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="5.1 GiB" memory.required.partial="2.6 GiB" memory.required.kv="1.8 GiB" memory.required.allocations="[2.6 GiB]" memory.weights.total="1.9 GiB" memory.weights.repeating="1.6 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="824.0 MiB" memory.graph.partial="824.0 MiB"
time=2025-11-04T05:38:23.013-06:00 level=INFO source=runner.go:845 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:16384 KvCacheType: NumThreads:4 GPULayers:10[ID:0 Layers:10(18..27)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
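The partial-offload decision above (layers.offload=10 of 29 against 2.7 GiB "available") is roughly what a greedy fit against the logged per-layer costs produces, if you assume the 512.0 MiB "minimum" seen earlier is held back as a floor. A toy version of that calculation (my approximation, not ollama's actual estimator):

```go
package main

import "fmt"

func main() {
	available := 2.7 * 1024 // MiB reported as memory.available
	reserve := 512.0        // MiB "minimum" floor from earlier in the log (my assumption)
	graph := 824.0          // MiB partial compute graph, from memory.graph.partial
	// ~1.9 GiB of weights plus ~1.8 GiB of KV cache spread over 28 blocks:
	perLayer := (1.9*1024 + 1.8*1024) / 28 // ≈ 135 MiB per offloaded block

	layers := 0
	for used := reserve + graph; layers < 28 && used+perLayer <= available; layers++ {
		used += perLayer
	}
	fmt.Println("offloadable layers ≈", layers) // prints 10, matching layers.offload=10
}
```

Whatever the exact formula, the upshot is the same: because the qwen3 runner's VRAM was still counted as in use, 0.12.9 only offloads 10 of 29 layers here, where 0.12.6 would have fit the whole model.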
time=2025-11-04T05:38:23.016-06:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding"
llama_model_load_from_file_impl: using device Metal (Apple M2) (unknown id) - 10922 MiB free
time=2025-11-04T05:38:23.016-06:00 level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
[... the full load then repeats, verbatim, the same 30 key-value metadata dump, tensor-type summary, and "control token ... is not marked as EOG" listing shown for the vocab-only load above; omitted here as duplicates ...]
load: printing all EOG tokens:
load: - 128001 ('<|end_of_text|>')
load: - 128008 ('<|eom_id|>')
load: - 128009 ('<|eot_id|>')
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 3072
print_info: n_layer = 28
print_info: n_head =
24 print_info: n_head_kv = 8 print_info: n_rot = 128 print_info: n_swa = 0 print_info: is_swa_any = 0 print_info: n_embd_head_k = 128 print_info: n_embd_head_v = 128 print_info: n_gqa = 3 print_info: n_embd_k_gqa = 1024 print_info: n_embd_v_gqa = 1024 print_info: f_norm_eps = 0.0e+00 print_info: f_norm_rms_eps = 1.0e-05 print_info: f_clamp_kqv = 0.0e+00 print_info: f_max_alibi_bias = 0.0e+00 print_info: f_logit_scale = 0.0e+00 print_info: f_attn_scale = 0.0e+00 print_info: n_ff = 8192 print_info: n_expert = 0 print_info: n_expert_used = 0 print_info: causal attn = 1 print_info: pooling type = 0 print_info: rope type = 0 print_info: rope scaling = linear print_info: freq_base_train = 500000.0 print_info: freq_scale_train = 1 print_info: n_ctx_orig_yarn = 131072 print_info: rope_finetuned = unknown print_info: model type = 3B print_info: model params = 3.21 B print_info: general.name = Llama 3.2 3B Instruct print_info: vocab type = BPE print_info: n_vocab = 128256 print_info: n_merges = 280147 print_info: BOS token = 128000 '<|begin_of_text|>' print_info: EOS token = 128009 '<|eot_id|>' print_info: EOT token = 128001 '<|end_of_text|>' print_info: EOM token = 128008 '<|eom_id|>' print_info: LF token = 198 'Ċ' print_info: EOG token = 128001 '<|end_of_text|>' print_info: EOG token = 128008 '<|eom_id|>' print_info: EOG token = 128009 '<|eot_id|>' print_info: max token length = 256 load_tensors: loading model tensors, this can take a while... (mmap = false) load_tensors: layer 0 assigned to device CPU, is_swa = 0 load_tensors: layer 1 assigned to device CPU, is_swa = 0 load_tensors: layer 2 assigned to device CPU, is_swa = 0 load_tensors: layer 3 assigned to device CPU, is_swa = 0 load_tensors: layer 4 assigned to device CPU, is_swa = 0 load_tensors: layer 5 assigned to device CPU, is_swa = 0 load_tensors: layer 6 assigned to device CPU, is_swa = 0 load_tensors: layer 7 assigned to device CPU, is_swa = 0 load_tensors: layer 8 assigned to device CPU, is_swa = 0 load_tensors: layer 9 assigned to device CPU, is_swa = 0 load_tensors: layer 10 assigned to device CPU, is_swa = 0 load_tensors: layer 11 assigned to device CPU, is_swa = 0 load_tensors: layer 12 assigned to device CPU, is_swa = 0 load_tensors: layer 13 assigned to device CPU, is_swa = 0 load_tensors: layer 14 assigned to device CPU, is_swa = 0 load_tensors: layer 15 assigned to device CPU, is_swa = 0 load_tensors: layer 16 assigned to device CPU, is_swa = 0 load_tensors: layer 17 assigned to device CPU, is_swa = 0 load_tensors: layer 18 assigned to device Metal, is_swa = 0 load_tensors: layer 19 assigned to device Metal, is_swa = 0 load_tensors: layer 20 assigned to device Metal, is_swa = 0 load_tensors: layer 21 assigned to device Metal, is_swa = 0 load_tensors: layer 22 assigned to device Metal, is_swa = 0 load_tensors: layer 23 assigned to device Metal, is_swa = 0 load_tensors: layer 24 assigned to device Metal, is_swa = 0 load_tensors: layer 25 assigned to device Metal, is_swa = 0 load_tensors: layer 26 assigned to device Metal, is_swa = 0 load_tensors: layer 27 assigned to device Metal, is_swa = 0 load_tensors: layer 28 assigned to device CPU, is_swa = 0 create_tensor: loading tensor token_embd.weight create_tensor: loading tensor output_norm.weight create_tensor: loading tensor blk.0.attn_norm.weight create_tensor: loading tensor blk.0.attn_q.weight create_tensor: loading tensor blk.0.attn_k.weight create_tensor: loading tensor blk.0.attn_v.weight create_tensor: loading tensor blk.0.attn_output.weight create_tensor: loading tensor 
blk.0.ffn_norm.weight create_tensor: loading tensor rope_freqs.weight create_tensor: loading tensor blk.0.ffn_gate.weight create_tensor: loading tensor blk.0.ffn_down.weight create_tensor: loading tensor blk.0.ffn_up.weight create_tensor: loading tensor blk.1.attn_norm.weight create_tensor: loading tensor blk.1.attn_q.weight create_tensor: loading tensor blk.1.attn_k.weight create_tensor: loading tensor blk.1.attn_v.weight create_tensor: loading tensor blk.1.attn_output.weight create_tensor: loading tensor blk.1.ffn_norm.weight create_tensor: loading tensor blk.1.ffn_gate.weight create_tensor: loading tensor blk.1.ffn_down.weight create_tensor: loading tensor blk.1.ffn_up.weight create_tensor: loading tensor blk.2.attn_norm.weight create_tensor: loading tensor blk.2.attn_q.weight create_tensor: loading tensor blk.2.attn_k.weight create_tensor: loading tensor blk.2.attn_v.weight create_tensor: loading tensor blk.2.attn_output.weight create_tensor: loading tensor blk.2.ffn_norm.weight create_tensor: loading tensor blk.2.ffn_gate.weight create_tensor: loading tensor blk.2.ffn_down.weight create_tensor: loading tensor blk.2.ffn_up.weight create_tensor: loading tensor blk.3.attn_norm.weight create_tensor: loading tensor blk.3.attn_q.weight create_tensor: loading tensor blk.3.attn_k.weight create_tensor: loading tensor blk.3.attn_v.weight create_tensor: loading tensor blk.3.attn_output.weight create_tensor: loading tensor blk.3.ffn_norm.weight create_tensor: loading tensor blk.3.ffn_gate.weight create_tensor: loading tensor blk.3.ffn_down.weight create_tensor: loading tensor blk.3.ffn_up.weight create_tensor: loading tensor blk.4.attn_norm.weight create_tensor: loading tensor blk.4.attn_q.weight create_tensor: loading tensor blk.4.attn_k.weight create_tensor: loading tensor blk.4.attn_v.weight create_tensor: loading tensor blk.4.attn_output.weight create_tensor: loading tensor blk.4.ffn_norm.weight create_tensor: loading tensor blk.4.ffn_gate.weight create_tensor: loading tensor blk.4.ffn_down.weight create_tensor: loading tensor blk.4.ffn_up.weight create_tensor: loading tensor blk.5.attn_norm.weight create_tensor: loading tensor blk.5.attn_q.weight create_tensor: loading tensor blk.5.attn_k.weight create_tensor: loading tensor blk.5.attn_v.weight create_tensor: loading tensor blk.5.attn_output.weight create_tensor: loading tensor blk.5.ffn_norm.weight create_tensor: loading tensor blk.5.ffn_gate.weight create_tensor: loading tensor blk.5.ffn_down.weight create_tensor: loading tensor blk.5.ffn_up.weight create_tensor: loading tensor blk.6.attn_norm.weight create_tensor: loading tensor blk.6.attn_q.weight create_tensor: loading tensor blk.6.attn_k.weight create_tensor: loading tensor blk.6.attn_v.weight create_tensor: loading tensor blk.6.attn_output.weight create_tensor: loading tensor blk.6.ffn_norm.weight create_tensor: loading tensor blk.6.ffn_gate.weight create_tensor: loading tensor blk.6.ffn_down.weight create_tensor: loading tensor blk.6.ffn_up.weight create_tensor: loading tensor blk.7.attn_norm.weight create_tensor: loading tensor blk.7.attn_q.weight create_tensor: loading tensor blk.7.attn_k.weight create_tensor: loading tensor blk.7.attn_v.weight create_tensor: loading tensor blk.7.attn_output.weight create_tensor: loading tensor blk.7.ffn_norm.weight create_tensor: loading tensor blk.7.ffn_gate.weight create_tensor: loading tensor blk.7.ffn_down.weight create_tensor: loading tensor blk.7.ffn_up.weight create_tensor: loading tensor blk.8.attn_norm.weight create_tensor: loading tensor 
blk.8.attn_q.weight create_tensor: loading tensor blk.8.attn_k.weight create_tensor: loading tensor blk.8.attn_v.weight create_tensor: loading tensor blk.8.attn_output.weight create_tensor: loading tensor blk.8.ffn_norm.weight create_tensor: loading tensor blk.8.ffn_gate.weight create_tensor: loading tensor blk.8.ffn_down.weight create_tensor: loading tensor blk.8.ffn_up.weight create_tensor: loading tensor blk.9.attn_norm.weight create_tensor: loading tensor blk.9.attn_q.weight create_tensor: loading tensor blk.9.attn_k.weight create_tensor: loading tensor blk.9.attn_v.weight create_tensor: loading tensor blk.9.attn_output.weight create_tensor: loading tensor blk.9.ffn_norm.weight create_tensor: loading tensor blk.9.ffn_gate.weight create_tensor: loading tensor blk.9.ffn_down.weight create_tensor: loading tensor blk.9.ffn_up.weight create_tensor: loading tensor blk.10.attn_norm.weight create_tensor: loading tensor blk.10.attn_q.weight create_tensor: loading tensor blk.10.attn_k.weight create_tensor: loading tensor blk.10.attn_v.weight create_tensor: loading tensor blk.10.attn_output.weight create_tensor: loading tensor blk.10.ffn_norm.weight create_tensor: loading tensor blk.10.ffn_gate.weight create_tensor: loading tensor blk.10.ffn_down.weight create_tensor: loading tensor blk.10.ffn_up.weight create_tensor: loading tensor blk.11.attn_norm.weight create_tensor: loading tensor blk.11.attn_q.weight create_tensor: loading tensor blk.11.attn_k.weight create_tensor: loading tensor blk.11.attn_v.weight create_tensor: loading tensor blk.11.attn_output.weight create_tensor: loading tensor blk.11.ffn_norm.weight create_tensor: loading tensor blk.11.ffn_gate.weight create_tensor: loading tensor blk.11.ffn_down.weight create_tensor: loading tensor blk.11.ffn_up.weight create_tensor: loading tensor blk.12.attn_norm.weight create_tensor: loading tensor blk.12.attn_q.weight create_tensor: loading tensor blk.12.attn_k.weight create_tensor: loading tensor blk.12.attn_v.weight create_tensor: loading tensor blk.12.attn_output.weight create_tensor: loading tensor blk.12.ffn_norm.weight create_tensor: loading tensor blk.12.ffn_gate.weight create_tensor: loading tensor blk.12.ffn_down.weight create_tensor: loading tensor blk.12.ffn_up.weight create_tensor: loading tensor blk.13.attn_norm.weight create_tensor: loading tensor blk.13.attn_q.weight create_tensor: loading tensor blk.13.attn_k.weight create_tensor: loading tensor blk.13.attn_v.weight create_tensor: loading tensor blk.13.attn_output.weight create_tensor: loading tensor blk.13.ffn_norm.weight create_tensor: loading tensor blk.13.ffn_gate.weight create_tensor: loading tensor blk.13.ffn_down.weight create_tensor: loading tensor blk.13.ffn_up.weight create_tensor: loading tensor blk.14.attn_norm.weight create_tensor: loading tensor blk.14.attn_q.weight create_tensor: loading tensor blk.14.attn_k.weight create_tensor: loading tensor blk.14.attn_v.weight create_tensor: loading tensor blk.14.attn_output.weight create_tensor: loading tensor blk.14.ffn_norm.weight create_tensor: loading tensor blk.14.ffn_gate.weight create_tensor: loading tensor blk.14.ffn_down.weight create_tensor: loading tensor blk.14.ffn_up.weight create_tensor: loading tensor blk.15.attn_norm.weight create_tensor: loading tensor blk.15.attn_q.weight create_tensor: loading tensor blk.15.attn_k.weight create_tensor: loading tensor blk.15.attn_v.weight create_tensor: loading tensor blk.15.attn_output.weight create_tensor: loading tensor blk.15.ffn_norm.weight create_tensor: loading 
tensor blk.15.ffn_gate.weight create_tensor: loading tensor blk.15.ffn_down.weight create_tensor: loading tensor blk.15.ffn_up.weight create_tensor: loading tensor blk.16.attn_norm.weight create_tensor: loading tensor blk.16.attn_q.weight create_tensor: loading tensor blk.16.attn_k.weight create_tensor: loading tensor blk.16.attn_v.weight create_tensor: loading tensor blk.16.attn_output.weight create_tensor: loading tensor blk.16.ffn_norm.weight create_tensor: loading tensor blk.16.ffn_gate.weight create_tensor: loading tensor blk.16.ffn_down.weight create_tensor: loading tensor blk.16.ffn_up.weight create_tensor: loading tensor blk.17.attn_norm.weight create_tensor: loading tensor blk.17.attn_q.weight create_tensor: loading tensor blk.17.attn_k.weight create_tensor: loading tensor blk.17.attn_v.weight create_tensor: loading tensor blk.17.attn_output.weight create_tensor: loading tensor blk.17.ffn_norm.weight create_tensor: loading tensor blk.17.ffn_gate.weight create_tensor: loading tensor blk.17.ffn_down.weight create_tensor: loading tensor blk.17.ffn_up.weight create_tensor: loading tensor blk.18.attn_norm.weight create_tensor: loading tensor blk.18.attn_q.weight create_tensor: loading tensor blk.18.attn_k.weight create_tensor: loading tensor blk.18.attn_v.weight create_tensor: loading tensor blk.18.attn_output.weight create_tensor: loading tensor blk.18.ffn_norm.weight create_tensor: loading tensor rope_freqs.weight create_tensor: loading tensor blk.18.ffn_gate.weight create_tensor: loading tensor blk.18.ffn_down.weight create_tensor: loading tensor blk.18.ffn_up.weight create_tensor: loading tensor blk.19.attn_norm.weight create_tensor: loading tensor blk.19.attn_q.weight create_tensor: loading tensor blk.19.attn_k.weight create_tensor: loading tensor blk.19.attn_v.weight create_tensor: loading tensor blk.19.attn_output.weight create_tensor: loading tensor blk.19.ffn_norm.weight create_tensor: loading tensor blk.19.ffn_gate.weight create_tensor: loading tensor blk.19.ffn_down.weight create_tensor: loading tensor blk.19.ffn_up.weight create_tensor: loading tensor blk.20.attn_norm.weight create_tensor: loading tensor blk.20.attn_q.weight create_tensor: loading tensor blk.20.attn_k.weight create_tensor: loading tensor blk.20.attn_v.weight create_tensor: loading tensor blk.20.attn_output.weight create_tensor: loading tensor blk.20.ffn_norm.weight create_tensor: loading tensor blk.20.ffn_gate.weight create_tensor: loading tensor blk.20.ffn_down.weight create_tensor: loading tensor blk.20.ffn_up.weight create_tensor: loading tensor blk.21.attn_norm.weight create_tensor: loading tensor blk.21.attn_q.weight create_tensor: loading tensor blk.21.attn_k.weight create_tensor: loading tensor blk.21.attn_v.weight create_tensor: loading tensor blk.21.attn_output.weight create_tensor: loading tensor blk.21.ffn_norm.weight create_tensor: loading tensor blk.21.ffn_gate.weight create_tensor: loading tensor blk.21.ffn_down.weight create_tensor: loading tensor blk.21.ffn_up.weight create_tensor: loading tensor blk.22.attn_norm.weight create_tensor: loading tensor blk.22.attn_q.weight create_tensor: loading tensor blk.22.attn_k.weight create_tensor: loading tensor blk.22.attn_v.weight create_tensor: loading tensor blk.22.attn_output.weight create_tensor: loading tensor blk.22.ffn_norm.weight create_tensor: loading tensor blk.22.ffn_gate.weight create_tensor: loading tensor blk.22.ffn_down.weight create_tensor: loading tensor blk.22.ffn_up.weight create_tensor: loading tensor blk.23.attn_norm.weight 
create_tensor: loading tensor blk.23.attn_q.weight create_tensor: loading tensor blk.23.attn_k.weight create_tensor: loading tensor blk.23.attn_v.weight create_tensor: loading tensor blk.23.attn_output.weight create_tensor: loading tensor blk.23.ffn_norm.weight create_tensor: loading tensor blk.23.ffn_gate.weight create_tensor: loading tensor blk.23.ffn_down.weight create_tensor: loading tensor blk.23.ffn_up.weight create_tensor: loading tensor blk.24.attn_norm.weight create_tensor: loading tensor blk.24.attn_q.weight create_tensor: loading tensor blk.24.attn_k.weight create_tensor: loading tensor blk.24.attn_v.weight create_tensor: loading tensor blk.24.attn_output.weight create_tensor: loading tensor blk.24.ffn_norm.weight create_tensor: loading tensor blk.24.ffn_gate.weight create_tensor: loading tensor blk.24.ffn_down.weight create_tensor: loading tensor blk.24.ffn_up.weight create_tensor: loading tensor blk.25.attn_norm.weight create_tensor: loading tensor blk.25.attn_q.weight create_tensor: loading tensor blk.25.attn_k.weight create_tensor: loading tensor blk.25.attn_v.weight create_tensor: loading tensor blk.25.attn_output.weight create_tensor: loading tensor blk.25.ffn_norm.weight create_tensor: loading tensor blk.25.ffn_gate.weight create_tensor: loading tensor blk.25.ffn_down.weight create_tensor: loading tensor blk.25.ffn_up.weight create_tensor: loading tensor blk.26.attn_norm.weight create_tensor: loading tensor blk.26.attn_q.weight create_tensor: loading tensor blk.26.attn_k.weight create_tensor: loading tensor blk.26.attn_v.weight create_tensor: loading tensor blk.26.attn_output.weight create_tensor: loading tensor blk.26.ffn_norm.weight create_tensor: loading tensor blk.26.ffn_gate.weight create_tensor: loading tensor blk.26.ffn_down.weight create_tensor: loading tensor blk.26.ffn_up.weight create_tensor: loading tensor blk.27.attn_norm.weight create_tensor: loading tensor blk.27.attn_q.weight create_tensor: loading tensor blk.27.attn_k.weight create_tensor: loading tensor blk.27.attn_v.weight create_tensor: loading tensor blk.27.attn_output.weight create_tensor: loading tensor blk.27.ffn_norm.weight create_tensor: loading tensor blk.27.ffn_gate.weight create_tensor: loading tensor blk.27.ffn_down.weight create_tensor: loading tensor blk.27.ffn_up.weight load_tensors: offloading 10 repeating layers to GPU load_tensors: offloaded 10/29 layers to GPU load_tensors: CPU model buffer size = 1330.17 MiB load_tensors: Metal model buffer size = 588.19 MiB load_all_data: no device found for buffer type CPU for async uploads time=2025-11-04T05:38:23.519-06:00 level=DEBUG source=server.go:1295 msg="model load progress 0.50" load_all_data: device Metal does not support async, host buffers or events time=2025-11-04T05:38:23.771-06:00 level=DEBUG source=server.go:1295 msg="model load progress 0.83" llama_init_from_model: model default pooling_type is [0], but [-1] was specified llama_context: constructing llama_context llama_context: n_seq_max = 1 llama_context: n_ctx = 16384 llama_context: n_ctx_per_seq = 16384 llama_context: n_batch = 512 llama_context: n_ubatch = 512 llama_context: causal_attn = 1 llama_context: flash_attn = disabled llama_context: kv_unified = false llama_context: freq_base = 500000.0 llama_context: freq_scale = 1 llama_context: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized ggml_metal_init: allocating ggml_metal_init: picking default device: Apple M2 ggml_metal_init: use bfloat = true ggml_metal_init: use fusion = 
true ggml_metal_init: use concurrency = true ggml_metal_init: use graph optimize = true set_abort_callback: call llama_context: CPU output buffer size = 0.50 MiB create_memory: n_ctx = 16384 (padded) llama_kv_cache: layer 0: dev = CPU llama_kv_cache: layer 1: dev = CPU llama_kv_cache: layer 2: dev = CPU llama_kv_cache: layer 3: dev = CPU llama_kv_cache: layer 4: dev = CPU llama_kv_cache: layer 5: dev = CPU llama_kv_cache: layer 6: dev = CPU llama_kv_cache: layer 7: dev = CPU llama_kv_cache: layer 8: dev = CPU llama_kv_cache: layer 9: dev = CPU llama_kv_cache: layer 10: dev = CPU llama_kv_cache: layer 11: dev = CPU llama_kv_cache: layer 12: dev = CPU llama_kv_cache: layer 13: dev = CPU llama_kv_cache: layer 14: dev = CPU llama_kv_cache: layer 15: dev = CPU llama_kv_cache: layer 16: dev = CPU llama_kv_cache: layer 17: dev = CPU llama_kv_cache: layer 18: dev = Metal llama_kv_cache: layer 19: dev = Metal llama_kv_cache: layer 20: dev = Metal llama_kv_cache: layer 21: dev = Metal llama_kv_cache: layer 22: dev = Metal llama_kv_cache: layer 23: dev = Metal llama_kv_cache: layer 24: dev = Metal llama_kv_cache: layer 25: dev = Metal llama_kv_cache: layer 26: dev = Metal llama_kv_cache: layer 27: dev = Metal llama_kv_cache: CPU KV buffer size = 1152.00 MiB llama_kv_cache: Metal KV buffer size = 640.00 MiB llama_kv_cache: size = 1792.00 MiB ( 16384 cells, 28 layers, 1/1 seqs), K (f16): 896.00 MiB, V (f16): 896.00 MiB llama_context: enumerating backends llama_context: backend_ptrs.size() = 3 llama_context: max_nodes = 2048 llama_context: reserving full memory module llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 1 graph_reserve: reserving a graph for ubatch with n_tokens = 512, n_seqs = 1, n_outputs = 512 graph_reserve: reserving a graph for ubatch with n_tokens = 1, n_seqs = 1, n_outputs = 1 graph_reserve: reserving a graph for ubatch with n_tokens = 512, n_seqs = 1, n_outputs = 512 llama_context: Metal compute buffer size = 816.01 MiB llama_context: CPU compute buffer size = 828.01 MiB llama_context: graph nodes = 1014 llama_context: graph splits = 255 (with bs=512), 3 (with bs=1) time=2025-11-04T05:38:24.021-06:00 level=INFO source=server.go:1289 msg="llama runner started in 45.86 seconds" time=2025-11-04T05:38:24.021-06:00 level=INFO source=sched.go:493 msg="loaded runners" count=1 time=2025-11-04T05:38:24.021-06:00 level=DEBUG source=sched.go:602 msg="evaluating already loaded" model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff time=2025-11-04T05:38:24.021-06:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding" time=2025-11-04T05:38:24.021-06:00 level=INFO source=server.go:1289 msg="llama runner started in 45.86 seconds" time=2025-11-04T05:38:24.021-06:00 level=DEBUG source=sched.go:505 msg="finished setting up" runner.name=registry.ollama.ai/library/llama3.2:3b-instruct-q4_K_M runner.inference="[{ID:0 Library:Metal}]" runner.size="5.1 GiB" runner.vram="2.6 GiB" runner.parallel=1 runner.pid=21139 runner.model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff runner.num_ctx=16384 time=2025-11-04T05:38:24.021-06:00 level=DEBUG source=sched.go:602 msg="evaluating already loaded" model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff time=2025-11-04T05:38:24.023-06:00 level=DEBUG source=server.go:1401 msg="completion request" images=0 prompt=1079 format="" 
time=2025-11-04T05:38:24.024-06:00 level=DEBUG source=server.go:1401 msg="completion request" images=0 prompt=1079 format=""
time=2025-11-04T05:38:24.024-06:00 level=DEBUG source=server.go:1401 msg="completion request" images=0 prompt=1023 format=""
time=2025-11-04T05:38:24.024-06:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=229 used=0 remaining=229
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_add_fuse_1', name = 'kernel_add_fuse_1'
ggml_metal_library_compile_pipeline: loaded kernel_add_fuse_1 0x145e07be0 | th_max = 1024 | th_width = 32
[... compile/load pairs for the remaining Metal kernels (rms_norm, mul_mm, rope, set_rows, soft_max, cpy, swiglu, get_rows, mul_mv variants) elided ...]
[GIN] 2025/11/04 - 05:38:25 | 200 | 51.105175s | 100.116.113.16 | POST "/v1/chat/completions"
time=2025-11-04T05:38:25.848-06:00 level=DEBUG source=sched.go:513 msg="context for request finished"
time=2025-11-04T05:38:25.848-06:00 level=DEBUG source=sched.go:301 msg="after processing request finished event" runner.name=registry.ollama.ai/library/llama3.2:3b-instruct-q4_K_M runner.inference="[{ID:0 Library:Metal}]" runner.size="5.1 GiB" runner.vram="2.6 GiB" runner.parallel=1 runner.pid=21139 runner.model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff runner.num_ctx=16384 refCount=2
time=2025-11-04T05:38:25.849-06:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=233 prompt=229 used=228 remaining=1
[GIN] 2025/11/04 - 05:38:26 | 200 | 3.073442708s | 100.116.113.16 | POST "/v1/chat/completions"
time=2025-11-04T05:38:26.050-06:00 level=DEBUG source=sched.go:378 msg="context for request finished" runner.name=registry.ollama.ai/library/llama3.2:3b-instruct-q4_K_M runner.inference="[{ID:0 Library:Metal}]" runner.size="5.1 GiB" runner.vram="2.6 GiB" runner.parallel=1 runner.pid=21139 runner.model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff runner.num_ctx=16384
time=2025-11-04T05:38:26.050-06:00 level=DEBUG source=sched.go:301 msg="after processing request finished event" runner.name=registry.ollama.ai/library/llama3.2:3b-instruct-q4_K_M runner.inference="[{ID:0 Library:Metal}]" runner.size="5.1 GiB" runner.vram="2.6 GiB" runner.parallel=1 runner.pid=21139 runner.model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff runner.num_ctx=16384 refCount=1
time=2025-11-04T05:38:26.051-06:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=234 prompt=220 used=208 remaining=12
[GIN] 2025/11/04 - 05:38:26 | 200 | 51.766675292s | 100.116.113.16 | POST "/v1/chat/completions"
time=2025-11-04T05:38:26.503-06:00 level=DEBUG source=sched.go:378 msg="context for request finished" runner.name=registry.ollama.ai/library/llama3.2:3b-instruct-q4_K_M runner.inference="[{ID:0 Library:Metal}]" runner.size="5.1 GiB" runner.vram="2.6 GiB" runner.parallel=1 runner.pid=21139 runner.model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff runner.num_ctx=16384
time=2025-11-04T05:38:26.503-06:00 level=DEBUG source=sched.go:283 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/llama3.2:3b-instruct-q4_K_M runner.inference="[{ID:0 Library:Metal}]" runner.size="5.1 GiB" runner.vram="2.6 GiB" runner.parallel=1 runner.pid=21139 runner.model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff runner.num_ctx=16384 duration=2562047h47m16.854775807s
time=2025-11-04T05:38:26.503-06:00 level=DEBUG source=sched.go:301 msg="after processing request finished event" runner.name=registry.ollama.ai/library/llama3.2:3b-instruct-q4_K_M runner.inference="[{ID:0 Library:Metal}]" runner.size="5.1 GiB" runner.vram="2.6 GiB" runner.parallel=1 runner.pid=21139 runner.model=/Users/pjv/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff runner.num_ctx=16384 refCount=0
```
</details>

@pjv commented on GitHub (Nov 7, 2025):

Fixed indeed in v. 0.12.10.

Reference: github-starred/ollama#70619