[GH-ISSUE #12094] GPU VRAM Free 3.9GB, Model 3.8GB, not switching to CPU. phi4-mini-reasoning:3.8b and phi4-reasoning:14b crashing since upgrading to ollama 0.11.7 #54549

Open
opened 2026-04-29 06:18:51 -05:00 by GiteaMirror · 7 comments

Originally created by @BiggRanger on GitHub (Aug 27, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12094

What is the issue?

Since upgrading to Ollama 0.11.7, I have not been able to run phi4-mini-reasoning:3.8b or phi4-reasoning:14b. Immediately after entering a prompt, I get an error and am dropped back to the command prompt.

gpt-oss:20b, gemma3n:e4b, and granite3.3:2b seem to work fine in version 0.11.7.

The PC is a Dell Precision 7510 with 32 GB RAM and an Nvidia Quadro M2000M (4 GB VRAM), running Kubuntu 24.04.
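To illustrate how tight the fit is, here is a minimal sketch (mine, not part of the original report) that parses the memory figures from the scheduler's `offload` log line below and computes the remaining VRAM headroom:

```python
# Illustrative sketch: values copied from the "msg=offload" log line below.
# Shows how little slack the partial-offload estimate leaves on a 3.9 GiB GPU.
import re

line = ('memory.available="[3.9 GiB]" memory.required.full="4.2 GiB" '
        'memory.required.partial="3.8 GiB"')

# Extract each memory.* key and its GiB value from the log line.
vals = dict(re.findall(r'memory\.(\S+?)="\[?([\d.]+) GiB\]?"', line))
headroom = float(vals["available"]) - float(vals["required.partial"])
print(f"estimated headroom: {headroom:.1f} GiB")  # → ~0.1 GiB
```

With only about 0.1 GiB of slack between the estimate and available VRAM, the later `cuMemCreate` allocation during decode (visible in the crash below) can plausibly exhaust the device instead of the scheduler falling back to CPU.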

Relevant log output

-- Boot ac336d1928a54bdd8ca2ce5d13f20ff7 --
Aug 26 18:03:15 D7510DC systemd[1]: Started ollama.service - Ollama Service.
Aug 26 18:03:15 D7510DC ollama[2730]: time=2025-08-26T18:03:15.640-04:00 level=INFO source=routes.go:1331 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NEW_ESTIMATES:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Aug 26 18:03:15 D7510DC ollama[2730]: time=2025-08-26T18:03:15.651-04:00 level=INFO source=images.go:477 msg="total blobs: 27"
Aug 26 18:03:15 D7510DC ollama[2730]: time=2025-08-26T18:03:15.652-04:00 level=INFO source=images.go:484 msg="total unused blobs removed: 0"
Aug 26 18:03:15 D7510DC ollama[2730]: time=2025-08-26T18:03:15.653-04:00 level=INFO source=routes.go:1384 msg="Listening on 127.0.0.1:11434 (version 0.11.7)"
Aug 26 18:03:15 D7510DC ollama[2730]: time=2025-08-26T18:03:15.654-04:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
Aug 26 18:03:15 D7510DC ollama[2730]: time=2025-08-26T18:03:15.829-04:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-cec41a1e-6c00-eb82-ec97-e5f8f1a23ee8 library=cuda variant=v12 compute=5.0 driver=12.9 name="Quadro M2000M" total="3.9 GiB" available="3.9 GiB"
Aug 26 18:03:15 D7510DC ollama[2730]: time=2025-08-26T18:03:15.829-04:00 level=INFO source=routes.go:1425 msg="entering low vram mode" "total vram"="3.9 GiB" threshold="20.0 GiB"
Aug 26 20:46:02 D7510DC ollama[2730]: [GIN] 2025/08/26 - 20:46:02 | 200 |      993.46µs |       127.0.0.1 | HEAD     "/"
Aug 26 20:46:02 D7510DC ollama[2730]: [GIN] 2025/08/26 - 20:46:02 | 200 |  106.100586ms |       127.0.0.1 | POST     "/api/show"
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: loaded meta data with 36 key-value pairs and 196 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-f4dd2368e6c32725dc1c5c5548ae9ee2724d6a79052952eb50b65e26288022c4 (version GGUF V3 (latest))
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: - kv   0:                       general.architecture str              = phi3
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: - kv   1:              phi3.rope.scaling.attn_factor f32              = 1.190238
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: - kv   2:                               general.type str              = model
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: - kv   3:                               general.name str              = Phi 4 Mini Reasoning
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: - kv   4:                           general.finetune str              = reasoning
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: - kv   5:                           general.basename str              = Phi-4
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: - kv   6:                         general.size_label str              = mini
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: - kv   7:                            general.license str              = mit
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: - kv   8:                       general.license.link str              = https://huggingface.co/microsoft/Phi-...
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: - kv   9:                               general.tags arr[str,4]       = ["nlp", "math", "code", "text-generat...
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: - kv  10:                          general.languages arr[str,1]       = ["en"]
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: - kv  11:                        phi3.context_length u32              = 131072
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: - kv  12:  phi3.rope.scaling.original_context_length u32              = 4096
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: - kv  13:                      phi3.embedding_length u32              = 3072
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: - kv  14:                   phi3.feed_forward_length u32              = 8192
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: - kv  15:                           phi3.block_count u32              = 32
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: - kv  16:                  phi3.attention.head_count u32              = 24
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: - kv  17:               phi3.attention.head_count_kv u32              = 8
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: - kv  18:      phi3.attention.layer_norm_rms_epsilon f32              = 0.000010
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: - kv  19:                  phi3.rope.dimension_count u32              = 96
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: - kv  20:                        phi3.rope.freq_base f32              = 10000.000000
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: - kv  21:              phi3.attention.sliding_window u32              = 262144
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: - kv  22:                       tokenizer.ggml.model str              = gpt2
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: - kv  23:                         tokenizer.ggml.pre str              = gpt-4o
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: - kv  24:                      tokenizer.ggml.tokens arr[str,200064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,200064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: - kv  26:                      tokenizer.ggml.merges arr[str,199742]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "e r", ...
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: - kv  27:                tokenizer.ggml.bos_token_id u32              = 199999
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 199999
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: - kv  29:            tokenizer.ggml.unknown_token_id u32              = 199999
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 199999
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: - kv  31:               tokenizer.ggml.add_bos_token bool             = false
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: - kv  32:               tokenizer.ggml.add_eos_token bool             = false
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {{ '<|system|>Your name is Phi, an AI...
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: - kv  34:               general.quantization_version u32              = 2
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: - kv  35:                          general.file_type u32              = 15
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: - type  f32:   67 tensors
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: - type  f16:   32 tensors
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: - type q4_K:   80 tensors
Aug 26 20:46:02 D7510DC ollama[2730]: llama_model_loader: - type q6_K:   17 tensors
Aug 26 20:46:02 D7510DC ollama[2730]: print_info: file format = GGUF V3 (latest)
Aug 26 20:46:02 D7510DC ollama[2730]: print_info: file type   = Q4_K - Medium
Aug 26 20:46:02 D7510DC ollama[2730]: print_info: file size   = 2.93 GiB (6.56 BPW)
Aug 26 20:46:03 D7510DC ollama[2730]: load: printing all EOG tokens:
Aug 26 20:46:03 D7510DC ollama[2730]: load:   - 199999 ('<|endoftext|>')
Aug 26 20:46:03 D7510DC ollama[2730]: load:   - 200020 ('<|end|>')
Aug 26 20:46:03 D7510DC ollama[2730]: load: special tokens cache size = 12
Aug 26 20:46:03 D7510DC ollama[2730]: load: token to piece cache size = 1.3333 MB
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: arch             = phi3
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: vocab_only       = 1
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: model type       = ?B
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: model params     = 3.84 B
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: general.name     = Phi 4 Mini Reasoning
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: vocab type       = BPE
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: n_vocab          = 200064
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: n_merges         = 199742
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: BOS token        = 199999 '<|endoftext|>'
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: EOS token        = 199999 '<|endoftext|>'
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: EOT token        = 199999 '<|endoftext|>'
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: UNK token        = 199999 '<|endoftext|>'
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: PAD token        = 199999 '<|endoftext|>'
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: LF token         = 198 'Ċ'
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: EOG token        = 199999 '<|endoftext|>'
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: EOG token        = 200020 '<|end|>'
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: max token length = 256
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_load: vocab only - skipping tensors
Aug 26 20:46:03 D7510DC ollama[2730]: time=2025-08-26T20:46:03.261-04:00 level=INFO source=server.go:383 msg="starting runner" cmd="/usr/local/bin/ollama runner --model /usr/share/ollama/.ollama/models/blobs/sha256-f4dd2368e6c32725dc1c5c5548ae9ee2724d6a79052952eb50b65e26288022c4 --port 32793"
Aug 26 20:46:03 D7510DC ollama[2730]: time=2025-08-26T20:46:03.278-04:00 level=INFO source=runner.go:864 msg="starting go runner"
Aug 26 20:46:03 D7510DC ollama[2730]: time=2025-08-26T20:46:03.305-04:00 level=INFO source=server.go:488 msg="system memory" total="31.0 GiB" free="27.4 GiB" free_swap="512.0 MiB"
Aug 26 20:46:03 D7510DC ollama[2730]: time=2025-08-26T20:46:03.306-04:00 level=INFO source=server.go:528 msg=offload library=cuda layers.requested=-1 layers.model=33 layers.offload=32 layers.split=[32] memory.available="[3.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="4.2 GiB" memory.required.partial="3.8 GiB" memory.required.kv="512.0 MiB" memory.required.allocations="[3.8 GiB]" memory.weights.total="2.9 GiB" memory.weights.repeating="2.5 GiB" memory.weights.nonrepeating="480.8 MiB" memory.graph.full="256.0 MiB" memory.graph.partial="256.0 MiB"
Aug 26 20:46:03 D7510DC ollama[2730]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Aug 26 20:46:03 D7510DC ollama[2730]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Aug 26 20:46:03 D7510DC ollama[2730]: ggml_cuda_init: found 1 CUDA devices:
Aug 26 20:46:03 D7510DC ollama[2730]:   Device 0: Quadro M2000M, compute capability 5.0, VMM: yes, ID: GPU-cec41a1e-6c00-eb82-ec97-e5f8f1a23ee8
Aug 26 20:46:03 D7510DC ollama[2730]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/libggml-cuda.so
Aug 26 20:46:03 D7510DC ollama[2730]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
Aug 26 20:46:03 D7510DC ollama[2730]: time=2025-08-26T20:46:03.446-04:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Aug 26 20:46:03 D7510DC ollama[2730]: time=2025-08-26T20:46:03.446-04:00 level=INFO source=runner.go:900 msg="Server listening on 127.0.0.1:32793"
Aug 26 20:46:03 D7510DC ollama[2730]: time=2025-08-26T20:46:03.455-04:00 level=INFO source=runner.go:799 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:4096 KvCacheType: NumThreads:4 GPULayers:32[ID:GPU-cec41a1e-6c00-eb82-ec97-e5f8f1a23ee8 Layers:32(0..31)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:true}"
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_load_from_file_impl: using device CUDA0 (Quadro M2000M) - 3999 MiB free
Aug 26 20:46:03 D7510DC ollama[2730]: time=2025-08-26T20:46:03.490-04:00 level=INFO source=server.go:1231 msg="waiting for llama runner to start responding"
Aug 26 20:46:03 D7510DC ollama[2730]: time=2025-08-26T20:46:03.490-04:00 level=INFO source=server.go:1265 msg="waiting for server to become available" status="llm server loading model"
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: loaded meta data with 36 key-value pairs and 196 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-f4dd2368e6c32725dc1c5c5548ae9ee2724d6a79052952eb50b65e26288022c4 (version GGUF V3 (latest))
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv   0:                       general.architecture str              = phi3
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv   1:              phi3.rope.scaling.attn_factor f32              = 1.190238
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv   2:                               general.type str              = model
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv   3:                               general.name str              = Phi 4 Mini Reasoning
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv   4:                           general.finetune str              = reasoning
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv   5:                           general.basename str              = Phi-4
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv   6:                         general.size_label str              = mini
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv   7:                            general.license str              = mit
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv   8:                       general.license.link str              = https://huggingface.co/microsoft/Phi-...
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv   9:                               general.tags arr[str,4]       = ["nlp", "math", "code", "text-generat...
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv  10:                          general.languages arr[str,1]       = ["en"]
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv  11:                        phi3.context_length u32              = 131072
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv  12:  phi3.rope.scaling.original_context_length u32              = 4096
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv  13:                      phi3.embedding_length u32              = 3072
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv  14:                   phi3.feed_forward_length u32              = 8192
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv  15:                           phi3.block_count u32              = 32
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv  16:                  phi3.attention.head_count u32              = 24
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv  17:               phi3.attention.head_count_kv u32              = 8
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv  18:      phi3.attention.layer_norm_rms_epsilon f32              = 0.000010
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv  19:                  phi3.rope.dimension_count u32              = 96
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv  20:                        phi3.rope.freq_base f32              = 10000.000000
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv  21:              phi3.attention.sliding_window u32              = 262144
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv  22:                       tokenizer.ggml.model str              = gpt2
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv  23:                         tokenizer.ggml.pre str              = gpt-4o
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv  24:                      tokenizer.ggml.tokens arr[str,200064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,200064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv  26:                      tokenizer.ggml.merges arr[str,199742]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "e r", ...
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv  27:                tokenizer.ggml.bos_token_id u32              = 199999
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 199999
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv  29:            tokenizer.ggml.unknown_token_id u32              = 199999
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 199999
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv  31:               tokenizer.ggml.add_bos_token bool             = false
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv  32:               tokenizer.ggml.add_eos_token bool             = false
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {{ '<|system|>Your name is Phi, an AI...
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv  34:               general.quantization_version u32              = 2
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv  35:                          general.file_type u32              = 15
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - type  f32:   67 tensors
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - type  f16:   32 tensors
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - type q4_K:   80 tensors
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - type q6_K:   17 tensors
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: file format = GGUF V3 (latest)
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: file type   = Q4_K - Medium
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: file size   = 2.93 GiB (6.56 BPW)
Aug 26 20:46:03 D7510DC ollama[2730]: load_hparams: Phi SWA is currently disabled - results might be suboptimal for some models (see https://github.com/ggml-org/llama.cpp/pull/13676)
Aug 26 20:46:03 D7510DC ollama[2730]: load: printing all EOG tokens:
Aug 26 20:46:03 D7510DC ollama[2730]: load:   - 199999 ('<|endoftext|>')
Aug 26 20:46:03 D7510DC ollama[2730]: load:   - 200020 ('<|end|>')
Aug 26 20:46:03 D7510DC ollama[2730]: load: special tokens cache size = 12
Aug 26 20:46:03 D7510DC ollama[2730]: load: token to piece cache size = 1.3333 MB
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: arch             = phi3
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: vocab_only       = 0
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: n_ctx_train      = 131072
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: n_embd           = 3072
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: n_layer          = 32
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: n_head           = 24
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: n_head_kv        = 8
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: n_rot            = 96
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: n_swa            = 0
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: is_swa_any       = 0
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: n_embd_head_k    = 128
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: n_embd_head_v    = 128
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: n_gqa            = 3
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: n_embd_k_gqa     = 1024
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: n_embd_v_gqa     = 1024
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: f_norm_eps       = 0.0e+00
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: f_norm_rms_eps   = 1.0e-05
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: f_clamp_kqv      = 0.0e+00
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: f_max_alibi_bias = 0.0e+00
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: f_logit_scale    = 0.0e+00
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: f_attn_scale     = 0.0e+00
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: n_ff             = 8192
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: n_expert         = 0
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: n_expert_used    = 0
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: causal attn      = 1
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: pooling type     = 0
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: rope type        = 2
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: rope scaling     = linear
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: freq_base_train  = 10000.0
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: freq_scale_train = 1
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: n_ctx_orig_yarn  = 4096
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: rope_finetuned   = unknown
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: model type       = 3B
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: model params     = 3.84 B
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: general.name     = Phi 4 Mini Reasoning
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: vocab type       = BPE
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: n_vocab          = 200064
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: n_merges         = 199742
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: BOS token        = 199999 '<|endoftext|>'
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: EOS token        = 199999 '<|endoftext|>'
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: EOT token        = 199999 '<|endoftext|>'
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: UNK token        = 199999 '<|endoftext|>'
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: PAD token        = 199999 '<|endoftext|>'
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: LF token         = 198 'Ċ'
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: EOG token        = 199999 '<|endoftext|>'
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: EOG token        = 200020 '<|end|>'
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: max token length = 256
Aug 26 20:46:03 D7510DC ollama[2730]: load_tensors: loading model tensors, this can take a while... (mmap = true)
Aug 26 20:46:06 D7510DC ollama[2730]: load_tensors: offloading 32 repeating layers to GPU
Aug 26 20:46:06 D7510DC ollama[2730]: load_tensors: offloaded 32/33 layers to GPU
Aug 26 20:46:06 D7510DC ollama[2730]: load_tensors:        CUDA0 model buffer size =  2517.75 MiB
Aug 26 20:46:06 D7510DC ollama[2730]: load_tensors:   CPU_Mapped model buffer size =   480.82 MiB
Aug 26 20:46:07 D7510DC ollama[2730]: llama_context: constructing llama_context
Aug 26 20:46:07 D7510DC ollama[2730]: llama_context: n_seq_max     = 1
Aug 26 20:46:07 D7510DC ollama[2730]: llama_context: n_ctx         = 4096
Aug 26 20:46:07 D7510DC ollama[2730]: llama_context: n_ctx_per_seq = 4096
Aug 26 20:46:07 D7510DC ollama[2730]: llama_context: n_batch       = 512
Aug 26 20:46:07 D7510DC ollama[2730]: llama_context: n_ubatch      = 512
Aug 26 20:46:07 D7510DC ollama[2730]: llama_context: causal_attn   = 1
Aug 26 20:46:07 D7510DC ollama[2730]: llama_context: flash_attn    = 0
Aug 26 20:46:07 D7510DC ollama[2730]: llama_context: kv_unified    = false
Aug 26 20:46:07 D7510DC ollama[2730]: llama_context: freq_base     = 10000.0
Aug 26 20:46:07 D7510DC ollama[2730]: llama_context: freq_scale    = 1
Aug 26 20:46:07 D7510DC ollama[2730]: llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
Aug 26 20:46:07 D7510DC ollama[2730]: llama_context:        CPU  output buffer size =     0.77 MiB
Aug 26 20:46:07 D7510DC ollama[2730]: llama_kv_cache_unified:      CUDA0 KV buffer size =   512.00 MiB
Aug 26 20:46:07 D7510DC ollama[2730]: llama_kv_cache_unified: size =  512.00 MiB (  4096 cells,  32 layers,  1/1 seqs), K (f16):  256.00 MiB, V (f16):  256.00 MiB
Aug 26 20:46:07 D7510DC ollama[2730]: llama_context:      CUDA0 compute buffer size =   883.56 MiB
Aug 26 20:46:07 D7510DC ollama[2730]: llama_context:  CUDA_Host compute buffer size =    18.01 MiB
Aug 26 20:46:07 D7510DC ollama[2730]: llama_context: graph nodes  = 1126
Aug 26 20:46:07 D7510DC ollama[2730]: llama_context: graph splits = 4 (with bs=512), 3 (with bs=1)
Aug 26 20:46:07 D7510DC ollama[2730]: time=2025-08-26T20:46:07.253-04:00 level=INFO source=server.go:1269 msg="llama runner started in 3.99 seconds"
Aug 26 20:46:07 D7510DC ollama[2730]: time=2025-08-26T20:46:07.253-04:00 level=INFO source=sched.go:473 msg="loaded runners" count=1
Aug 26 20:46:07 D7510DC ollama[2730]: time=2025-08-26T20:46:07.253-04:00 level=INFO source=server.go:1231 msg="waiting for llama runner to start responding"
Aug 26 20:46:07 D7510DC ollama[2730]: time=2025-08-26T20:46:07.253-04:00 level=INFO source=server.go:1269 msg="llama runner started in 3.99 seconds"
Aug 26 20:46:07 D7510DC ollama[2730]: [GIN] 2025/08/26 - 20:46:07 | 200 |  4.805253518s |       127.0.0.1 | POST     "/api/generate"
Aug 26 20:46:12 D7510DC ollama[2730]: CUDA error: out of memory
Aug 26 20:46:12 D7510DC ollama[2730]:   current device: 0, in function alloc at //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:503
Aug 26 20:46:12 D7510DC ollama[2730]:   cuMemCreate(&handle, reserve_size, &prop, 0)
Aug 26 20:46:12 D7510DC ollama[2730]: //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:84: CUDA error
Aug 26 20:46:12 D7510DC ollama[18858]: [New LWP 18857]
Aug 26 20:46:12 D7510DC ollama[18858]: [New LWP 18852]
Aug 26 20:46:12 D7510DC ollama[18858]: [New LWP 18851]
Aug 26 20:46:12 D7510DC ollama[18858]: [New LWP 18850]
Aug 26 20:46:12 D7510DC ollama[18858]: [New LWP 18848]
Aug 26 20:46:12 D7510DC ollama[18858]: [New LWP 18847]
Aug 26 20:46:12 D7510DC ollama[18858]: [New LWP 18846]
Aug 26 20:46:12 D7510DC ollama[18858]: [New LWP 18845]
Aug 26 20:46:12 D7510DC ollama[18858]: [New LWP 18844]
Aug 26 20:46:12 D7510DC ollama[18858]: [New LWP 18843]
Aug 26 20:46:12 D7510DC ollama[18858]: [New LWP 18842]
Aug 26 20:46:12 D7510DC ollama[18858]: [New LWP 18841]
Aug 26 20:46:12 D7510DC ollama[18858]: [New LWP 18840]
Aug 26 20:46:12 D7510DC ollama[18858]: [Thread debugging using libthread_db enabled]
Aug 26 20:46:12 D7510DC ollama[18858]: Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Aug 26 20:46:12 D7510DC ollama[18858]: 0x000058058d9c4da3 in ?? ()
Aug 26 20:46:12 D7510DC ollama[18858]: #0  0x000058058d9c4da3 in ?? ()
Aug 26 20:46:12 D7510DC ollama[18858]: #1  0x000058058d981070 in ?? ()
Aug 26 20:46:12 D7510DC ollama[18858]: #2  0x000058058f704e60 in ?? ()
Aug 26 20:46:12 D7510DC ollama[18858]: #3  0x0000000000000080 in ?? ()
Aug 26 20:46:12 D7510DC ollama[18858]: #4  0x0000000000000000 in ?? ()
Aug 26 20:46:12 D7510DC ollama[18858]: [Inferior 1 (process 18838) detached]
Aug 26 20:46:12 D7510DC ollama[2730]: SIGABRT: abort
Aug 26 20:46:12 D7510DC ollama[2730]: PC=0x78dbafa9eb2c m=4 sigcode=18446744073709551610
Aug 26 20:46:12 D7510DC ollama[2730]: signal arrived during cgo execution
Aug 26 20:46:12 D7510DC ollama[2730]: goroutine 9 gp=0xc000504700 m=4 mp=0xc000079808 [syscall]:
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.cgocall(0x58058e696f60, 0xc00008bbd8)
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/cgocall.go:167 +0x4b fp=0xc00008bbb0 sp=0xc00008bb78 pc=0x58058d9b83eb
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/ollama/ollama/llama._Cfunc_llama_decode(0x5805a8e1b7b0, {0x12, 0x5805a90905b0, 0x0, 0x5805a90a2f50, 0x5805a908cfd0, 0x5805a9072e90, 0x5805a908ed40})
Aug 26 20:46:12 D7510DC ollama[2730]:         _cgo_gotypes.go:668 +0x4a fp=0xc00008bbd8 sp=0xc00008bbb0 pc=0x58058dd6796a
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/ollama/ollama/llama.(*Context).Decode.func1(...)
Aug 26 20:46:12 D7510DC ollama[2730]:         github.com/ollama/ollama/llama/llama.go:150
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/ollama/ollama/llama.(*Context).Decode(0xc00011cd88?, 0x1?)
Aug 26 20:46:12 D7510DC ollama[2730]:         github.com/ollama/ollama/llama/llama.go:150 +0xed fp=0xc00008bcc0 sp=0xc00008bbd8 pc=0x58058dd6a74d
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/ollama/ollama/runner/llamarunner.(*Server).processBatch(0xc0002e05a0, 0xc0007205f0, 0xc00011cf28)
Aug 26 20:46:12 D7510DC ollama[2730]:         github.com/ollama/ollama/runner/llamarunner/runner.go:441 +0x209 fp=0xc00008bee8 sp=0xc00008bcc0 pc=0x58058de31029
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/ollama/ollama/runner/llamarunner.(*Server).run(0xc0002e05a0, {0x58058ee620c0, 0xc000720190})
Aug 26 20:46:12 D7510DC ollama[2730]:         github.com/ollama/ollama/runner/llamarunner/runner.go:346 +0x1d5 fp=0xc00008bfb8 sp=0xc00008bee8 pc=0x58058de30cb5
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/ollama/ollama/runner/llamarunner.Execute.gowrap1()
Aug 26 20:46:12 D7510DC ollama[2730]:         github.com/ollama/ollama/runner/llamarunner/runner.go:880 +0x28 fp=0xc00008bfe0 sp=0xc00008bfb8 pc=0x58058de35a08
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.goexit({})
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/asm_amd64.s:1700 +0x1 fp=0xc00008bfe8 sp=0xc00008bfe0 pc=0x58058d9c2fa1
Aug 26 20:46:12 D7510DC ollama[2730]: created by github.com/ollama/ollama/runner/llamarunner.Execute in goroutine 1
Aug 26 20:46:12 D7510DC ollama[2730]:         github.com/ollama/ollama/runner/llamarunner/runner.go:880 +0x4c5
Aug 26 20:46:12 D7510DC ollama[2730]: goroutine 1 gp=0xc000002380 m=nil [IO wait]:
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/proc.go:435 +0xce fp=0xc00038f790 sp=0xc00038f770 pc=0x58058d9bb86e
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.netpollblock(0xc00050f7e0?, 0x8d954666?, 0x5?)
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/netpoll.go:575 +0xf7 fp=0xc00038f7c8 sp=0xc00038f790 pc=0x58058d980357
Aug 26 20:46:12 D7510DC ollama[2730]: internal/poll.runtime_pollWait(0x78dbafca7eb0, 0x72)
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/netpoll.go:351 +0x85 fp=0xc00038f7e8 sp=0xc00038f7c8 pc=0x58058d9baa85
Aug 26 20:46:12 D7510DC ollama[2730]: internal/poll.(*pollDesc).wait(0xc00061a300?, 0x900000036?, 0x0)
Aug 26 20:46:12 D7510DC ollama[2730]:         internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc00038f810 sp=0xc00038f7e8 pc=0x58058da41ec7
Aug 26 20:46:12 D7510DC ollama[2730]: internal/poll.(*pollDesc).waitRead(...)
Aug 26 20:46:12 D7510DC ollama[2730]:         internal/poll/fd_poll_runtime.go:89
Aug 26 20:46:12 D7510DC ollama[2730]: internal/poll.(*FD).Accept(0xc00061a300)
Aug 26 20:46:12 D7510DC ollama[2730]:         internal/poll/fd_unix.go:620 +0x295 fp=0xc00038f8b8 sp=0xc00038f810 pc=0x58058da47295
Aug 26 20:46:12 D7510DC ollama[2730]: net.(*netFD).accept(0xc00061a300)
Aug 26 20:46:12 D7510DC ollama[2730]:         net/fd_unix.go:172 +0x29 fp=0xc00038f970 sp=0xc00038f8b8 pc=0x58058daba249
Aug 26 20:46:12 D7510DC ollama[2730]: net.(*TCPListener).accept(0xc000728100)
Aug 26 20:46:12 D7510DC ollama[2730]:         net/tcpsock_posix.go:159 +0x1b fp=0xc00038f9c0 sp=0xc00038f970 pc=0x58058dacfbfb
Aug 26 20:46:12 D7510DC ollama[2730]: net.(*TCPListener).Accept(0xc000728100)
Aug 26 20:46:12 D7510DC ollama[2730]:         net/tcpsock.go:380 +0x30 fp=0xc00038f9f0 sp=0xc00038f9c0 pc=0x58058daceab0
Aug 26 20:46:12 D7510DC ollama[2730]: net/http.(*onceCloseListener).Accept(0xc0004c6090?)
Aug 26 20:46:12 D7510DC ollama[2730]:         <autogenerated>:1 +0x24 fp=0xc00038fa08 sp=0xc00038f9f0 pc=0x58058dce6204
Aug 26 20:46:12 D7510DC ollama[2730]: net/http.(*Server).Serve(0xc0004ac100, {0x58058ee5fc08, 0xc000728100})
Aug 26 20:46:12 D7510DC ollama[2730]:         net/http/server.go:3424 +0x30c fp=0xc00038fb38 sp=0xc00038fa08 pc=0x58058dcbdacc
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/ollama/ollama/runner/llamarunner.Execute({0xc000034260, 0x4, 0x4})
Aug 26 20:46:12 D7510DC ollama[2730]:         github.com/ollama/ollama/runner/llamarunner/runner.go:901 +0x8f5 fp=0xc00038fd08 sp=0xc00038fb38 pc=0x58058de35795
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/ollama/ollama/runner.Execute({0xc000034250?, 0x0?, 0x0?})
Aug 26 20:46:12 D7510DC ollama[2730]:         github.com/ollama/ollama/runner/runner.go:22 +0xd4 fp=0xc00038fd30 sp=0xc00038fd08 pc=0x58058debfd34
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/ollama/ollama/cmd.NewCLI.func2(0xc0001f1400?, {0x58058e97b081?, 0x4?, 0x58058e97b085?})
Aug 26 20:46:12 D7510DC ollama[2730]:         github.com/ollama/ollama/cmd/cmd.go:1583 +0x45 fp=0xc00038fd58 sp=0xc00038fd30 pc=0x58058e624f65
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/spf13/cobra.(*Command).execute(0xc00014cf08, {0xc0006967c0, 0x4, 0x4})
Aug 26 20:46:12 D7510DC ollama[2730]:         github.com/spf13/cobra@v1.7.0/command.go:940 +0x85c fp=0xc00038fe78 sp=0xc00038fd58 pc=0x58058db3389c
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/spf13/cobra.(*Command).ExecuteC(0xc00048ef08)
Aug 26 20:46:12 D7510DC ollama[2730]:         github.com/spf13/cobra@v1.7.0/command.go:1068 +0x3a5 fp=0xc00038ff30 sp=0xc00038fe78 pc=0x58058db340e5
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/spf13/cobra.(*Command).Execute(...)
Aug 26 20:46:12 D7510DC ollama[2730]:         github.com/spf13/cobra@v1.7.0/command.go:992
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/spf13/cobra.(*Command).ExecuteContext(...)
Aug 26 20:46:12 D7510DC ollama[2730]:         github.com/spf13/cobra@v1.7.0/command.go:985
Aug 26 20:46:12 D7510DC ollama[2730]: main.main()
Aug 26 20:46:12 D7510DC ollama[2730]:         github.com/ollama/ollama/main.go:12 +0x4d fp=0xc00038ff50 sp=0xc00038ff30 pc=0x58058e625a4d
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.main()
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/proc.go:283 +0x29d fp=0xc00038ffe0 sp=0xc00038ff50 pc=0x58058d9879dd
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.goexit({})
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/asm_amd64.s:1700 +0x1 fp=0xc00038ffe8 sp=0xc00038ffe0 pc=0x58058d9c2fa1
Aug 26 20:46:12 D7510DC ollama[2730]: goroutine 2 gp=0xc000002e00 m=nil [force gc (idle)]:
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/proc.go:435 +0xce fp=0xc000072fa8 sp=0xc000072f88 pc=0x58058d9bb86e
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.goparkunlock(...)
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/proc.go:441
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.forcegchelper()
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/proc.go:348 +0xb8 fp=0xc000072fe0 sp=0xc000072fa8 pc=0x58058d987d18
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.goexit({})
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/asm_amd64.s:1700 +0x1 fp=0xc000072fe8 sp=0xc000072fe0 pc=0x58058d9c2fa1
Aug 26 20:46:12 D7510DC ollama[2730]: created by runtime.init.7 in goroutine 1
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/proc.go:336 +0x1a
Aug 26 20:46:12 D7510DC ollama[2730]: goroutine 3 gp=0xc000003340 m=nil [GC sweep wait]:
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gopark(0x1?, 0x0?, 0x0?, 0x0?, 0x0?)
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/proc.go:435 +0xce fp=0xc000073780 sp=0xc000073760 pc=0x58058d9bb86e
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.goparkunlock(...)
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/proc.go:441
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.bgsweep(0xc00007e000)
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/mgcsweep.go:316 +0xdf fp=0xc0000737c8 sp=0xc000073780 pc=0x58058d9724bf
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gcenable.gowrap1()
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/mgc.go:204 +0x25 fp=0xc0000737e0 sp=0xc0000737c8 pc=0x58058d9668a5
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.goexit({})
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/asm_amd64.s:1700 +0x1 fp=0xc0000737e8 sp=0xc0000737e0 pc=0x58058d9c2fa1
Aug 26 20:46:12 D7510DC ollama[2730]: created by runtime.gcenable in goroutine 1
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/mgc.go:204 +0x66
Aug 26 20:46:12 D7510DC ollama[2730]: goroutine 4 gp=0xc000003500 m=nil [GC scavenge wait]:
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gopark(0x10000?, 0x58058eb3eb70?, 0x0?, 0x0?, 0x0?)
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/proc.go:435 +0xce fp=0xc000073f78 sp=0xc000073f58 pc=0x58058d9bb86e
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.goparkunlock(...)
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/proc.go:441
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.(*scavengerState).park(0x58058f701f00)
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/mgcscavenge.go:425 +0x49 fp=0xc000073fa8 sp=0xc000073f78 pc=0x58058d96ff09
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.bgscavenge(0xc00007e000)
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/mgcscavenge.go:658 +0x59 fp=0xc000073fc8 sp=0xc000073fa8 pc=0x58058d970499
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gcenable.gowrap2()
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/mgc.go:205 +0x25 fp=0xc000073fe0 sp=0xc000073fc8 pc=0x58058d966845
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.goexit({})
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/asm_amd64.s:1700 +0x1 fp=0xc000073fe8 sp=0xc000073fe0 pc=0x58058d9c2fa1
Aug 26 20:46:12 D7510DC ollama[2730]: created by runtime.gcenable in goroutine 1
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/mgc.go:205 +0xa5
Aug 26 20:46:12 D7510DC ollama[2730]: goroutine 5 gp=0xc000003dc0 m=nil [finalizer wait]:
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gopark(0x0?, 0x58058ee4cac0?, 0x0?, 0x60?, 0x1000000010?)
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/proc.go:435 +0xce fp=0xc000072630 sp=0xc000072610 pc=0x58058d9bb86e
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.runfinq()
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/mfinal.go:196 +0x107 fp=0xc0000727e0 sp=0xc000072630 pc=0x58058d965867
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.goexit({})
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/asm_amd64.s:1700 +0x1 fp=0xc0000727e8 sp=0xc0000727e0 pc=0x58058d9c2fa1
Aug 26 20:46:12 D7510DC ollama[2730]: created by runtime.createfing in goroutine 1
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/mfinal.go:166 +0x3d
Aug 26 20:46:12 D7510DC ollama[2730]: goroutine 6 gp=0xc0001d08c0 m=nil [chan receive]:
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gopark(0xc000223540?, 0xc000312018?, 0x60?, 0x47?, 0x58058daa0e88?)
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/proc.go:435 +0xce fp=0xc000074718 sp=0xc0000746f8 pc=0x58058d9bb86e
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.chanrecv(0xc00003c380, 0x0, 0x1)
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/chan.go:664 +0x445 fp=0xc000074790 sp=0xc000074718 pc=0x58058d957245
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.chanrecv1(0x0?, 0x0?)
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/chan.go:506 +0x12 fp=0xc0000747b8 sp=0xc000074790 pc=0x58058d956dd2
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.unique_runtime_registerUniqueMapCleanup.func2(...)
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/mgc.go:1796
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.unique_runtime_registerUniqueMapCleanup.gowrap1()
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/mgc.go:1799 +0x2f fp=0xc0000747e0 sp=0xc0000747b8 pc=0x58058d969a4f
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.goexit({})
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/asm_amd64.s:1700 +0x1 fp=0xc0000747e8 sp=0xc0000747e0 pc=0x58058d9c2fa1
Aug 26 20:46:12 D7510DC ollama[2730]: created by unique.runtime_registerUniqueMapCleanup in goroutine 1
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/mgc.go:1794 +0x85
Aug 26 20:46:12 D7510DC ollama[2730]: goroutine 7 gp=0xc0001d0e00 m=nil [GC worker (idle)]:
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/proc.go:435 +0xce fp=0xc000074f38 sp=0xc000074f18 pc=0x58058d9bb86e
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gcBgMarkWorker(0xc00003d7a0)
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/mgc.go:1423 +0xe9 fp=0xc000074fc8 sp=0xc000074f38 pc=0x58058d968d69
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gcBgMarkStartWorkers.gowrap1()
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/mgc.go:1339 +0x25 fp=0xc000074fe0 sp=0xc000074fc8 pc=0x58058d968c45
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.goexit({})
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/asm_amd64.s:1700 +0x1 fp=0xc000074fe8 sp=0xc000074fe0 pc=0x58058d9c2fa1
Aug 26 20:46:12 D7510DC ollama[2730]: created by runtime.gcBgMarkStartWorkers in goroutine 1
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/mgc.go:1339 +0x105
Aug 26 20:46:12 D7510DC ollama[2730]: goroutine 18 gp=0xc000504000 m=nil [GC worker (idle)]:
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gopark(0x8e84710ceb0?, 0x0?, 0x0?, 0x0?, 0x0?)
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/proc.go:435 +0xce fp=0xc00006e738 sp=0xc00006e718 pc=0x58058d9bb86e
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gcBgMarkWorker(0xc00003d7a0)
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/mgc.go:1423 +0xe9 fp=0xc00006e7c8 sp=0xc00006e738 pc=0x58058d968d69
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gcBgMarkStartWorkers.gowrap1()
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/mgc.go:1339 +0x25 fp=0xc00006e7e0 sp=0xc00006e7c8 pc=0x58058d968c45
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.goexit({})
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/asm_amd64.s:1700 +0x1 fp=0xc00006e7e8 sp=0xc00006e7e0 pc=0x58058d9c2fa1
Aug 26 20:46:12 D7510DC ollama[2730]: created by runtime.gcBgMarkStartWorkers in goroutine 1
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/mgc.go:1339 +0x105
Aug 26 20:46:12 D7510DC ollama[2730]: goroutine 34 gp=0xc000102380 m=nil [GC worker (idle)]:
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gopark(0x8e8471028f7?, 0x3?, 0x27?, 0x1f?, 0x0?)
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/proc.go:435 +0xce fp=0xc00011a738 sp=0xc00011a718 pc=0x58058d9bb86e
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gcBgMarkWorker(0xc00003d7a0)
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/mgc.go:1423 +0xe9 fp=0xc00011a7c8 sp=0xc00011a738 pc=0x58058d968d69
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gcBgMarkStartWorkers.gowrap1()
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/mgc.go:1339 +0x25 fp=0xc00011a7e0 sp=0xc00011a7c8 pc=0x58058d968c45
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.goexit({})
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/asm_amd64.s:1700 +0x1 fp=0xc00011a7e8 sp=0xc00011a7e0 pc=0x58058d9c2fa1
Aug 26 20:46:12 D7510DC ollama[2730]: created by runtime.gcBgMarkStartWorkers in goroutine 1
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/mgc.go:1339 +0x105
Aug 26 20:46:12 D7510DC ollama[2730]: goroutine 35 gp=0xc000102540 m=nil [GC worker (idle)]:
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gopark(0x8e8470ed3a4?, 0x0?, 0x0?, 0x0?, 0x0?)
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/proc.go:435 +0xce fp=0xc00011af38 sp=0xc00011af18 pc=0x58058d9bb86e
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gcBgMarkWorker(0xc00003d7a0)
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/mgc.go:1423 +0xe9 fp=0xc00011afc8 sp=0xc00011af38 pc=0x58058d968d69
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gcBgMarkStartWorkers.gowrap1()
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/mgc.go:1339 +0x25 fp=0xc00011afe0 sp=0xc00011afc8 pc=0x58058d968c45
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.goexit({})
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/asm_amd64.s:1700 +0x1 fp=0xc00011afe8 sp=0xc00011afe0 pc=0x58058d9c2fa1
Aug 26 20:46:12 D7510DC ollama[2730]: created by runtime.gcBgMarkStartWorkers in goroutine 1
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/mgc.go:1339 +0x105
Aug 26 20:46:12 D7510DC ollama[2730]: goroutine 8 gp=0xc0001d0fc0 m=nil [GC worker (idle)]:
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gopark(0x8e8470ef77d?, 0x0?, 0x0?, 0x0?, 0x0?)
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/proc.go:435 +0xce fp=0xc000075738 sp=0xc000075718 pc=0x58058d9bb86e
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gcBgMarkWorker(0xc00003d7a0)
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/mgc.go:1423 +0xe9 fp=0xc0000757c8 sp=0xc000075738 pc=0x58058d968d69
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gcBgMarkStartWorkers.gowrap1()
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/mgc.go:1339 +0x25 fp=0xc0000757e0 sp=0xc0000757c8 pc=0x58058d968c45
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.goexit({})
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/asm_amd64.s:1700 +0x1 fp=0xc0000757e8 sp=0xc0000757e0 pc=0x58058d9c2fa1
Aug 26 20:46:12 D7510DC ollama[2730]: created by runtime.gcBgMarkStartWorkers in goroutine 1
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/mgc.go:1339 +0x105
Aug 26 20:46:12 D7510DC ollama[2730]: goroutine 19 gp=0xc0005041c0 m=nil [GC worker (idle)]:
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gopark(0x8e8470ec42b?, 0x0?, 0x0?, 0x0?, 0x0?)
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/proc.go:435 +0xce fp=0xc00006ef38 sp=0xc00006ef18 pc=0x58058d9bb86e
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gcBgMarkWorker(0xc00003d7a0)
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/mgc.go:1423 +0xe9 fp=0xc00006efc8 sp=0xc00006ef38 pc=0x58058d968d69
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gcBgMarkStartWorkers.gowrap1()
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/mgc.go:1339 +0x25 fp=0xc00006efe0 sp=0xc00006efc8 pc=0x58058d968c45
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.goexit({})
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/asm_amd64.s:1700 +0x1 fp=0xc00006efe8 sp=0xc00006efe0 pc=0x58058d9c2fa1
Aug 26 20:46:12 D7510DC ollama[2730]: created by runtime.gcBgMarkStartWorkers in goroutine 1
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/mgc.go:1339 +0x105
Aug 26 20:46:12 D7510DC ollama[2730]: goroutine 20 gp=0xc000504380 m=nil [GC worker (idle)]:
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gopark(0x8e8470eef41?, 0x0?, 0x0?, 0x0?, 0x0?)
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/proc.go:435 +0xce fp=0xc00006f738 sp=0xc00006f718 pc=0x58058d9bb86e
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gcBgMarkWorker(0xc00003d7a0)
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/mgc.go:1423 +0xe9 fp=0xc00006f7c8 sp=0xc00006f738 pc=0x58058d968d69
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gcBgMarkStartWorkers.gowrap1()
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/mgc.go:1339 +0x25 fp=0xc00006f7e0 sp=0xc00006f7c8 pc=0x58058d968c45
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.goexit({})
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/asm_amd64.s:1700 +0x1 fp=0xc00006f7e8 sp=0xc00006f7e0 pc=0x58058d9c2fa1
Aug 26 20:46:12 D7510DC ollama[2730]: created by runtime.gcBgMarkStartWorkers in goroutine 1
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/mgc.go:1339 +0x105
Aug 26 20:46:12 D7510DC ollama[2730]: goroutine 21 gp=0xc000504540 m=nil [GC worker (idle)]:
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gopark(0x8e8470ee258?, 0x0?, 0x0?, 0x0?, 0x0?)
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/proc.go:435 +0xce fp=0xc00006ff38 sp=0xc00006ff18 pc=0x58058d9bb86e
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gcBgMarkWorker(0xc00003d7a0)
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/mgc.go:1423 +0xe9 fp=0xc00006ffc8 sp=0xc00006ff38 pc=0x58058d968d69
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gcBgMarkStartWorkers.gowrap1()
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/mgc.go:1339 +0x25 fp=0xc00006ffe0 sp=0xc00006ffc8 pc=0x58058d968c45
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.goexit({})
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/asm_amd64.s:1700 +0x1 fp=0xc00006ffe8 sp=0xc00006ffe0 pc=0x58058d9c2fa1
Aug 26 20:46:12 D7510DC ollama[2730]: created by runtime.gcBgMarkStartWorkers in goroutine 1
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/mgc.go:1339 +0x105
Aug 26 20:46:12 D7510DC ollama[2730]: goroutine 10 gp=0xc0005048c0 m=nil [select]:
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gopark(0xc000049a78?, 0x2?, 0xa?, 0x0?, 0xc0000498c4?)
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/proc.go:435 +0xce fp=0xc0000496f8 sp=0xc0000496d8 pc=0x58058d9bb86e
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.selectgo(0xc000049a78, 0xc0000498c0, 0x12?, 0x0, 0x1?, 0x1)
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/select.go:351 +0x837 fp=0xc000049830 sp=0xc0000496f8 pc=0x58058d999ed7
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/ollama/ollama/runner/llamarunner.(*Server).completion(0xc0002e05a0, {0x58058ee5fde8, 0xc0006269a0}, 0xc00016e780)
Aug 26 20:46:12 D7510DC ollama[2730]:         github.com/ollama/ollama/runner/llamarunner/runner.go:629 +0xb37 fp=0xc000049ac0 sp=0xc000049830 pc=0x58058de32c37
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/ollama/ollama/runner/llamarunner.(*Server).completion-fm({0x58058ee5fde8?, 0xc0006269a0?}, 0xc000049b40?)
Aug 26 20:46:12 D7510DC ollama[2730]:         <autogenerated>:1 +0x36 fp=0xc000049af0 sp=0xc000049ac0 pc=0x58058de35e16
Aug 26 20:46:12 D7510DC ollama[2730]: net/http.HandlerFunc.ServeHTTP(0xc0007260c0?, {0x58058ee5fde8?, 0xc0006269a0?}, 0xc000049b60?)
Aug 26 20:46:12 D7510DC ollama[2730]:         net/http/server.go:2294 +0x29 fp=0xc000049b18 sp=0xc000049af0 pc=0x58058dcba109
Aug 26 20:46:12 D7510DC ollama[2730]: net/http.(*ServeMux).ServeHTTP(0x58058d95fd85?, {0x58058ee5fde8, 0xc0006269a0}, 0xc00016e780)
Aug 26 20:46:12 D7510DC ollama[2730]:         net/http/server.go:2822 +0x1c4 fp=0xc000049b68 sp=0xc000049b18 pc=0x58058dcbc004
Aug 26 20:46:12 D7510DC ollama[2730]: net/http.serverHandler.ServeHTTP({0x58058ee5c470?}, {0x58058ee5fde8?, 0xc0006269a0?}, 0x1?)
Aug 26 20:46:12 D7510DC ollama[2730]:         net/http/server.go:3301 +0x8e fp=0xc000049b98 sp=0xc000049b68 pc=0x58058dcd9a8e
Aug 26 20:46:12 D7510DC ollama[2730]: net/http.(*conn).serve(0xc0004c6090, {0x58058ee62088, 0xc000706570})
Aug 26 20:46:12 D7510DC ollama[2730]:         net/http/server.go:2102 +0x625 fp=0xc000049fb8 sp=0xc000049b98 pc=0x58058dcb8605
Aug 26 20:46:12 D7510DC ollama[2730]: net/http.(*Server).Serve.gowrap3()
Aug 26 20:46:12 D7510DC ollama[2730]:         net/http/server.go:3454 +0x28 fp=0xc000049fe0 sp=0xc000049fb8 pc=0x58058dcbdec8
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.goexit({})
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/asm_amd64.s:1700 +0x1 fp=0xc000049fe8 sp=0xc000049fe0 pc=0x58058d9c2fa1
Aug 26 20:46:12 D7510DC ollama[2730]: created by net/http.(*Server).Serve in goroutine 1
Aug 26 20:46:12 D7510DC ollama[2730]:         net/http/server.go:3454 +0x485
Aug 26 20:46:12 D7510DC ollama[2730]: goroutine 45 gp=0xc0001d1880 m=nil [IO wait]:
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0xb?)
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/proc.go:435 +0xce fp=0xc000117dd8 sp=0xc000117db8 pc=0x58058d9bb86e
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.netpollblock(0x58058d9debd8?, 0x8d954666?, 0x5?)
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/netpoll.go:575 +0xf7 fp=0xc000117e10 sp=0xc000117dd8 pc=0x58058d980357
Aug 26 20:46:12 D7510DC ollama[2730]: internal/poll.runtime_pollWait(0x78dbafca7d98, 0x72)
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/netpoll.go:351 +0x85 fp=0xc000117e30 sp=0xc000117e10 pc=0x58058d9baa85
Aug 26 20:46:12 D7510DC ollama[2730]: internal/poll.(*pollDesc).wait(0xc00061a380?, 0xc0000b2431?, 0x0)
Aug 26 20:46:12 D7510DC ollama[2730]:         internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc000117e58 sp=0xc000117e30 pc=0x58058da41ec7
Aug 26 20:46:12 D7510DC ollama[2730]: internal/poll.(*pollDesc).waitRead(...)
Aug 26 20:46:12 D7510DC ollama[2730]:         internal/poll/fd_poll_runtime.go:89
Aug 26 20:46:12 D7510DC ollama[2730]: internal/poll.(*FD).Read(0xc00061a380, {0xc0000b2431, 0x1, 0x1})
Aug 26 20:46:12 D7510DC ollama[2730]:         internal/poll/fd_unix.go:165 +0x27a fp=0xc000117ef0 sp=0xc000117e58 pc=0x58058da431ba
Aug 26 20:46:12 D7510DC ollama[2730]: net.(*netFD).Read(0xc00061a380, {0xc0000b2431?, 0xc0000515d8?, 0xc000117f70?})
Aug 26 20:46:12 D7510DC ollama[2730]:         net/fd_posix.go:55 +0x25 fp=0xc000117f38 sp=0xc000117ef0 pc=0x58058dab82a5
Aug 26 20:46:12 D7510DC ollama[2730]: net.(*conn).Read(0xc000604050, {0xc0000b2431?, 0x0?, 0x0?})
Aug 26 20:46:12 D7510DC ollama[2730]:         net/net.go:194 +0x45 fp=0xc000117f80 sp=0xc000117f38 pc=0x58058dac6665
Aug 26 20:46:12 D7510DC ollama[2730]: net/http.(*connReader).backgroundRead(0xc0000b2420)
Aug 26 20:46:12 D7510DC ollama[2730]:         net/http/server.go:690 +0x37 fp=0xc000117fc8 sp=0xc000117f80 pc=0x58058dcb24d7
Aug 26 20:46:12 D7510DC ollama[2730]: net/http.(*connReader).startBackgroundRead.gowrap2()
Aug 26 20:46:12 D7510DC ollama[2730]:         net/http/server.go:686 +0x25 fp=0xc000117fe0 sp=0xc000117fc8 pc=0x58058dcb2405
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.goexit({})
Aug 26 20:46:12 D7510DC ollama[2730]:         runtime/asm_amd64.s:1700 +0x1 fp=0xc000117fe8 sp=0xc000117fe0 pc=0x58058d9c2fa1
Aug 26 20:46:12 D7510DC ollama[2730]: created by net/http.(*connReader).startBackgroundRead in goroutine 10
Aug 26 20:46:12 D7510DC ollama[2730]:         net/http/server.go:686 +0xb6
Aug 26 20:46:12 D7510DC ollama[2730]: rax    0x0
Aug 26 20:46:12 D7510DC ollama[2730]: rbx    0x499a
Aug 26 20:46:12 D7510DC ollama[2730]: rcx    0x78dbafa9eb2c
Aug 26 20:46:12 D7510DC ollama[2730]: rdx    0x6
Aug 26 20:46:12 D7510DC ollama[2730]: rdi    0x4996
Aug 26 20:46:12 D7510DC ollama[2730]: rsi    0x499a
Aug 26 20:46:12 D7510DC ollama[2730]: rbp    0x78db680f71d0
Aug 26 20:46:12 D7510DC ollama[2730]: rsp    0x78db680f7190
Aug 26 20:46:12 D7510DC ollama[2730]: r8     0x0
Aug 26 20:46:12 D7510DC ollama[2730]: r9     0x7
Aug 26 20:46:12 D7510DC ollama[2730]: r10    0x8
Aug 26 20:46:12 D7510DC ollama[2730]: r11    0x246
Aug 26 20:46:12 D7510DC ollama[2730]: r12    0x6
Aug 26 20:46:12 D7510DC ollama[2730]: r13    0x78daf4914420
Aug 26 20:46:12 D7510DC ollama[2730]: r14    0x16
Aug 26 20:46:12 D7510DC ollama[2730]: r15    0xbbf2c0000
Aug 26 20:46:12 D7510DC ollama[2730]: rip    0x78dbafa9eb2c
Aug 26 20:46:12 D7510DC ollama[2730]: rflags 0x246
Aug 26 20:46:12 D7510DC ollama[2730]: cs     0x33
Aug 26 20:46:12 D7510DC ollama[2730]: fs     0x0
Aug 26 20:46:12 D7510DC ollama[2730]: gs     0x0
Aug 26 20:46:12 D7510DC ollama[2730]: time=2025-08-26T20:46:12.895-04:00 level=ERROR source=server.go:1439 msg="post predict" error="Post \"http://127.0.0.1:32793/completion\": EOF"
Aug 26 20:46:12 D7510DC ollama[2730]: [GIN] 2025/08/26 - 20:46:12 | 200 |  767.998503ms |       127.0.0.1 | POST     "/api/chat"
Aug 26 20:46:12 D7510DC ollama[2730]: time=2025-08-26T20:46:12.914-04:00 level=ERROR source=server.go:409 msg="llama runner terminated" error="exit status 2"

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.11.7
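Given the numbers in the title (3.8 GiB required vs 3.9 GiB free), a CPU-only run is a useful A/B test to confirm this is a GPU-offload problem rather than a model problem. A minimal sketch, assuming a stock install with the server on the default endpoint and using the documented `num_gpu` option (0 disables GPU offload):

```shell
# Hedged workaround sketch: force CPU-only inference by setting
# num_gpu to 0, so the runner never attempts a near-full GPU offload.
REQ='{"model":"phi4-mini-reasoning:3.8b","prompt":"hello","options":{"num_gpu":0},"stream":false}'

# With ollama serve running, send it:
#   curl http://127.0.0.1:11434/api/generate -d "$REQ"

# Sanity-check the request payload is valid JSON before sending.
echo "$REQ" | python3 -m json.tool
```

If the model answers normally with `num_gpu` set to 0 but crashes with the default offload, that points at the scheduler's VRAM estimate rather than the model files.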

D7510DC ollama[2730]: llama_model_load_from_file_impl: using device CUDA0 (Quadro M2000M) - 3999 MiB free Aug 26 20:46:03 D7510DC ollama[2730]: time=2025-08-26T20:46:03.490-04:00 level=INFO source=server.go:1231 msg="waiting for llama runner to start responding" Aug 26 20:46:03 D7510DC ollama[2730]: time=2025-08-26T20:46:03.490-04:00 level=INFO source=server.go:1265 msg="waiting for server to become available" status="llm server loading model" Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: loaded meta data with 36 key-value pairs and 196 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-f4dd2368e6c32725dc1c5c5548ae9ee2724d6a79052952eb50b65e26288022c4 (version GGUF V3 (latest)) Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv 0: general.architecture str = phi3 Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv 1: phi3.rope.scaling.attn_factor f32 = 1.190238 Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv 2: general.type str = model Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv 3: general.name str = Phi 4 Mini Reasoning Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv 4: general.finetune str = reasoning Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv 5: general.basename str = Phi-4 Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv 6: general.size_label str = mini Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv 7: general.license str = mit Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/microsoft/Phi-... Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv 9: general.tags arr[str,4] = ["nlp", "math", "code", "text-generat... 
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv 10: general.languages arr[str,1] = ["en"]
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv 11: phi3.context_length u32 = 131072
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv 12: phi3.rope.scaling.original_context_length u32 = 4096
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv 13: phi3.embedding_length u32 = 3072
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv 14: phi3.feed_forward_length u32 = 8192
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv 15: phi3.block_count u32 = 32
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv 16: phi3.attention.head_count u32 = 24
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv 17: phi3.attention.head_count_kv u32 = 8
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv 18: phi3.attention.layer_norm_rms_epsilon f32 = 0.000010
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv 19: phi3.rope.dimension_count u32 = 96
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv 20: phi3.rope.freq_base f32 = 10000.000000
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv 21: phi3.attention.sliding_window u32 = 262144
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv 23: tokenizer.ggml.pre str = gpt-4o
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,200064] = ["!", "\"", "#", "$", "%", "&", "'", ...
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,200064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,199742] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "e r", ...
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 199999
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 199999
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv 29: tokenizer.ggml.unknown_token_id u32 = 199999
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 199999
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv 31: tokenizer.ggml.add_bos_token bool = false
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv 32: tokenizer.ggml.add_eos_token bool = false
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv 33: tokenizer.chat_template str = {{ '<|system|>Your name is Phi, an AI...
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv 34: general.quantization_version u32 = 2
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - kv 35: general.file_type u32 = 15
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - type f32: 67 tensors
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - type f16: 32 tensors
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - type q4_K: 80 tensors
Aug 26 20:46:03 D7510DC ollama[2730]: llama_model_loader: - type q6_K: 17 tensors
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: file format = GGUF V3 (latest)
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: file type = Q4_K - Medium
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: file size = 2.93 GiB (6.56 BPW)
Aug 26 20:46:03 D7510DC ollama[2730]: load_hparams: Phi SWA is currently disabled - results might be suboptimal for some models (see https://github.com/ggml-org/llama.cpp/pull/13676)
Aug 26 20:46:03 D7510DC ollama[2730]: load: printing all EOG tokens:
Aug 26 20:46:03 D7510DC ollama[2730]: load: - 199999 ('<|endoftext|>')
Aug 26 20:46:03 D7510DC ollama[2730]: load: - 200020 ('<|end|>')
Aug 26 20:46:03 D7510DC ollama[2730]: load: special tokens cache size = 12
Aug 26 20:46:03 D7510DC ollama[2730]: load: token to piece cache size = 1.3333 MB
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: arch = phi3
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: vocab_only = 0
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: n_ctx_train = 131072
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: n_embd = 3072
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: n_layer = 32
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: n_head = 24
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: n_head_kv = 8
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: n_rot = 96
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: n_swa = 0
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: is_swa_any = 0
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: n_embd_head_k = 128
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: n_embd_head_v = 128
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: n_gqa = 3
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: n_embd_k_gqa = 1024
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: n_embd_v_gqa = 1024
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: f_norm_eps = 0.0e+00
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: f_norm_rms_eps = 1.0e-05
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: f_clamp_kqv = 0.0e+00
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: f_max_alibi_bias = 0.0e+00
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: f_logit_scale = 0.0e+00
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: f_attn_scale = 0.0e+00
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: n_ff = 8192
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: n_expert = 0
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: n_expert_used = 0
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: causal attn = 1
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: pooling type = 0
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: rope type = 2
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: rope scaling = linear
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: freq_base_train = 10000.0
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: freq_scale_train = 1
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: n_ctx_orig_yarn = 4096
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: rope_finetuned = unknown
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: model type = 3B
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: model params = 3.84 B
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: general.name = Phi 4 Mini Reasoning
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: vocab type = BPE
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: n_vocab = 200064
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: n_merges = 199742
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: BOS token = 199999 '<|endoftext|>'
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: EOS token = 199999 '<|endoftext|>'
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: EOT token = 199999 '<|endoftext|>'
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: UNK token = 199999 '<|endoftext|>'
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: PAD token = 199999 '<|endoftext|>'
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: LF token = 198 'Ċ'
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: EOG token = 199999 '<|endoftext|>'
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: EOG token = 200020 '<|end|>'
Aug 26 20:46:03 D7510DC ollama[2730]: print_info: max token length = 256
Aug 26 20:46:03 D7510DC ollama[2730]: load_tensors: loading model tensors, this can take a while... (mmap = true)
Aug 26 20:46:06 D7510DC ollama[2730]: load_tensors: offloading 32 repeating layers to GPU
Aug 26 20:46:06 D7510DC ollama[2730]: load_tensors: offloaded 32/33 layers to GPU
Aug 26 20:46:06 D7510DC ollama[2730]: load_tensors: CUDA0 model buffer size = 2517.75 MiB
Aug 26 20:46:06 D7510DC ollama[2730]: load_tensors: CPU_Mapped model buffer size = 480.82 MiB
Aug 26 20:46:07 D7510DC ollama[2730]: llama_context: constructing llama_context
Aug 26 20:46:07 D7510DC ollama[2730]: llama_context: n_seq_max = 1
Aug 26 20:46:07 D7510DC ollama[2730]: llama_context: n_ctx = 4096
Aug 26 20:46:07 D7510DC ollama[2730]: llama_context: n_ctx_per_seq = 4096
Aug 26 20:46:07 D7510DC ollama[2730]: llama_context: n_batch = 512
Aug 26 20:46:07 D7510DC ollama[2730]: llama_context: n_ubatch = 512
Aug 26 20:46:07 D7510DC ollama[2730]: llama_context: causal_attn = 1
Aug 26 20:46:07 D7510DC ollama[2730]: llama_context: flash_attn = 0
Aug 26 20:46:07 D7510DC ollama[2730]: llama_context: kv_unified = false
Aug 26 20:46:07 D7510DC ollama[2730]: llama_context: freq_base = 10000.0
Aug 26 20:46:07 D7510DC ollama[2730]: llama_context: freq_scale = 1
Aug 26 20:46:07 D7510DC ollama[2730]: llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
Aug 26 20:46:07 D7510DC ollama[2730]: llama_context: CPU output buffer size = 0.77 MiB
Aug 26 20:46:07 D7510DC ollama[2730]: llama_kv_cache_unified: CUDA0 KV buffer size = 512.00 MiB
Aug 26 20:46:07 D7510DC ollama[2730]: llama_kv_cache_unified: size = 512.00 MiB ( 4096 cells, 32 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
Aug 26 20:46:07 D7510DC ollama[2730]: llama_context: CUDA0 compute buffer size = 883.56 MiB
Aug 26 20:46:07 D7510DC ollama[2730]: llama_context: CUDA_Host compute buffer size = 18.01 MiB
Aug 26 20:46:07 D7510DC ollama[2730]: llama_context: graph nodes = 1126
Aug 26 20:46:07 D7510DC ollama[2730]: llama_context: graph splits = 4 (with bs=512), 3 (with bs=1)
Aug 26 20:46:07 D7510DC ollama[2730]: time=2025-08-26T20:46:07.253-04:00 level=INFO source=server.go:1269 msg="llama runner started in 3.99 seconds"
Aug 26 20:46:07 D7510DC ollama[2730]: time=2025-08-26T20:46:07.253-04:00 level=INFO source=sched.go:473 msg="loaded runners" count=1
Aug 26 20:46:07 D7510DC ollama[2730]: time=2025-08-26T20:46:07.253-04:00 level=INFO source=server.go:1231 msg="waiting for llama runner to start responding"
Aug 26 20:46:07 D7510DC ollama[2730]: time=2025-08-26T20:46:07.253-04:00 level=INFO source=server.go:1269 msg="llama runner started in 3.99 seconds"
Aug 26 20:46:07 D7510DC ollama[2730]: [GIN] 2025/08/26 - 20:46:07 | 200 | 4.805253518s | 127.0.0.1 | POST "/api/generate"
Aug 26 20:46:12 D7510DC ollama[2730]: CUDA error: out of memory
Aug 26 20:46:12 D7510DC ollama[2730]: current device: 0, in function alloc at //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:503
Aug 26 20:46:12 D7510DC ollama[2730]: cuMemCreate(&handle, reserve_size, &prop, 0)
Aug 26 20:46:12 D7510DC ollama[2730]: //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:84: CUDA error
Aug 26 20:46:12 D7510DC ollama[18858]: [New LWP 18857]
Aug 26 20:46:12 D7510DC ollama[18858]: [New LWP 18852]
Aug 26 20:46:12 D7510DC ollama[18858]: [New LWP 18851]
Aug 26 20:46:12 D7510DC ollama[18858]: [New LWP 18850]
Aug 26 20:46:12 D7510DC ollama[18858]: [New LWP 18848]
Aug 26 20:46:12 D7510DC ollama[18858]: [New LWP 18847]
Aug 26 20:46:12 D7510DC ollama[18858]: [New LWP 18846]
Aug 26 20:46:12 D7510DC ollama[18858]: [New LWP 18845]
Aug 26 20:46:12 D7510DC ollama[18858]: [New LWP 18844]
Aug 26 20:46:12 D7510DC ollama[18858]: [New LWP 18843]
Aug 26 20:46:12 D7510DC ollama[18858]: [New LWP 18842]
Aug 26 20:46:12 D7510DC ollama[18858]: [New LWP 18841]
Aug 26 20:46:12 D7510DC ollama[18858]: [New LWP 18840]
Aug 26 20:46:12 D7510DC ollama[18858]: [Thread debugging using libthread_db enabled]
Aug 26 20:46:12 D7510DC ollama[18858]: Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Aug 26 20:46:12 D7510DC ollama[18858]: 0x000058058d9c4da3 in ?? ()
Aug 26 20:46:12 D7510DC ollama[18858]: #0 0x000058058d9c4da3 in ?? ()
Aug 26 20:46:12 D7510DC ollama[18858]: #1 0x000058058d981070 in ?? ()
Aug 26 20:46:12 D7510DC ollama[18858]: #2 0x000058058f704e60 in ?? ()
Aug 26 20:46:12 D7510DC ollama[18858]: #3 0x0000000000000080 in ?? ()
Aug 26 20:46:12 D7510DC ollama[18858]: #4 0x0000000000000000 in ?? ()
Aug 26 20:46:12 D7510DC ollama[18858]: [Inferior 1 (process 18838) detached]
Aug 26 20:46:12 D7510DC ollama[2730]: SIGABRT: abort
Aug 26 20:46:12 D7510DC ollama[2730]: PC=0x78dbafa9eb2c m=4 sigcode=18446744073709551610
Aug 26 20:46:12 D7510DC ollama[2730]: signal arrived during cgo execution
Aug 26 20:46:12 D7510DC ollama[2730]: goroutine 9 gp=0xc000504700 m=4 mp=0xc000079808 [syscall]:
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.cgocall(0x58058e696f60, 0xc00008bbd8)
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/cgocall.go:167 +0x4b fp=0xc00008bbb0 sp=0xc00008bb78 pc=0x58058d9b83eb
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/ollama/ollama/llama._Cfunc_llama_decode(0x5805a8e1b7b0, {0x12, 0x5805a90905b0, 0x0, 0x5805a90a2f50, 0x5805a908cfd0, 0x5805a9072e90, 0x5805a908ed40})
Aug 26 20:46:12 D7510DC ollama[2730]: _cgo_gotypes.go:668 +0x4a fp=0xc00008bbd8 sp=0xc00008bbb0 pc=0x58058dd6796a
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/ollama/ollama/llama.(*Context).Decode.func1(...)
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/ollama/ollama/llama/llama.go:150
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/ollama/ollama/llama.(*Context).Decode(0xc00011cd88?, 0x1?)
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/ollama/ollama/llama/llama.go:150 +0xed fp=0xc00008bcc0 sp=0xc00008bbd8 pc=0x58058dd6a74d
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/ollama/ollama/runner/llamarunner.(*Server).processBatch(0xc0002e05a0, 0xc0007205f0, 0xc00011cf28)
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/ollama/ollama/runner/llamarunner/runner.go:441 +0x209 fp=0xc00008bee8 sp=0xc00008bcc0 pc=0x58058de31029
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/ollama/ollama/runner/llamarunner.(*Server).run(0xc0002e05a0, {0x58058ee620c0, 0xc000720190})
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/ollama/ollama/runner/llamarunner/runner.go:346 +0x1d5 fp=0xc00008bfb8 sp=0xc00008bee8 pc=0x58058de30cb5
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/ollama/ollama/runner/llamarunner.Execute.gowrap1()
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/ollama/ollama/runner/llamarunner/runner.go:880 +0x28 fp=0xc00008bfe0 sp=0xc00008bfb8 pc=0x58058de35a08
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.goexit({})
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/asm_amd64.s:1700 +0x1 fp=0xc00008bfe8 sp=0xc00008bfe0 pc=0x58058d9c2fa1
Aug 26 20:46:12 D7510DC ollama[2730]: created by github.com/ollama/ollama/runner/llamarunner.Execute in goroutine 1
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/ollama/ollama/runner/llamarunner/runner.go:880 +0x4c5
Aug 26 20:46:12 D7510DC ollama[2730]: goroutine 1 gp=0xc000002380 m=nil [IO wait]:
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/proc.go:435 +0xce fp=0xc00038f790 sp=0xc00038f770 pc=0x58058d9bb86e
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.netpollblock(0xc00050f7e0?, 0x8d954666?, 0x5?)
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/netpoll.go:575 +0xf7 fp=0xc00038f7c8 sp=0xc00038f790 pc=0x58058d980357
Aug 26 20:46:12 D7510DC ollama[2730]: internal/poll.runtime_pollWait(0x78dbafca7eb0, 0x72)
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/netpoll.go:351 +0x85 fp=0xc00038f7e8 sp=0xc00038f7c8 pc=0x58058d9baa85
Aug 26 20:46:12 D7510DC ollama[2730]: internal/poll.(*pollDesc).wait(0xc00061a300?, 0x900000036?, 0x0)
Aug 26 20:46:12 D7510DC ollama[2730]: internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc00038f810 sp=0xc00038f7e8 pc=0x58058da41ec7
Aug 26 20:46:12 D7510DC ollama[2730]: internal/poll.(*pollDesc).waitRead(...)
Aug 26 20:46:12 D7510DC ollama[2730]: internal/poll/fd_poll_runtime.go:89
Aug 26 20:46:12 D7510DC ollama[2730]: internal/poll.(*FD).Accept(0xc00061a300)
Aug 26 20:46:12 D7510DC ollama[2730]: internal/poll/fd_unix.go:620 +0x295 fp=0xc00038f8b8 sp=0xc00038f810 pc=0x58058da47295
Aug 26 20:46:12 D7510DC ollama[2730]: net.(*netFD).accept(0xc00061a300)
Aug 26 20:46:12 D7510DC ollama[2730]: net/fd_unix.go:172 +0x29 fp=0xc00038f970 sp=0xc00038f8b8 pc=0x58058daba249
Aug 26 20:46:12 D7510DC ollama[2730]: net.(*TCPListener).accept(0xc000728100)
Aug 26 20:46:12 D7510DC ollama[2730]: net/tcpsock_posix.go:159 +0x1b fp=0xc00038f9c0 sp=0xc00038f970 pc=0x58058dacfbfb
Aug 26 20:46:12 D7510DC ollama[2730]: net.(*TCPListener).Accept(0xc000728100)
Aug 26 20:46:12 D7510DC ollama[2730]: net/tcpsock.go:380 +0x30 fp=0xc00038f9f0 sp=0xc00038f9c0 pc=0x58058daceab0
Aug 26 20:46:12 D7510DC ollama[2730]: net/http.(*onceCloseListener).Accept(0xc0004c6090?)
Aug 26 20:46:12 D7510DC ollama[2730]: <autogenerated>:1 +0x24 fp=0xc00038fa08 sp=0xc00038f9f0 pc=0x58058dce6204
Aug 26 20:46:12 D7510DC ollama[2730]: net/http.(*Server).Serve(0xc0004ac100, {0x58058ee5fc08, 0xc000728100})
Aug 26 20:46:12 D7510DC ollama[2730]: net/http/server.go:3424 +0x30c fp=0xc00038fb38 sp=0xc00038fa08 pc=0x58058dcbdacc
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/ollama/ollama/runner/llamarunner.Execute({0xc000034260, 0x4, 0x4})
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/ollama/ollama/runner/llamarunner/runner.go:901 +0x8f5 fp=0xc00038fd08 sp=0xc00038fb38 pc=0x58058de35795
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/ollama/ollama/runner.Execute({0xc000034250?, 0x0?, 0x0?})
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/ollama/ollama/runner/runner.go:22 +0xd4 fp=0xc00038fd30 sp=0xc00038fd08 pc=0x58058debfd34
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/ollama/ollama/cmd.NewCLI.func2(0xc0001f1400?, {0x58058e97b081?, 0x4?, 0x58058e97b085?})
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/ollama/ollama/cmd/cmd.go:1583 +0x45 fp=0xc00038fd58 sp=0xc00038fd30 pc=0x58058e624f65
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/spf13/cobra.(*Command).execute(0xc00014cf08, {0xc0006967c0, 0x4, 0x4})
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/spf13/cobra@v1.7.0/command.go:940 +0x85c fp=0xc00038fe78 sp=0xc00038fd58 pc=0x58058db3389c
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/spf13/cobra.(*Command).ExecuteC(0xc00048ef08)
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/spf13/cobra@v1.7.0/command.go:1068 +0x3a5 fp=0xc00038ff30 sp=0xc00038fe78 pc=0x58058db340e5
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/spf13/cobra.(*Command).Execute(...)
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/spf13/cobra@v1.7.0/command.go:992
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/spf13/cobra.(*Command).ExecuteContext(...)
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/spf13/cobra@v1.7.0/command.go:985
Aug 26 20:46:12 D7510DC ollama[2730]: main.main()
Aug 26 20:46:12 D7510DC ollama[2730]: github.com/ollama/ollama/main.go:12 +0x4d fp=0xc00038ff50 sp=0xc00038ff30 pc=0x58058e625a4d
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.main()
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/proc.go:283 +0x29d fp=0xc00038ffe0 sp=0xc00038ff50 pc=0x58058d9879dd
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.goexit({})
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/asm_amd64.s:1700 +0x1 fp=0xc00038ffe8 sp=0xc00038ffe0 pc=0x58058d9c2fa1
Aug 26 20:46:12 D7510DC ollama[2730]: goroutine 2 gp=0xc000002e00 m=nil [force gc (idle)]:
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/proc.go:435 +0xce fp=0xc000072fa8 sp=0xc000072f88 pc=0x58058d9bb86e
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.goparkunlock(...)
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/proc.go:441
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.forcegchelper()
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/proc.go:348 +0xb8 fp=0xc000072fe0 sp=0xc000072fa8 pc=0x58058d987d18
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.goexit({})
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/asm_amd64.s:1700 +0x1 fp=0xc000072fe8 sp=0xc000072fe0 pc=0x58058d9c2fa1
Aug 26 20:46:12 D7510DC ollama[2730]: created by runtime.init.7 in goroutine 1
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/proc.go:336 +0x1a
Aug 26 20:46:12 D7510DC ollama[2730]: goroutine 3 gp=0xc000003340 m=nil [GC sweep wait]:
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gopark(0x1?, 0x0?, 0x0?, 0x0?, 0x0?)
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/proc.go:435 +0xce fp=0xc000073780 sp=0xc000073760 pc=0x58058d9bb86e
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.goparkunlock(...)
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/proc.go:441
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.bgsweep(0xc00007e000)
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/mgcsweep.go:316 +0xdf fp=0xc0000737c8 sp=0xc000073780 pc=0x58058d9724bf
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gcenable.gowrap1()
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/mgc.go:204 +0x25 fp=0xc0000737e0 sp=0xc0000737c8 pc=0x58058d9668a5
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.goexit({})
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/asm_amd64.s:1700 +0x1 fp=0xc0000737e8 sp=0xc0000737e0 pc=0x58058d9c2fa1
Aug 26 20:46:12 D7510DC ollama[2730]: created by runtime.gcenable in goroutine 1
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/mgc.go:204 +0x66
Aug 26 20:46:12 D7510DC ollama[2730]: goroutine 4 gp=0xc000003500 m=nil [GC scavenge wait]:
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gopark(0x10000?, 0x58058eb3eb70?, 0x0?, 0x0?, 0x0?)
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/proc.go:435 +0xce fp=0xc000073f78 sp=0xc000073f58 pc=0x58058d9bb86e
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.goparkunlock(...)
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/proc.go:441
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.(*scavengerState).park(0x58058f701f00)
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/mgcscavenge.go:425 +0x49 fp=0xc000073fa8 sp=0xc000073f78 pc=0x58058d96ff09
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.bgscavenge(0xc00007e000)
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/mgcscavenge.go:658 +0x59 fp=0xc000073fc8 sp=0xc000073fa8 pc=0x58058d970499
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gcenable.gowrap2()
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/mgc.go:205 +0x25 fp=0xc000073fe0 sp=0xc000073fc8 pc=0x58058d966845
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.goexit({})
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/asm_amd64.s:1700 +0x1 fp=0xc000073fe8 sp=0xc000073fe0 pc=0x58058d9c2fa1
Aug 26 20:46:12 D7510DC ollama[2730]: created by runtime.gcenable in goroutine 1
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/mgc.go:205 +0xa5
Aug 26 20:46:12 D7510DC ollama[2730]: goroutine 5 gp=0xc000003dc0 m=nil [finalizer wait]:
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gopark(0x0?, 0x58058ee4cac0?, 0x0?, 0x60?, 0x1000000010?)
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/proc.go:435 +0xce fp=0xc000072630 sp=0xc000072610 pc=0x58058d9bb86e
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.runfinq()
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/mfinal.go:196 +0x107 fp=0xc0000727e0 sp=0xc000072630 pc=0x58058d965867
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.goexit({})
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/asm_amd64.s:1700 +0x1 fp=0xc0000727e8 sp=0xc0000727e0 pc=0x58058d9c2fa1
Aug 26 20:46:12 D7510DC ollama[2730]: created by runtime.createfing in goroutine 1
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/mfinal.go:166 +0x3d
Aug 26 20:46:12 D7510DC ollama[2730]: goroutine 6 gp=0xc0001d08c0 m=nil [chan receive]:
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gopark(0xc000223540?, 0xc000312018?, 0x60?, 0x47?, 0x58058daa0e88?)
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/proc.go:435 +0xce fp=0xc000074718 sp=0xc0000746f8 pc=0x58058d9bb86e
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.chanrecv(0xc00003c380, 0x0, 0x1)
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/chan.go:664 +0x445 fp=0xc000074790 sp=0xc000074718 pc=0x58058d957245
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.chanrecv1(0x0?, 0x0?)
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/chan.go:506 +0x12 fp=0xc0000747b8 sp=0xc000074790 pc=0x58058d956dd2
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.unique_runtime_registerUniqueMapCleanup.func2(...)
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/mgc.go:1796
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.unique_runtime_registerUniqueMapCleanup.gowrap1()
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/mgc.go:1799 +0x2f fp=0xc0000747e0 sp=0xc0000747b8 pc=0x58058d969a4f
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.goexit({})
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/asm_amd64.s:1700 +0x1 fp=0xc0000747e8 sp=0xc0000747e0 pc=0x58058d9c2fa1
Aug 26 20:46:12 D7510DC ollama[2730]: created by unique.runtime_registerUniqueMapCleanup in goroutine 1
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/mgc.go:1794 +0x85
Aug 26 20:46:12 D7510DC ollama[2730]: goroutine 7 gp=0xc0001d0e00 m=nil [GC worker (idle)]:
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/proc.go:435 +0xce fp=0xc000074f38 sp=0xc000074f18 pc=0x58058d9bb86e
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gcBgMarkWorker(0xc00003d7a0)
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/mgc.go:1423 +0xe9 fp=0xc000074fc8 sp=0xc000074f38 pc=0x58058d968d69
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gcBgMarkStartWorkers.gowrap1()
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/mgc.go:1339 +0x25 fp=0xc000074fe0 sp=0xc000074fc8 pc=0x58058d968c45
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.goexit({})
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/asm_amd64.s:1700 +0x1 fp=0xc000074fe8 sp=0xc000074fe0 pc=0x58058d9c2fa1
Aug 26 20:46:12 D7510DC ollama[2730]: created by runtime.gcBgMarkStartWorkers in goroutine 1
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/mgc.go:1339 +0x105
Aug 26 20:46:12 D7510DC ollama[2730]: goroutine 18 gp=0xc000504000 m=nil [GC worker (idle)]:
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gopark(0x8e84710ceb0?, 0x0?, 0x0?, 0x0?, 0x0?)
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/proc.go:435 +0xce fp=0xc00006e738 sp=0xc00006e718 pc=0x58058d9bb86e
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gcBgMarkWorker(0xc00003d7a0)
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/mgc.go:1423 +0xe9 fp=0xc00006e7c8 sp=0xc00006e738 pc=0x58058d968d69
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gcBgMarkStartWorkers.gowrap1()
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/mgc.go:1339 +0x25 fp=0xc00006e7e0 sp=0xc00006e7c8 pc=0x58058d968c45
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.goexit({})
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/asm_amd64.s:1700 +0x1 fp=0xc00006e7e8 sp=0xc00006e7e0 pc=0x58058d9c2fa1
Aug 26 20:46:12 D7510DC ollama[2730]: created by runtime.gcBgMarkStartWorkers in goroutine 1
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/mgc.go:1339 +0x105
Aug 26 20:46:12 D7510DC ollama[2730]: goroutine 34 gp=0xc000102380 m=nil [GC worker (idle)]:
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gopark(0x8e8471028f7?, 0x3?, 0x27?, 0x1f?, 0x0?)
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/proc.go:435 +0xce fp=0xc00011a738 sp=0xc00011a718 pc=0x58058d9bb86e
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gcBgMarkWorker(0xc00003d7a0)
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/mgc.go:1423 +0xe9 fp=0xc00011a7c8 sp=0xc00011a738 pc=0x58058d968d69
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gcBgMarkStartWorkers.gowrap1()
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/mgc.go:1339 +0x25 fp=0xc00011a7e0 sp=0xc00011a7c8 pc=0x58058d968c45
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.goexit({})
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/asm_amd64.s:1700 +0x1 fp=0xc00011a7e8 sp=0xc00011a7e0 pc=0x58058d9c2fa1
Aug 26 20:46:12 D7510DC ollama[2730]: created by runtime.gcBgMarkStartWorkers in goroutine 1
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/mgc.go:1339 +0x105
Aug 26 20:46:12 D7510DC ollama[2730]: goroutine 35 gp=0xc000102540 m=nil [GC worker (idle)]:
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gopark(0x8e8470ed3a4?, 0x0?, 0x0?, 0x0?, 0x0?)
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/proc.go:435 +0xce fp=0xc00011af38 sp=0xc00011af18 pc=0x58058d9bb86e
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gcBgMarkWorker(0xc00003d7a0)
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/mgc.go:1423 +0xe9 fp=0xc00011afc8 sp=0xc00011af38 pc=0x58058d968d69
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gcBgMarkStartWorkers.gowrap1()
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/mgc.go:1339 +0x25 fp=0xc00011afe0 sp=0xc00011afc8 pc=0x58058d968c45
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.goexit({})
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/asm_amd64.s:1700 +0x1 fp=0xc00011afe8 sp=0xc00011afe0 pc=0x58058d9c2fa1
Aug 26 20:46:12 D7510DC ollama[2730]: created by runtime.gcBgMarkStartWorkers in goroutine 1
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/mgc.go:1339 +0x105
Aug 26 20:46:12 D7510DC ollama[2730]: goroutine 8 gp=0xc0001d0fc0 m=nil [GC worker (idle)]:
Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gopark(0x8e8470ef77d?, 0x0?, 0x0?, 0x0?, 0x0?)
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/proc.go:435 +0xce fp=0xc000075738 sp=0xc000075718 pc=0x58058d9bb86e Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gcBgMarkWorker(0xc00003d7a0) Aug 26 20:46:12 D7510DC ollama[2730]: runtime/mgc.go:1423 +0xe9 fp=0xc0000757c8 sp=0xc000075738 pc=0x58058d968d69 Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gcBgMarkStartWorkers.gowrap1() Aug 26 20:46:12 D7510DC ollama[2730]: runtime/mgc.go:1339 +0x25 fp=0xc0000757e0 sp=0xc0000757c8 pc=0x58058d968c45 Aug 26 20:46:12 D7510DC ollama[2730]: runtime.goexit({}) Aug 26 20:46:12 D7510DC ollama[2730]: runtime/asm_amd64.s:1700 +0x1 fp=0xc0000757e8 sp=0xc0000757e0 pc=0x58058d9c2fa1 Aug 26 20:46:12 D7510DC ollama[2730]: created by runtime.gcBgMarkStartWorkers in goroutine 1 Aug 26 20:46:12 D7510DC ollama[2730]: runtime/mgc.go:1339 +0x105 Aug 26 20:46:12 D7510DC ollama[2730]: goroutine 19 gp=0xc0005041c0 m=nil [GC worker (idle)]: Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gopark(0x8e8470ec42b?, 0x0?, 0x0?, 0x0?, 0x0?) 
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/proc.go:435 +0xce fp=0xc00006ef38 sp=0xc00006ef18 pc=0x58058d9bb86e Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gcBgMarkWorker(0xc00003d7a0) Aug 26 20:46:12 D7510DC ollama[2730]: runtime/mgc.go:1423 +0xe9 fp=0xc00006efc8 sp=0xc00006ef38 pc=0x58058d968d69 Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gcBgMarkStartWorkers.gowrap1() Aug 26 20:46:12 D7510DC ollama[2730]: runtime/mgc.go:1339 +0x25 fp=0xc00006efe0 sp=0xc00006efc8 pc=0x58058d968c45 Aug 26 20:46:12 D7510DC ollama[2730]: runtime.goexit({}) Aug 26 20:46:12 D7510DC ollama[2730]: runtime/asm_amd64.s:1700 +0x1 fp=0xc00006efe8 sp=0xc00006efe0 pc=0x58058d9c2fa1 Aug 26 20:46:12 D7510DC ollama[2730]: created by runtime.gcBgMarkStartWorkers in goroutine 1 Aug 26 20:46:12 D7510DC ollama[2730]: runtime/mgc.go:1339 +0x105 Aug 26 20:46:12 D7510DC ollama[2730]: goroutine 20 gp=0xc000504380 m=nil [GC worker (idle)]: Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gopark(0x8e8470eef41?, 0x0?, 0x0?, 0x0?, 0x0?) 
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/proc.go:435 +0xce fp=0xc00006f738 sp=0xc00006f718 pc=0x58058d9bb86e Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gcBgMarkWorker(0xc00003d7a0) Aug 26 20:46:12 D7510DC ollama[2730]: runtime/mgc.go:1423 +0xe9 fp=0xc00006f7c8 sp=0xc00006f738 pc=0x58058d968d69 Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gcBgMarkStartWorkers.gowrap1() Aug 26 20:46:12 D7510DC ollama[2730]: runtime/mgc.go:1339 +0x25 fp=0xc00006f7e0 sp=0xc00006f7c8 pc=0x58058d968c45 Aug 26 20:46:12 D7510DC ollama[2730]: runtime.goexit({}) Aug 26 20:46:12 D7510DC ollama[2730]: runtime/asm_amd64.s:1700 +0x1 fp=0xc00006f7e8 sp=0xc00006f7e0 pc=0x58058d9c2fa1 Aug 26 20:46:12 D7510DC ollama[2730]: created by runtime.gcBgMarkStartWorkers in goroutine 1 Aug 26 20:46:12 D7510DC ollama[2730]: runtime/mgc.go:1339 +0x105 Aug 26 20:46:12 D7510DC ollama[2730]: goroutine 21 gp=0xc000504540 m=nil [GC worker (idle)]: Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gopark(0x8e8470ee258?, 0x0?, 0x0?, 0x0?, 0x0?) 
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/proc.go:435 +0xce fp=0xc00006ff38 sp=0xc00006ff18 pc=0x58058d9bb86e Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gcBgMarkWorker(0xc00003d7a0) Aug 26 20:46:12 D7510DC ollama[2730]: runtime/mgc.go:1423 +0xe9 fp=0xc00006ffc8 sp=0xc00006ff38 pc=0x58058d968d69 Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gcBgMarkStartWorkers.gowrap1() Aug 26 20:46:12 D7510DC ollama[2730]: runtime/mgc.go:1339 +0x25 fp=0xc00006ffe0 sp=0xc00006ffc8 pc=0x58058d968c45 Aug 26 20:46:12 D7510DC ollama[2730]: runtime.goexit({}) Aug 26 20:46:12 D7510DC ollama[2730]: runtime/asm_amd64.s:1700 +0x1 fp=0xc00006ffe8 sp=0xc00006ffe0 pc=0x58058d9c2fa1 Aug 26 20:46:12 D7510DC ollama[2730]: created by runtime.gcBgMarkStartWorkers in goroutine 1 Aug 26 20:46:12 D7510DC ollama[2730]: runtime/mgc.go:1339 +0x105 Aug 26 20:46:12 D7510DC ollama[2730]: goroutine 10 gp=0xc0005048c0 m=nil [select]: Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gopark(0xc000049a78?, 0x2?, 0xa?, 0x0?, 0xc0000498c4?) Aug 26 20:46:12 D7510DC ollama[2730]: runtime/proc.go:435 +0xce fp=0xc0000496f8 sp=0xc0000496d8 pc=0x58058d9bb86e Aug 26 20:46:12 D7510DC ollama[2730]: runtime.selectgo(0xc000049a78, 0xc0000498c0, 0x12?, 0x0, 0x1?, 0x1) Aug 26 20:46:12 D7510DC ollama[2730]: runtime/select.go:351 +0x837 fp=0xc000049830 sp=0xc0000496f8 pc=0x58058d999ed7 Aug 26 20:46:12 D7510DC ollama[2730]: github.com/ollama/ollama/runner/llamarunner.(*Server).completion(0xc0002e05a0, {0x58058ee5fde8, 0xc0006269a0}, 0xc00016e780) Aug 26 20:46:12 D7510DC ollama[2730]: github.com/ollama/ollama/runner/llamarunner/runner.go:629 +0xb37 fp=0xc000049ac0 sp=0xc000049830 pc=0x58058de32c37 Aug 26 20:46:12 D7510DC ollama[2730]: github.com/ollama/ollama/runner/llamarunner.(*Server).completion-fm({0x58058ee5fde8?, 0xc0006269a0?}, 0xc000049b40?) 
Aug 26 20:46:12 D7510DC ollama[2730]: <autogenerated>:1 +0x36 fp=0xc000049af0 sp=0xc000049ac0 pc=0x58058de35e16 Aug 26 20:46:12 D7510DC ollama[2730]: net/http.HandlerFunc.ServeHTTP(0xc0007260c0?, {0x58058ee5fde8?, 0xc0006269a0?}, 0xc000049b60?) Aug 26 20:46:12 D7510DC ollama[2730]: net/http/server.go:2294 +0x29 fp=0xc000049b18 sp=0xc000049af0 pc=0x58058dcba109 Aug 26 20:46:12 D7510DC ollama[2730]: net/http.(*ServeMux).ServeHTTP(0x58058d95fd85?, {0x58058ee5fde8, 0xc0006269a0}, 0xc00016e780) Aug 26 20:46:12 D7510DC ollama[2730]: net/http/server.go:2822 +0x1c4 fp=0xc000049b68 sp=0xc000049b18 pc=0x58058dcbc004 Aug 26 20:46:12 D7510DC ollama[2730]: net/http.serverHandler.ServeHTTP({0x58058ee5c470?}, {0x58058ee5fde8?, 0xc0006269a0?}, 0x1?) Aug 26 20:46:12 D7510DC ollama[2730]: net/http/server.go:3301 +0x8e fp=0xc000049b98 sp=0xc000049b68 pc=0x58058dcd9a8e Aug 26 20:46:12 D7510DC ollama[2730]: net/http.(*conn).serve(0xc0004c6090, {0x58058ee62088, 0xc000706570}) Aug 26 20:46:12 D7510DC ollama[2730]: net/http/server.go:2102 +0x625 fp=0xc000049fb8 sp=0xc000049b98 pc=0x58058dcb8605 Aug 26 20:46:12 D7510DC ollama[2730]: net/http.(*Server).Serve.gowrap3() Aug 26 20:46:12 D7510DC ollama[2730]: net/http/server.go:3454 +0x28 fp=0xc000049fe0 sp=0xc000049fb8 pc=0x58058dcbdec8 Aug 26 20:46:12 D7510DC ollama[2730]: runtime.goexit({}) Aug 26 20:46:12 D7510DC ollama[2730]: runtime/asm_amd64.s:1700 +0x1 fp=0xc000049fe8 sp=0xc000049fe0 pc=0x58058d9c2fa1 Aug 26 20:46:12 D7510DC ollama[2730]: created by net/http.(*Server).Serve in goroutine 1 Aug 26 20:46:12 D7510DC ollama[2730]: net/http/server.go:3454 +0x485 Aug 26 20:46:12 D7510DC ollama[2730]: goroutine 45 gp=0xc0001d1880 m=nil [IO wait]: Aug 26 20:46:12 D7510DC ollama[2730]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0xb?) Aug 26 20:46:12 D7510DC ollama[2730]: runtime/proc.go:435 +0xce fp=0xc000117dd8 sp=0xc000117db8 pc=0x58058d9bb86e Aug 26 20:46:12 D7510DC ollama[2730]: runtime.netpollblock(0x58058d9debd8?, 0x8d954666?, 0x5?) 
Aug 26 20:46:12 D7510DC ollama[2730]: runtime/netpoll.go:575 +0xf7 fp=0xc000117e10 sp=0xc000117dd8 pc=0x58058d980357 Aug 26 20:46:12 D7510DC ollama[2730]: internal/poll.runtime_pollWait(0x78dbafca7d98, 0x72) Aug 26 20:46:12 D7510DC ollama[2730]: runtime/netpoll.go:351 +0x85 fp=0xc000117e30 sp=0xc000117e10 pc=0x58058d9baa85 Aug 26 20:46:12 D7510DC ollama[2730]: internal/poll.(*pollDesc).wait(0xc00061a380?, 0xc0000b2431?, 0x0) Aug 26 20:46:12 D7510DC ollama[2730]: internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc000117e58 sp=0xc000117e30 pc=0x58058da41ec7 Aug 26 20:46:12 D7510DC ollama[2730]: internal/poll.(*pollDesc).waitRead(...) Aug 26 20:46:12 D7510DC ollama[2730]: internal/poll/fd_poll_runtime.go:89 Aug 26 20:46:12 D7510DC ollama[2730]: internal/poll.(*FD).Read(0xc00061a380, {0xc0000b2431, 0x1, 0x1}) Aug 26 20:46:12 D7510DC ollama[2730]: internal/poll/fd_unix.go:165 +0x27a fp=0xc000117ef0 sp=0xc000117e58 pc=0x58058da431ba Aug 26 20:46:12 D7510DC ollama[2730]: net.(*netFD).Read(0xc00061a380, {0xc0000b2431?, 0xc0000515d8?, 0xc000117f70?}) Aug 26 20:46:12 D7510DC ollama[2730]: net/fd_posix.go:55 +0x25 fp=0xc000117f38 sp=0xc000117ef0 pc=0x58058dab82a5 Aug 26 20:46:12 D7510DC ollama[2730]: net.(*conn).Read(0xc000604050, {0xc0000b2431?, 0x0?, 0x0?}) Aug 26 20:46:12 D7510DC ollama[2730]: net/net.go:194 +0x45 fp=0xc000117f80 sp=0xc000117f38 pc=0x58058dac6665 Aug 26 20:46:12 D7510DC ollama[2730]: net/http.(*connReader).backgroundRead(0xc0000b2420) Aug 26 20:46:12 D7510DC ollama[2730]: net/http/server.go:690 +0x37 fp=0xc000117fc8 sp=0xc000117f80 pc=0x58058dcb24d7 Aug 26 20:46:12 D7510DC ollama[2730]: net/http.(*connReader).startBackgroundRead.gowrap2() Aug 26 20:46:12 D7510DC ollama[2730]: net/http/server.go:686 +0x25 fp=0xc000117fe0 sp=0xc000117fc8 pc=0x58058dcb2405 Aug 26 20:46:12 D7510DC ollama[2730]: runtime.goexit({}) Aug 26 20:46:12 D7510DC ollama[2730]: runtime/asm_amd64.s:1700 +0x1 fp=0xc000117fe8 sp=0xc000117fe0 pc=0x58058d9c2fa1 Aug 26 20:46:12 D7510DC 
ollama[2730]: created by net/http.(*connReader).startBackgroundRead in goroutine 10 Aug 26 20:46:12 D7510DC ollama[2730]: net/http/server.go:686 +0xb6 Aug 26 20:46:12 D7510DC ollama[2730]: rax 0x0 Aug 26 20:46:12 D7510DC ollama[2730]: rbx 0x499a Aug 26 20:46:12 D7510DC ollama[2730]: rcx 0x78dbafa9eb2c Aug 26 20:46:12 D7510DC ollama[2730]: rdx 0x6 Aug 26 20:46:12 D7510DC ollama[2730]: rdi 0x4996 Aug 26 20:46:12 D7510DC ollama[2730]: rsi 0x499a Aug 26 20:46:12 D7510DC ollama[2730]: rbp 0x78db680f71d0 Aug 26 20:46:12 D7510DC ollama[2730]: rsp 0x78db680f7190 Aug 26 20:46:12 D7510DC ollama[2730]: r8 0x0 Aug 26 20:46:12 D7510DC ollama[2730]: r9 0x7 Aug 26 20:46:12 D7510DC ollama[2730]: r10 0x8 Aug 26 20:46:12 D7510DC ollama[2730]: r11 0x246 Aug 26 20:46:12 D7510DC ollama[2730]: r12 0x6 Aug 26 20:46:12 D7510DC ollama[2730]: r13 0x78daf4914420 Aug 26 20:46:12 D7510DC ollama[2730]: r14 0x16 Aug 26 20:46:12 D7510DC ollama[2730]: r15 0xbbf2c0000 Aug 26 20:46:12 D7510DC ollama[2730]: rip 0x78dbafa9eb2c Aug 26 20:46:12 D7510DC ollama[2730]: rflags 0x246 Aug 26 20:46:12 D7510DC ollama[2730]: cs 0x33 Aug 26 20:46:12 D7510DC ollama[2730]: fs 0x0 Aug 26 20:46:12 D7510DC ollama[2730]: gs 0x0 Aug 26 20:46:12 D7510DC ollama[2730]: time=2025-08-26T20:46:12.895-04:00 level=ERROR source=server.go:1439 msg="post predict" error="Post \"http://127.0.0.1:32793/completion\": EOF" Aug 26 20:46:12 D7510DC ollama[2730]: [GIN] 2025/08/26 - 20:46:12 | 200 | 767.998503ms | 127.0.0.1 | POST "/api/chat" Aug 26 20:46:12 D7510DC ollama[2730]: time=2025-08-26T20:46:12.914-04:00 level=ERROR source=server.go:409 msg="llama runner terminated" error="exit status 2" ``` ### OS Linux ### GPU Nvidia ### CPU Intel ### Ollama version 0.11.7
GiteaMirror added the bug label 2026-04-29 06:18:51 -05:00
Author
Owner

@rick-github commented on GitHub (Aug 27, 2025):

```
Aug 26 20:46:03 D7510DC ollama[2730]: time=2025-08-26T20:46:03.306-04:00 level=INFO source=server.go:528 msg=offload library=cuda layers.requested=-1 layers.model=33 layers.offload=32 layers.split=[32] memory.available="[3.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="4.2 GiB" memory.required.partial="3.8 GiB" memory.required.kv="512.0 MiB" memory.required.allocations="[3.8 GiB]" memory.weights.total="2.9 GiB" memory.weights.repeating="2.5 GiB" memory.weights.nonrepeating="480.8 MiB" memory.graph.full="256.0 MiB" memory.graph.partial="256.0 MiB"
```

3.9 GiB was available and ollama allocated 3.8 GiB. It looks like the model crashed after loading, so some transient allocation exceeded the available VRAM and the runner OOMed. There are ways to mitigate OOMs shown [here](https://github.com/ollama/ollama/issues/8597#issuecomment-2614533288), or you could try setting `OLLAMA_NEW_ESTIMATES=1` in the server environment to switch over to the new memory management system, which may do a better job of estimating memory requirements.
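For a systemd install like the one in the logs above, the variable can be set with a drop-in override. A minimal sketch (equivalent to running `sudo systemctl edit ollama.service` and adding the `Environment=` line; paths assume the standard `ollama.service` unit):

```shell
# Create a drop-in override for the ollama unit.
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf >/dev/null <<'EOF'
[Service]
Environment="OLLAMA_NEW_ESTIMATES=1"
EOF

# Reload unit files and restart the server so the variable takes effect.
sudo systemctl daemon-reload
sudo systemctl restart ollama

# Verify the running server actually picked it up.
sudo cat /proc/$(pgrep -f 'ollama serve')/environ | tr '\0' '\n' | grep OLLAMA_NEW_ESTIMATES
```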

<!-- gh-comment-id:3226338378 -->
Author
Owner

@BiggRanger commented on GitHub (Aug 27, 2025):

Thanks, that makes sense. It used to run fine in older versions (prior to 0.11.5, I believe). I tested using the command `/set parameter num_gpu 0` in the prompt to switch to CPU only, and the Phi models work now.

I would still call this a bug though since I believe ollama should be handling memory budgets and switching from GPU to CPU in situations like this.

<!-- gh-comment-id:3226398919 -->
Author
Owner

@rick-github commented on GitHub (Aug 27, 2025):

Memory estimation has always been a pain point, hence the new memory management system.

<!-- gh-comment-id:3226404379 -->
Author
Owner

@BiggRanger commented on GitHub (Aug 27, 2025):

It's still not happy with `OLLAMA_NEW_ESTIMATES=1`:

```
dclark@D7510DC:~$ sudo systemctl edit ollama.service
[sudo] password for dclark: 
Successfully installed edited file '/etc/systemd/system/ollama.service.d/override.conf'.
dclark@D7510DC:~$ sudo systemctl daemon-reload
dclark@D7510DC:~$ sudo systemctl restart ollama

dclark@D7510DC:~$ ps aux | grep 'ollama serve'
ollama     23650  3.4  0.4 48631288 132372 ?     Ssl  21:35   0:01 /usr/local/bin/ollama serve
dclark     23734  0.0  0.0   9280  2176 pts/1    S+   21:36   0:00 grep --color=auto ollama serve
dclark@D7510DC:~$ sudo cat /proc/23650/environ | tr '\0' '\n' | grep 'OLLAMA_NEW_ESTIMATES'
OLLAMA_NEW_ESTIMATES=1

dclark@D7510DC:~$ ollama run phi4-mini-reasoning:3.8b
>>> Hello
Error: model runner has unexpectedly stopped, this may be due to resource limitations or an internal error, check ollama server logs for details
```
<!-- gh-comment-id:3226430750 -->
Author
Owner

@rick-github commented on GitHub (Aug 27, 2025):

Server log?

<!-- gh-comment-id:3226481326 -->
Author
Owner

@czj942650673 commented on GitHub (Aug 27, 2025):

```
time=2025-08-27T14:18:53.747+08:00 level=INFO source=routes.go:1331 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:8192 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\Users\94265\.ollama\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NEW_ESTIMATES:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-08-27T14:18:53.751+08:00 level=INFO source=images.go:477 msg="total blobs: 5"
time=2025-08-27T14:18:53.752+08:00 level=INFO source=images.go:484 msg="total unused blobs removed: 0"
time=2025-08-27T14:18:53.752+08:00 level=INFO source=routes.go:1384 msg="Listening on 127.0.0.1:11434 (version 0.11.7)"
time=2025-08-27T14:18:53.752+08:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-08-27T14:18:53.752+08:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-08-27T14:18:53.752+08:00 level=INFO source=gpu_windows.go:183 msg="efficiency cores detected" maxEfficiencyClass=1
time=2025-08-27T14:18:53.752+08:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=24 efficiency=16 threads=24
time=2025-08-27T14:18:53.846+08:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-a5d2d457-1c84-896e-3d55-359d99f0b824 library=cuda variant=v12 compute=12.0 driver=12.9 name="NVIDIA GeForce RTX 5080 Laptop GPU" total="15.9 GiB" available="14.6 GiB"
time=2025-08-27T14:18:53.846+08:00 level=INFO source=routes.go:1425 msg="entering low vram mode" "total vram"="15.9 GiB" threshold="20.0 GiB"
[GIN] 2025/08/27 - 14:18:55 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/08/27 - 14:18:55 | 200 | 28.2318ms | 127.0.0.1 | POST "/api/show"
time=2025-08-27T14:18:55.399+08:00 level=INFO source=server.go:383 msg="starting runner" cmd="C:\Users\94265\AppData\Local\Programs\Ollama\ollama.exe runner --ollama-engine --model C:\Users\94265\.ollama\models\blobs\sha256-5cc5f5f9e4a4f024a707817dc2dba728a9c8574b47ce05a82f68d9c886ca664e --port 54346"
time=2025-08-27T14:18:55.424+08:00 level=INFO source=server.go:488 msg="system memory" total="31.4 GiB" free="16.2 GiB" free_swap="19.1 GiB"
time=2025-08-27T14:18:55.425+08:00 level=INFO source=server.go:528 msg=offload library=cuda layers.requested=-1 layers.model=37 layers.offload=36 layers.split=[36] memory.available="[7.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="9.5 GiB" memory.required.partial="6.1 GiB" memory.required.kv="144.0 MiB" memory.required.allocations="[6.1 GiB]" memory.weights.total="5.7 GiB" memory.weights.repeating="5.2 GiB" memory.weights.nonrepeating="593.5 MiB" memory.graph.full="192.0 MiB" memory.graph.partial="192.0 MiB" projector.weights="1.2 GiB" projector.graph="1.6 GiB"
time=2025-08-27T14:18:55.434+08:00 level=INFO source=runner.go:1006 msg="starting ollama engine"
time=2025-08-27T14:18:55.438+08:00 level=INFO source=runner.go:1043 msg="Server listening on 127.0.0.1:54346"
time=2025-08-27T14:18:55.447+08:00 level=INFO source=runner.go:925 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:4096 KvCacheType: NumThreads:8 GPULayers:36[ID:GPU-a5d2d457-1c84-896e-3d55-359d99f0b824 Layers:36(0..35)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-08-27T14:18:55.463+08:00 level=INFO source=ggml.go:130 msg="" architecture=qwen25vl file_type=F16 name="" description="" num_tensors=953 num_key_values=36
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5080 Laptop GPU, compute capability 12.0, VMM: yes, ID: GPU-a5d2d457-1c84-896e-3d55-359d99f0b824
load_backend: loaded CUDA backend from C:\Users\94265\AppData\Local\Programs\Ollama\lib\ollama\ggml-cuda.dll
load_backend: loaded CPU backend from C:\Users\94265\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
time=2025-08-27T14:18:55.562+08:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-08-27T14:18:55.785+08:00 level=INFO source=ggml.go:486 msg="offloading 36 repeating layers to GPU"
time=2025-08-27T14:18:55.785+08:00 level=INFO source=ggml.go:490 msg="offloading output layer to CPU"
time=2025-08-27T14:18:55.785+08:00 level=INFO source=ggml.go:497 msg="offloaded 36/37 layers to GPU"
time=2025-08-27T14:18:55.786+08:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA0 size="5.2 GiB"
time=2025-08-27T14:18:55.786+08:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="2.4 GiB"
time=2025-08-27T14:18:55.786+08:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA0 size="144.0 MiB"
time=2025-08-27T14:18:55.786+08:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA0 size="364.0 MiB"
time=2025-08-27T14:18:55.786+08:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="1.6 GiB"
time=2025-08-27T14:18:55.786+08:00 level=INFO source=backend.go:342 msg="total memory" size="9.6 GiB"
time=2025-08-27T14:18:55.786+08:00 level=INFO source=sched.go:473 msg="loaded runners" count=1
time=2025-08-27T14:18:55.786+08:00 level=INFO source=server.go:1231 msg="waiting for llama runner to start responding"
time=2025-08-27T14:18:55.786+08:00 level=INFO source=server.go:1265 msg="waiting for server to become available" status="llm server loading model"
time=2025-08-27T14:18:57.037+08:00 level=INFO source=server.go:1269 msg="llama runner started in 1.64 seconds"
[GIN] 2025/08/27 - 14:18:57 | 200 | 1.7474096s | 127.0.0.1 | POST "/api/generate"
C:/a/ollama/ollama/ml/backend/ggml/ggml/src/ggml-cpu/ops.cpp:6930: fatal error
C:/a/ollama/ollama/ml/backend/ggml/ggml/src/ggml-cpu/ops.cpp:6930: fatal error
C:/a/ollama/ollama/ml/backend/ggml/ggml/src/ggml-cpu/ops.cpp:6930: fatal error
time=2025-08-27T14:19:01.474+08:00 level=ERROR source=server.go:1439 msg="post predict" error="Post "http://127.0.0.1:54346/completion": read tcp 127.0.0.1:54349->127.0.0.1:54346: wsarecv: An existing connection was forcibly closed by the remote host."
[GIN] 2025/08/27 - 14:19:01 | 200 | 225.3724ms | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/08/27 - 14:19:05 | 200 | 1.0429ms | 127.0.0.1 | GET "/api/tags"
[GIN] 2025/08/27 - 14:19:10 | 200 | 527.3µs | 127.0.0.1 | GET "/api/tags"
[GIN] 2025/08/27 - 14:22:02 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/08/27 - 14:22:02 | 200 | 44.9292ms | 127.0.0.1 | POST "/api/show"
time=2025-08-27T14:22:07.174+08:00 level=WARN source=sched.go:652 msg="gpu VRAM usage didn't recover within timeout" seconds=5.0165719 runner.size="9.5 GiB" runner.vram="6.1 GiB" runner.parallel=1 runner.pid=8464 runner.model=C:\Users\94265\.ollama\models\blobs\sha256-5cc5f5f9e4a4f024a707817dc2dba728a9c8574b47ce05a82f68d9c886ca664e
time=2025-08-27T14:22:07.281+08:00 level=INFO source=server.go:383 msg="starting runner" cmd="C:\Users\94265\AppData\Local\Programs\Ollama\ollama.exe runner --ollama-engine --model C:\Users\94265\.ollama\models\blobs\sha256-5cc5f5f9e4a4f024a707817dc2dba728a9c8574b47ce05a82f68d9c886ca664e --port 54457"
time=2025-08-27T14:22:07.309+08:00 level=INFO source=server.go:488 msg="system memory" total="31.4 GiB" free="15.3 GiB" free_swap="18.7 GiB"
time=2025-08-27T14:22:07.310+08:00 level=INFO source=server.go:528 msg=offload library=cuda layers.requested=-1 layers.model=37 layers.offload=36 layers.split=[36] memory.available="[7.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="9.5 GiB" memory.required.partial="6.1 GiB" memory.required.kv="144.0 MiB" memory.required.allocations="[6.1 GiB]" memory.weights.total="5.7 GiB" memory.weights.repeating="5.2 GiB" memory.weights.nonrepeating="593.5 MiB" memory.graph.full="192.0 MiB" memory.graph.partial="192.0 MiB" projector.weights="1.2 GiB" projector.graph="1.6 GiB"
time=2025-08-27T14:22:07.325+08:00 level=INFO source=runner.go:1006 msg="starting ollama engine"
time=2025-08-27T14:22:07.329+08:00 level=INFO source=runner.go:1043 msg="Server listening on 127.0.0.1:54457"
time=2025-08-27T14:22:07.333+08:00 level=INFO source=runner.go:925 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:4096 KvCacheType: NumThreads:8 GPULayers:36[ID:GPU-a5d2d457-1c84-896e-3d55-359d99f0b824 Layers:36(0..35)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-08-27T14:22:07.348+08:00 level=INFO source=ggml.go:130 msg="" architecture=qwen25vl file_type=F16 name="" description="" num_tensors=953 num_key_values=36
time=2025-08-27T14:22:07.425+08:00 level=WARN source=sched.go:652 msg="gpu VRAM usage didn't recover within timeout" seconds=5.2669044 runner.size="9.5 GiB" runner.vram="6.1 GiB" runner.parallel=1 runner.pid=8464 runner.model=C:\Users\94265\.ollama\models\blobs\sha256-5cc5f5f9e4a4f024a707817dc2dba728a9c8574b47ce05a82f68d9c886ca664e
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5080 Laptop GPU, compute capability 12.0, VMM: yes, ID: GPU-a5d2d457-1c84-896e-3d55-359d99f0b824
load_backend: loaded CUDA backend from C:\Users\94265\AppData\Local\Programs\Ollama\lib\ollama\ggml-cuda.dll
load_backend: loaded CPU backend from C:\Users\94265\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
time=2025-08-27T14:22:07.474+08:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-08-27T14:22:07.675+08:00 level=WARN source=sched.go:652 msg="gpu VRAM usage didn't recover within timeout" seconds=5.5168821 runner.size="9.5 GiB" runner.vram="6.1 GiB" runner.parallel=1 runner.pid=8464 runner.model=C:\Users\94265\.ollama\models\blobs\sha256-5cc5f5f9e4a4f024a707817dc2dba728a9c8574b47ce05a82f68d9c886ca664e
time=2025-08-27T14:22:07.738+08:00 level=INFO source=ggml.go:486 msg="offloading 36 repeating layers to GPU"
time=2025-08-27T14:22:07.738+08:00 level=INFO source=ggml.go:490 msg="offloading output layer to CPU"
time=2025-08-27T14:22:07.738+08:00 level=INFO source=ggml.go:497 msg="offloaded 36/37 layers to GPU"
time=2025-08-27T14:22:07.739+08:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA0 size="5.2 GiB"
time=2025-08-27T14:22:07.739+08:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="2.4 GiB"
time=2025-08-27T14:22:07.739+08:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA0 size="144.0 MiB"
time=2025-08-27T14:22:07.739+08:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA0 size="364.0 MiB"
time=2025-08-27T14:22:07.739+08:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="1.6 GiB"
time=2025-08-27T14:22:07.739+08:00 level=INFO source=backend.go:342 msg="total memory" size="9.6 GiB"
time=2025-08-27T14:22:07.739+08:00 level=INFO source=sched.go:473 msg="loaded runners" count=1
time=2025-08-27T14:22:07.739+08:00 level=INFO source=server.go:1231 msg="waiting for llama runner to start responding"
time=2025-08-27T14:22:07.739+08:00 level=INFO source=server.go:1265 msg="waiting for server to become available" status="llm server loading model"
time=2025-08-27T14:22:08.992+08:00 level=INFO source=server.go:1269 msg="llama runner started in 1.71 seconds"
[GIN] 2025/08/27 - 14:22:08 | 200 | 6.8848292s | 127.0.0.1 | POST "/api/generate"
C:/a/ollama/ollama/ml/backend/ggml/ggml/src/ggml-cpu/ops.cpp:6930: fatal error
C:/a/ollama/ollama/ml/backend/ggml/ggml/src/ggml-cpu/ops.cpp:6930: fatal error
time=2025-08-27T14:22:41.919+08:00 level=ERROR source=server.go:1439 msg="post predict" error="Post "http://127.0.0.1:54457/completion": read tcp 127.0.0.1:54460->127.0.0.1:54457: wsarecv: An existing connection was forcibly closed by the remote host."
[GIN] 2025/08/27 - 14:22:41 | 200 | 829.1717ms | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/08/27 - 14:25:12 | 200 | 1.0465ms | 127.0.0.1 | GET "/api/tags"
```

same problem


@jessegross commented on GitHub (Aug 27, 2025):

@BiggRanger OLLAMA_NEW_ESTIMATES is only available with the new Ollama engine and phi4 isn't implemented there yet, so this won't have an effect.

@czj942650673 Please file a new issue. None of the issues that you are adding your log to are the same.
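For anyone trying to confirm whether variables like `OLLAMA_NEW_ENGINE` or `OLLAMA_NEW_ESTIMATES` are actually in effect: the server prints every setting in its `server config` log line at startup, so you can grep that line instead of guessing. A minimal sketch — the sample log line below is abridged from the output above; on a systemd install like the reporter's, you would feed it from `journalctl -u ollama.service` instead of the `echo`:

```shell
# One "server config" startup log line (abridged from the log above) as sample input.
config_line='msg="server config" env="map[OLLAMA_NEW_ENGINE:false OLLAMA_NEW_ESTIMATES:false OLLAMA_NUM_PARALLEL:1]"'

# Pull out the engine/estimates flags. On a live systemd install, replace the
# echo with:  journalctl -u ollama.service | grep 'server config' | tail -1
echo "$config_line" | grep -o 'OLLAMA_NEW_E[A-Z]*:[a-z]*'
```

This prints `OLLAMA_NEW_ENGINE:false` and `OLLAMA_NEW_ESTIMATES:false` for the sample line. To change a setting on a systemd install, add an `Environment="OLLAMA_NEW_ENGINE=1"` line under `[Service]` via `sudo systemctl edit ollama.service`, then restart the service and re-check the log line.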

Reference: github-starred/ollama#54549