[GH-ISSUE #15352] [Regression 0.20.x] "memory layout cannot be allocated" for all models >10 GB on Windows — GPU abandoned after vision-encoder CPU buffer allocation failure #35581

Open
opened 2026-04-22 20:10:24 -05:00 by GiteaMirror · 11 comments

Originally created by @Issueposter on GitHub (Apr 5, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15352

Environment

  • OS: Windows 10 Home 22H2 (build 19045)
  • GPU: NVIDIA RTX 3090 24 GB (CUDA 12.x)
  • Ollama version: 0.20.2 (auto-updated ~April 5 2026)
  • Previously working version: 0.19.x

Affected models (all fail)

| Model | Size | Error |
|-------|------|-------|
| qwen3.5:35b-a3b-q4_K_M | ~23 GB | memory layout cannot be allocated |
| qwen3.5:35b-8k | ~23 GB | memory layout cannot be allocated |
| gemma4:31b | ~19 GB | memory layout cannot be allocated |

Working model

  • gemma4:e2b (7.2 GB) — loads and runs correctly

Observed behaviour

When loading any model larger than ~10 GB, Ollama logs show:

  1. Vision encoder tensors load to GPU successfully
  2. A subsequent CPU buffer allocation (~20 MB) fails
  3. Ollama then falls back to CPU-only mode
  4. CPU RAM is insufficient for the full model → generation fails or produces garbage

The model never recovers GPU access once the CPU fallback is triggered. Restarting Ollama does not help.

Workarounds attempted

  • Restarting the Ollama service
  • Running Ollama in an interactive desktop session (not Session 0) to ensure GPU access
  • Reducing num_ctx to 8192
  • Pulling the model fresh

None resolved the issue. The 7.2 GB model works, confirming the GPU is functional.

Regression

This worked correctly on Ollama 0.19.x. The issue appeared immediately after an automatic update to 0.20.2.

Additional info

The failure appears to be in memory layout/planning for large models — specifically a CPU staging buffer allocation that regressed in 0.20.x. Once that allocation fails, the GPU is abandoned entirely for the session.

@rick-github commented on GitHub (Apr 5, 2026):

Server logs (https://docs.ollama.com/troubleshooting) will aid in debugging.

@Issueposter commented on GitHub (Apr 6, 2026):

Server logs (fresh reproduction, 2026-04-06)

As requested. Reproduces 100% on demand with gemma4:31b (19.6 GiB) on RTX 3090 24 GB, Ollama 0.20.2, Windows 10.

ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from ...cuda_v13\ggml-cuda.dll
time=... level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882
time=... level=INFO source=model.go:138 msg="vision: decode" elapsed=3.6871ms bounds=(0,0)-(2048,2048)
time=... level=INFO source=model.go:145 msg="vision: preprocess" elapsed=139.1741ms size="[768 768]"
time=... level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
time=... level=INFO source=model.go:156 msg="vision: encoded" elapsed=144.502ms shape="[5376 256]"

# Fit pass: calculates 58 GPU layers (15.7 GiB) + 2 CPU layers (3.8 GiB)
time=... level=INFO source=runner.go:1290 msg=load request="{Operation:fit ... GPULayers:58[...Layers:58(2..59)]}"

# Alloc pass begins — CPU buffer for embedding layers (0-1) fails first:
time=... level=INFO source=runner.go:1290 msg=load request="{Operation:alloc ... GPULayers:58[...Layers:58(2..59)]}"
ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 4121011456
alloc_tensor_range: failed to allocate CPU buffer of size 4121011456

# Fallback: try to fit ALL layers on GPU (no CPU needed) — VRAM exhausted:
time=... level=INFO source=runner.go:1290 msg=load request="{Operation:alloc ... GPULayers:60[...Layers:60(0..59)]}"
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 16687.64 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 17498263552

time=... level=INFO source=runner.go:1290 msg=load request="{Operation:alloc ... GPULayers:59[...Layers:59(1..59)]}"
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 16396.80 MiB on device 0: cudaMalloc failed: out of memory

time=... level=INFO source=runner.go:1290 msg=load request="{Operation:alloc ... GPULayers:58[...Layers:58(2..59)]}"
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 16105.95 MiB on device 0: cudaMalloc failed: out of memory

# Backoff loop (0.10, 0.20, ...) — same CUDA failures repeat
time=... level=INFO source=server.go:881 msg="model layout did not fit, applying backoff" backoff=0.10
...
time=... level=INFO source=server.go:881 msg="model layout did not fit, applying backoff" backoff=0.20

# Model weights that fit calculation proposed:
time=... level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="15.7 GiB"
time=... level=INFO source=device.go:245 msg="model weights" device=CPU size="3.8 GiB"
time=... level=INFO source=device.go:272 msg="total memory" size="19.6 GiB"

time=... level=INFO source=sched.go:511 msg="Load failed" error="model failed to load, this may be due to resource limitations or an internal error"
time=... level=ERROR source=server.go:304 msg="llama runner terminated" error="exit status 1"
[GIN] ... | 500 | 1m0s | POST "/api/generate"

Root cause analysis

The fit pass correctly determines 58 layers (15.7 GiB) fit in VRAM with 2 embedding layers (3.8 GiB) staying on CPU. But during alloc:

  1. The CPU buffer allocation of 3.8 GiB fails (likely Windows memory pressure or contiguous allocation requirement)
  2. Ollama then tries to push all 60 layers to GPU (16.4–16.7 GiB) to avoid CPU — cudaMalloc fails (insufficient headroom in 24 GiB VRAM)
  3. Backoff loop retries the same combinations and fails the same way

The 7.2 GiB gemma4:e2b model that does work fits entirely in GPU VRAM without requiring any CPU buffer, bypassing both failure paths.
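
To make the sequence concrete, here is a minimal Go-style sketch of the flow as I read it from these logs; all function and type names are illustrative, not Ollama's actual internals:

```go
// Minimal sketch of the fit -> alloc -> fallback sequence above.
// All names are hypothetical; this is not Ollama's real API.
package main

import "fmt"

type layout struct{ gpuLayers, cpuLayers int }

// fit plans a layout from size estimates only; nothing is allocated yet.
func fit() layout { return layout{gpuLayers: 58, cpuLayers: 2} }

// Stand-ins for the ggml buffer allocations that fail in the logs.
func allocCPU(layout) error { return fmt.Errorf("failed to allocate CPU buffer") }
func allocGPU(layout) error { return fmt.Errorf("cudaMalloc failed: out of memory") }

func main() {
	plan := fit() // 58 GPU layers + 2 CPU layers, per the fit pass
	if err := allocCPU(plan); err != nil {
		// The CPU staging buffer fails first, so the loader retries with
		// more layers pushed to the GPU (60, then 59, then 58), which now
		// needs a 16+ GiB cudaMalloc and fails as well.
		for layers := 60; layers >= 58; layers-- {
			if err := allocGPU(layout{gpuLayers: layers}); err != nil {
				fmt.Printf("GPULayers:%d -> %v\n", layers, err)
			}
		}
		// The backoff loop (0.10, 0.20, ...) then retries the same
		// combinations and fails the same way.
	}
}
```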

This appears to be a regression in how 0.20.x handles the CPU buffer pre-allocation for embedding layers when split loading. On 0.19.x the same models loaded and ran correctly on this identical hardware.

OLLAMA_NUM_CTX=8192 was set to reduce KV cache pressure — no effect on this failure.

@Issueposter commented on GitHub (Apr 6, 2026):

Further investigation — root cause identified

After exhaustive testing, the root cause appears to be a change from mmap to pre-allocation in 0.20.x, interacting with Windows WDDM GPU memory budgeting.

System specs (clarification)

  • GPU: RTX 3090, 24 GiB VRAM — verified 23.4 GiB free at test time via nvidia-smi
  • RAM: 16 GiB total, ~9.9 GiB free at idle

Why cudaMalloc fails despite free VRAM

On Windows WDDM, GPU memory allocations are backed by system RAM. With only ~9–10 GiB of free RAM, Windows limits the effective per-process VRAM budget to significantly less than physical VRAM — even though nvidia-smi shows 23 GiB "free". This causes cudaMalloc to fail for allocations above ~8–10 GiB.
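
To sanity-check the free-RAM vs free-VRAM numbers on a box like this, a rough Go sketch that shells out to nvidia-smi and wmic (both real tools; parsing and error handling are simplified for illustration):

```go
// Rough sketch: compare free system RAM against free VRAM on Windows.
package main

import (
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

func main() {
	// Free VRAM in MiB, e.g. "23124".
	out, err := exec.Command("nvidia-smi",
		"--query-gpu=memory.free", "--format=csv,noheader,nounits").Output()
	if err != nil {
		panic(err)
	}
	vramMiB, _ := strconv.Atoi(strings.TrimSpace(string(out)))

	// Free physical RAM in KiB, e.g. "FreePhysicalMemory=10190520".
	out, err = exec.Command("wmic", "OS", "get", "FreePhysicalMemory", "/value").Output()
	if err != nil {
		panic(err)
	}
	ramKiB := 0
	for _, line := range strings.Split(string(out), "\n") {
		if v, ok := strings.CutPrefix(strings.TrimSpace(line), "FreePhysicalMemory="); ok {
			ramKiB, _ = strconv.Atoi(v)
		}
	}

	fmt.Printf("free VRAM: %.1f GiB, free RAM: %.1f GiB\n",
		float64(vramMiB)/1024, float64(ramKiB)/(1024*1024))
	// Under the WDDM hypothesis, the usable cudaMalloc budget would track
	// the smaller of the two, not the nvidia-smi figure alone.
}
```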

All 0.20.x load requests show UseMmap:false. This means the runner pre-allocates the entire model weight buffer upfront as a single contiguous CUDA allocation. For a 19.6 GiB model, this requires enough free system RAM to back the CUDA allocation — which the 16 GiB system cannot provide.

The 0.19.x difference

On 0.19.x, these exact models loaded and ran correctly on this identical hardware. The most likely explanation is that 0.19.x used UseMmap:true, meaning model weights were loaded lazily via memory-mapped files — no large contiguous upfront allocation needed, bypassing the WDDM budget constraint.
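
To illustrate the two loading strategies I'm contrasting (this is not Ollama's actual loader code), in Go with the golang.org/x/exp/mmap package:

```go
// Illustrative only: eager loading vs. memory-mapped loading.
package main

import (
	"fmt"
	"os"

	"golang.org/x/exp/mmap"
)

// eagerLoad reads the whole file into memory: for a 19.6 GiB model the OS
// must commit ~19.6 GiB of RAM (or swap) up front. This corresponds to the
// UseMmap:false behaviour shown in the 0.20.x load requests.
func eagerLoad(path string) ([]byte, error) {
	return os.ReadFile(path)
}

// mmapLoad maps the file instead: pages fault in on demand and can be
// evicted under memory pressure, so no large upfront commit is needed.
// This matches the hypothesised 0.19.x UseMmap:true behaviour.
func mmapLoad(path string) (*mmap.ReaderAt, error) {
	return mmap.Open(path)
}

func main() {
	r, err := mmapLoad("model.gguf") // hypothetical path
	if err != nil {
		panic(err)
	}
	defer r.Close()
	fmt.Println("mapped", r.Len(), "bytes; pages not yet committed")
}
```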

Full backoff trace (gemma4:31b, num_ctx=2048, from today)

backoff 0.00: GPULayers:58(2..59) → cudaMalloc failed: 16105.95 MiB
backoff 0.00: GPULayers:60(0..59) → CPU buffer failed: 3511065792 bytes
backoff 0.10: GPULayers:56       → cudaMalloc failed: 15524.26 MiB  
backoff 0.40: GPULayers:48       → cudaMalloc failed: 13268.36 MiB
backoff 0.50: GPULayers:39       → cudaMalloc failed: 8642.55 MiB
backoff 0.60: GPULayers:31       → cudaMalloc failed: 8642.55 MiB
backoff 0.70: GPULayers:23       → CPU buffer failed: 14265558368 bytes (~13.3 GiB)
backoff 0.80: GPULayers:14       → CPU buffer failed: 16828392064 bytes (~15.7 GiB)
backoff 0.90: GPULayers:6        → CPU buffer failed: 19176829824 bytes (~17.9 GiB)
backoff 1.00: GPULayers:[]       → CPU buffer failed: 21009249344 bytes (~19.6 GiB — impossible on 16 GiB RAM)

The model cannot load in any configuration: with too many GPU layers the allocation exceeds the WDDM budget, and with too few the CPU share exceeds physical RAM.
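
Plugging rough numbers from the trace into a quick feasibility scan makes the squeeze explicit; the budget constants below are my estimates inferred from the failed allocations, not measured limits:

```go
// Feasibility scan over GPU layer counts using rough numbers from the
// trace above. Budget constants are estimates, not measured limits.
package main

import "fmt"

func main() {
	const (
		totalGiB    = 19.6              // gemma4:31b weights
		layers      = 60                // repeating layers
		perLayerGiB = totalGiB / layers // ~0.33 GiB/layer
		gpuBudget   = 8.4               // a cudaMalloc of 8642.55 MiB already failed
		cpuBudget   = 9.7               // free RAM at load time
	)
	feasible := false
	for g := 0; g <= layers; g++ {
		gpuGiB := float64(g) * perLayerGiB
		cpuGiB := totalGiB - gpuGiB
		if gpuGiB <= gpuBudget && cpuGiB <= cpuBudget {
			fmt.Printf("GPULayers:%d fits (GPU %.1f GiB, CPU %.1f GiB)\n", g, gpuGiB, cpuGiB)
			feasible = true
		}
	}
	if !feasible {
		fmt.Println("no layer split satisfies both budgets")
	}
}
```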

Suggested fix

Re-enable UseMmap:true (or equivalent lazy loading) for model weights on Windows, restoring 0.19.x behavior. This would allow the weights to be paged in on demand rather than requiring a single upfront contiguous allocation that exceeds the WDDM-constrained budget.
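
Purely as a sketch of how small the change might be: everything below except the UseMmap field name (taken from the load-request logs) is invented, since I have not read the 0.20.x source:

```go
// Hypothetical shape of the fix, keyed off the UseMmap field visible in
// the load-request logs. The surrounding type and plumbing are invented.
package main

import (
	"fmt"
	"runtime"
)

type LoadRequest struct {
	UseMmap bool
	// ... other fields seen in the logs: Operation, GPULayers, KvSize, etc.
}

func newLoadRequest() LoadRequest {
	var req LoadRequest
	// Restore the 0.19.x behaviour: page weights in lazily on Windows
	// instead of committing one large upfront allocation.
	if runtime.GOOS == "windows" {
		req.UseMmap = true
	}
	return req
}

func main() {
	fmt.Printf("%+v\n", newLoadRequest())
}
```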

The 7.2 GiB gemma4:e2b model continues to work because it falls within the available WDDM budget on this system — confirming the GPU and CUDA stack are fully functional.

@rick-github commented on GitHub (Apr 6, 2026):

Post the complete log.

@Issueposter commented on GitHub (Apr 6, 2026):

Complete server log (correcting previous analysis)

Correction: My previous WDDM budget hypothesis was wrong. The original log (first attempt after Ollama restart) shows 32 MB and 64 MB allocations failing with 9.7 GiB free RAM and 22.6 GiB free VRAM. This is conclusively an allocator bug, not a resource limitation.


Server startup + first load attempt (2026-04-05 22:42, immediately after Ollama restart)

time=...22:42:12 level=INFO source=routes.go:1744 msg="server config" env="map[...OLLAMA_CONTEXT_LENGTH:16384 OLLAMA_FLASH_ATTENTION:false OLLAMA_KV_CACHE_TYPE:q8_0 ... OLLAMA_NEW_ENGINE:false ...]"
time=...22:42:12 level=INFO source=routes.go:1802 msg="Listening on [::]:11434 (version 0.20.2)"
time=...22:42:14 level=INFO source=types.go:42 msg="inference compute" id=GPU-84f27e06... name=CUDA0 description="NVIDIA GeForce RTX 3090" total="24.0 GiB" available="22.6 GiB"
time=...22:42:14 level=INFO source=routes.go:1852 msg="vram-based default context" total_vram="24.0 GiB" default_num_ctx=32768

# Load request for gemma4:31b triggered
time=...22:42:33 level=INFO source=sched.go:484 msg="system memory" total="15.9 GiB" free="9.7 GiB" free_swap="21.7 GiB"
time=...22:42:33 level=INFO source=sched.go:491 msg="gpu memory" ... available="22.1 GiB" free="22.6 GiB" minimum="457.0 MiB" overhead="0 B"
time=...22:42:33 level=INFO source=server.go:759 msg="loading model" "model layers"=61 requested=-1

# Fit pass → 58 GPU layers (2..59), 3 CPU layers (0, 1, 60)
time=...22:42:33 level=INFO source=runner.go:1290 msg=load request="{Operation:fit ... KvSize:16384 GPULayers:61[...Layers:61(0..60)] UseMmap:false}"
time=...22:42:34 level=INFO source=runner.go:1290 msg=load request="{Operation:fit ... GPULayers:58[...Layers:58(2..59)] UseMmap:false}"

# Alloc pass — 32 MB CPU buffer fails with 9.7 GiB free RAM:
time=...22:42:34 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc ... GPULayers:58[...Layers:58(2..59)] UseMmap:false}"
ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 33554432

# Fallback to all-GPU — 64 MB CUDA fails with 22.6 GiB free VRAM:
time=...22:42:44 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc ... GPULayers:60[...Layers:60(0..59)] UseMmap:false}"
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 64.00 MiB on device 0: cudaMalloc failed: out of memory

# Same pattern repeats across all backoff levels (0.10 → 1.00):
# backoff 0.00: CPU 32 MB fails / CUDA 64 MB fails
# backoff 0.10: CPU 32 MB fails / CUDA 64 MB fails  
# backoff 0.20: CUDA 64 MB fails / CPU 36 MB fails
# backoff 0.30: CPU 36 MB fails
# backoff 0.40: CPU 64 MB fails
# backoff 0.50: CPU 36 MB fails
# backoff 0.60: CPU 36 MB fails
# backoff 0.70: CPU 36 MB fails
# backoff 0.80: CPU 36 MB fails
# backoff 0.90: CPU 32 MB fails
# backoff 1.00: CPU 36 MB fails (GPULayers:[] — pure CPU)

time=...22:43:47 level=WARN source=server.go:875 msg="memory layout cannot be allocated"
  memory.InputWeights=1250426880
  memory.CPU.Weights="[304972832 304972832 ... 2260638912]"
  memory.CPU.Cache="[75497472 75497472 75497472 0 0 0 ...]"
  memory.CPU.Graph=392822784

time=...22:43:47 level=INFO source=device.go:245 msg="model weights" device=CPU size="19.6 GiB"
time=...22:43:47 level=INFO source=device.go:256 msg="kv cache" device=CPU size="216.0 MiB"
time=...22:43:47 level=INFO source=device.go:267 msg="compute graph" device=CPU size="374.6 MiB"
time=...22:43:47 level=INFO source=device.go:272 msg="total memory" size="20.1 GiB"
time=...22:43:47 level=INFO source=sched.go:511 msg="Load failed" error="memory layout cannot be allocated"
[GIN] 2026/04/05 - 22:43:48 | 500 | 1m15s | 192.168.4.21 | POST "/api/generate"

Key observations

  1. 32 MB CPU allocation fails with 9.7 GiB free RAM. This cannot be a resource limitation — it is an allocator bug.
  2. 64 MB cudaMalloc fails with 22.6 GiB free VRAM. Same conclusion.
  3. The failed allocations are consistent sizes (32–64 MB) across all backoff levels regardless of GPU layer count, suggesting it is the same code path failing every time.
  4. memory.CPU.Cache in the final warning shows per-layer KV cache sizes of 75,497,472 bytes (~72 MB) for the first 3 layers. The failing 32–36 MB allocations are likely K or V cache tensor allocations for individual CPU layers, failing before the full buffer is attempted (quick byte-to-MiB conversions after this list).
  5. At backoff 1.00 (full CPU, no GPU), total requirement is 20.1 GiB against 15.9 GiB total system RAM — impossible regardless of the allocator bug. But this is a secondary issue; the primary failure occurs before even reaching a point where total RAM could be the constraint.
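
For reference, a tiny Go check of the byte counts above (the 16384-token KvSize is taken from the load-request logs):

```go
// Sanity-check the byte counts quoted above.
package main

import "fmt"

func main() {
	fmt.Printf("%d bytes = %d MiB\n", 33554432, 33554432/(1<<20)) // failing CPU buffer: 32 MiB
	fmt.Printf("%d bytes = %d MiB\n", 75497472, 75497472/(1<<20)) // per-layer KV cache: 72 MiB
	// Spread over the logged KvSize of 16384 tokens:
	fmt.Printf("%d bytes/token/layer (K+V combined)\n", 75497472/16384) // 4608
}
```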

My earlier hypothesis about WDDM GPU memory budgeting was incorrect — the 32 MB and 64 MB failures rule that out entirely.

@rick-github commented on GitHub (Apr 6, 2026):

Post the complete log.

@Issueposter commented on GitHub (Apr 6, 2026):

Complete server.log — first load attempt (Ollama 0.20.2 fresh start, 2026-04-05):

time=2026-04-05T22:42:12.603+01:00 level=INFO source=routes.go:1744 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GGML_VK_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:16384 OLLAMA_DEBUG:INFO OLLAMA_DEBUG_LOG_REQUESTS:false OLLAMA_EDITOR: OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_KEEP_ALIVE:30m0s OLLAMA_KV_CACHE_TYPE:q8_0 OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\\Users\\patri\\.ollama\\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NO_CLOUD:true OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false OLLAMA_VULKAN:false ROCR_VISIBLE_DEVICES:]"

time=2026-04-05T22:42:12.603+01:00 level=INFO source=routes.go:1746 msg="Ollama cloud disabled: true"

time=2026-04-05T22:42:12.610+01:00 level=INFO source=images.go:499 msg="total blobs: 18"

time=2026-04-05T22:42:12.612+01:00 level=INFO source=images.go:506 msg="total unused blobs removed: 0"

time=2026-04-05T22:42:12.613+01:00 level=INFO source=routes.go:1802 msg="Listening on [::]:11434 (version 0.20.2)"

time=2026-04-05T22:42:12.615+01:00 level=INFO source=runner.go:67 msg="discovering available GPUs..."

time=2026-04-05T22:42:12.629+01:00 level=INFO source=server.go:432 msg="starting runner" cmd="C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58389"

time=2026-04-05T22:42:13.542+01:00 level=INFO source=server.go:432 msg="starting runner" cmd="C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58403"

time=2026-04-05T22:42:13.810+01:00 level=INFO source=server.go:432 msg="starting runner" cmd="C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58417"

time=2026-04-05T22:42:14.117+01:00 level=INFO source=runner.go:106 msg="experimental Vulkan support disabled.  To enable, set OLLAMA_VULKAN=1"

time=2026-04-05T22:42:14.118+01:00 level=INFO source=server.go:432 msg="starting runner" cmd="C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58432"

time=2026-04-05T22:42:14.118+01:00 level=INFO source=server.go:432 msg="starting runner" cmd="C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58431"

time=2026-04-05T22:42:14.422+01:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e filter_id="" library=CUDA compute=8.6 name=CUDA0 description="NVIDIA GeForce RTX 3090" libdirs=ollama,cuda_v13 driver=13.1 pci_id=0000:09:00.0 type=discrete total="24.0 GiB" available="22.6 GiB"

time=2026-04-05T22:42:14.422+01:00 level=INFO source=routes.go:1852 msg="vram-based default context" total_vram="24.0 GiB" default_num_ctx=32768

[GIN] 2026/04/05 - 22:42:14 | 200 |            0s |       127.0.0.1 | GET      "/api/version"

[GIN] 2026/04/05 - 22:42:14 | 200 |            0s |       127.0.0.1 | GET      "/api/version"

[GIN] 2026/04/05 - 22:42:14 | 200 |       527.1µs |       127.0.0.1 | GET      "/api/version"

[GIN] 2026/04/05 - 22:42:14 | 401 |    139.5814ms |       127.0.0.1 | POST     "/api/me"

[GIN] 2026/04/05 - 22:42:14 | 401 |    139.0543ms |       127.0.0.1 | POST     "/api/me"

[GIN] 2026/04/05 - 22:42:14 | 200 |     298.269ms |       127.0.0.1 | POST     "/api/show"

time=2026-04-05T22:42:33.105+01:00 level=INFO source=server.go:432 msg="starting runner" cmd="C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58484"

time=2026-04-05T22:42:33.370+01:00 level=INFO source=cpu_windows.go:148 msg=packages count=1

time=2026-04-05T22:42:33.370+01:00 level=INFO source=cpu_windows.go:195 msg="" package=0 cores=8 efficiency=0 threads=16

time=2026-04-05T22:42:33.564+01:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883

time=2026-04-05T22:42:33.564+01:00 level=WARN source=server.go:258 msg="quantized kv cache requested but flash attention disabled" type=q8_0

time=2026-04-05T22:42:33.564+01:00 level=INFO source=server.go:432 msg="starting runner" cmd="C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model C:\\Users\\patri\\.ollama\\models\\blobs\\sha256-280af6832eca23cb322c4dcc65edfea98a21b8f8ab07dc7553bd6f7e6e7a3313 --port 58499"

time=2026-04-05T22:42:33.576+01:00 level=INFO source=sched.go:484 msg="system memory" total="15.9 GiB" free="9.7 GiB" free_swap="21.7 GiB"

time=2026-04-05T22:42:33.576+01:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA available="22.1 GiB" free="22.6 GiB" minimum="457.0 MiB" overhead="0 B"

time=2026-04-05T22:42:33.576+01:00 level=INFO source=server.go:759 msg="loading model" "model layers"=61 requested=-1

time=2026-04-05T22:42:33.691+01:00 level=INFO source=runner.go:1417 msg="starting ollama engine"

time=2026-04-05T22:42:33.692+01:00 level=INFO source=runner.go:1452 msg="Server listening on 127.0.0.1:58499"

time=2026-04-05T22:42:33.694+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:61[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:61(0..60)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"

time=2026-04-05T22:42:33.764+01:00 level=INFO source=ggml.go:136 msg="" architecture=gemma4 file_type=Q4_K_M name="" description="" num_tensors=1189 num_key_values=49

load_backend: loaded CPU backend from C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no

ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no

ggml_cuda_init: found 1 CUDA devices:

  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, ID: GPU-84f27e06-0203-3d52-a176-80d1f45cd22e

load_backend: loaded CUDA backend from C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama\cuda_v13\ggml-cuda.dll

time=2026-04-05T22:42:33.869+01:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)

time=2026-04-05T22:42:33.877+01:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883

time=2026-04-05T22:42:33.894+01:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=4.2428ms bounds=(0,0)-(2048,2048)

time=2026-04-05T22:42:34.023+01:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=129.4807ms size="[768 768]"

time=2026-04-05T22:42:34.023+01:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3

time=2026-04-05T22:42:34.023+01:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16

time=2026-04-05T22:42:34.029+01:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=140.0853ms shape="[5376 256]"

time=2026-04-05T22:42:34.359+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:58[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:58(2..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"

time=2026-04-05T22:42:34.434+01:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883

time=2026-04-05T22:42:34.444+01:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=1.0878ms bounds=(0,0)-(2048,2048)

time=2026-04-05T22:42:34.578+01:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=133.7935ms size="[768 768]"

time=2026-04-05T22:42:34.578+01:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3

time=2026-04-05T22:42:34.578+01:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16

time=2026-04-05T22:42:34.580+01:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=137.5556ms shape="[5376 256]"

time=2026-04-05T22:42:34.655+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:58[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:58(2..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"

time=2026-04-05T22:42:34.758+01:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883

time=2026-04-05T22:42:34.769+01:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=533.2µs bounds=(0,0)-(2048,2048)

time=2026-04-05T22:42:34.904+01:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=134.9941ms size="[768 768]"

time=2026-04-05T22:42:34.905+01:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3

time=2026-04-05T22:42:34.905+01:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16

time=2026-04-05T22:42:34.906+01:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=137.5916ms shape="[5376 256]"

ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 33554432

time=2026-04-05T22:42:44.469+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:60[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:60(0..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"

time=2026-04-05T22:42:44.677+01:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883

time=2026-04-05T22:42:44.689+01:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=1.0587ms bounds=(0,0)-(2048,2048)

time=2026-04-05T22:42:44.826+01:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=137.0745ms size="[768 768]"

time=2026-04-05T22:42:44.832+01:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3

time=2026-04-05T22:42:44.832+01:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16

time=2026-04-05T22:42:44.833+01:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=144.7476ms shape="[5376 256]"

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 64.00 MiB on device 0: cudaMalloc failed: out of memory

time=2026-04-05T22:42:57.790+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:59[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:59(1..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"

time=2026-04-05T22:42:57.958+01:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883

time=2026-04-05T22:42:57.970+01:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=578.4µs bounds=(0,0)-(2048,2048)

time=2026-04-05T22:42:58.104+01:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=134.4148ms size="[768 768]"

time=2026-04-05T22:42:58.109+01:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3

time=2026-04-05T22:42:58.109+01:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16

time=2026-04-05T22:42:58.110+01:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=140.4811ms shape="[5376 256]"

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 64.00 MiB on device 0: cudaMalloc failed: out of memory

time=2026-04-05T22:43:08.867+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:58[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:58(2..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"

time=2026-04-05T22:43:09.034+01:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883

time=2026-04-05T22:43:09.046+01:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=1.0825ms bounds=(0,0)-(2048,2048)

time=2026-04-05T22:43:09.176+01:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=129.1483ms size="[768 768]"

time=2026-04-05T22:43:09.184+01:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3

time=2026-04-05T22:43:09.184+01:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16

time=2026-04-05T22:43:09.185+01:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=139.2676ms shape="[5376 256]"

ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 33554432

time=2026-04-05T22:43:09.752+01:00 level=INFO source=server.go:881 msg="model layout did not fit, applying backoff" backoff=0.10

time=2026-04-05T22:43:09.754+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:60[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:60(0..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"

time=2026-04-05T22:43:09.920+01:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883

time=2026-04-05T22:43:09.932+01:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=531.5µs bounds=(0,0)-(2048,2048)

time=2026-04-05T22:43:10.070+01:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=137.1554ms size="[768 768]"

time=2026-04-05T22:43:10.072+01:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3

time=2026-04-05T22:43:10.072+01:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16

time=2026-04-05T22:43:10.073+01:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=141.806ms shape="[5376 256]"

ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 33554432

time=2026-04-05T22:43:12.500+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:59[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:59(1..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"

time=2026-04-05T22:43:12.664+01:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883

time=2026-04-05T22:43:12.675+01:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=1.0483ms bounds=(0,0)-(2048,2048)

time=2026-04-05T22:43:12.807+01:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=131.9837ms size="[768 768]"

time=2026-04-05T22:43:12.809+01:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3

time=2026-04-05T22:43:12.809+01:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16

time=2026-04-05T22:43:12.811+01:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=136.6992ms shape="[5376 256]"

ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 33554432

time=2026-04-05T22:43:15.418+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:58[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:58(2..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"

time=2026-04-05T22:43:15.760+01:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883


<!-- gh-comment-id:4191057697 --> @Issueposter commented on GitHub (Apr 6, 2026): Complete `server.log` — first load attempt (Ollama 0.20.2 fresh start, 2026-04-05): ``` time=2026-04-05T22:42:12.603+01:00 level=INFO source=routes.go:1744 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GGML_VK_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:16384 OLLAMA_DEBUG:INFO OLLAMA_DEBUG_LOG_REQUESTS:false OLLAMA_EDITOR: OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_KEEP_ALIVE:30m0s OLLAMA_KV_CACHE_TYPE:q8_0 OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\\Users\\patri\\.ollama\\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NO_CLOUD:true OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false OLLAMA_VULKAN:false ROCR_VISIBLE_DEVICES:]" time=2026-04-05T22:42:12.603+01:00 level=INFO source=routes.go:1746 msg="Ollama cloud disabled: true" time=2026-04-05T22:42:12.610+01:00 level=INFO source=images.go:499 msg="total blobs: 18" time=2026-04-05T22:42:12.612+01:00 level=INFO source=images.go:506 msg="total unused blobs removed: 0" time=2026-04-05T22:42:12.613+01:00 level=INFO source=routes.go:1802 msg="Listening on [::]:11434 (version 0.20.2)" time=2026-04-05T22:42:12.615+01:00 level=INFO source=runner.go:67 msg="discovering available GPUs..." time=2026-04-05T22:42:12.629+01:00 level=INFO source=server.go:432 msg="starting runner" cmd="C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58389" time=2026-04-05T22:42:13.542+01:00 level=INFO source=server.go:432 msg="starting runner" cmd="C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58403" time=2026-04-05T22:42:13.810+01:00 level=INFO source=server.go:432 msg="starting runner" cmd="C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58417" time=2026-04-05T22:42:14.117+01:00 level=INFO source=runner.go:106 msg="experimental Vulkan support disabled. 
To enable, set OLLAMA_VULKAN=1" time=2026-04-05T22:42:14.118+01:00 level=INFO source=server.go:432 msg="starting runner" cmd="C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58432" time=2026-04-05T22:42:14.118+01:00 level=INFO source=server.go:432 msg="starting runner" cmd="C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58431" time=2026-04-05T22:42:14.422+01:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e filter_id="" library=CUDA compute=8.6 name=CUDA0 description="NVIDIA GeForce RTX 3090" libdirs=ollama,cuda_v13 driver=13.1 pci_id=0000:09:00.0 type=discrete total="24.0 GiB" available="22.6 GiB" time=2026-04-05T22:42:14.422+01:00 level=INFO source=routes.go:1852 msg="vram-based default context" total_vram="24.0 GiB" default_num_ctx=32768 [GIN] 2026/04/05 - 22:42:14 | 200 | 0s | 127.0.0.1 | GET "/api/version" [GIN] 2026/04/05 - 22:42:14 | 200 | 0s | 127.0.0.1 | GET "/api/version" [GIN] 2026/04/05 - 22:42:14 | 200 | 527.1A�s | 127.0.0.1 | GET "/api/version" [GIN] 2026/04/05 - 22:42:14 | 401 | 139.5814ms | 127.0.0.1 | POST "/api/me" [GIN] 2026/04/05 - 22:42:14 | 401 | 139.0543ms | 127.0.0.1 | POST "/api/me" [GIN] 2026/04/05 - 22:42:14 | 200 | 298.269ms | 127.0.0.1 | POST "/api/show" time=2026-04-05T22:42:33.105+01:00 level=INFO source=server.go:432 msg="starting runner" cmd="C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58484" time=2026-04-05T22:42:33.370+01:00 level=INFO source=cpu_windows.go:148 msg=packages count=1 time=2026-04-05T22:42:33.370+01:00 level=INFO source=cpu_windows.go:195 msg="" package=0 cores=8 efficiency=0 threads=16 time=2026-04-05T22:42:33.564+01:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883 time=2026-04-05T22:42:33.564+01:00 level=WARN source=server.go:258 msg="quantized kv cache requested but flash attention disabled" type=q8_0 time=2026-04-05T22:42:33.564+01:00 level=INFO source=server.go:432 msg="starting runner" cmd="C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model C:\\Users\\patri\\.ollama\\models\\blobs\\sha256-280af6832eca23cb322c4dcc65edfea98a21b8f8ab07dc7553bd6f7e6e7a3313 --port 58499" time=2026-04-05T22:42:33.576+01:00 level=INFO source=sched.go:484 msg="system memory" total="15.9 GiB" free="9.7 GiB" free_swap="21.7 GiB" time=2026-04-05T22:42:33.576+01:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA available="22.1 GiB" free="22.6 GiB" minimum="457.0 MiB" overhead="0 B" time=2026-04-05T22:42:33.576+01:00 level=INFO source=server.go:759 msg="loading model" "model layers"=61 requested=-1 time=2026-04-05T22:42:33.691+01:00 level=INFO source=runner.go:1417 msg="starting ollama engine" time=2026-04-05T22:42:33.692+01:00 level=INFO source=runner.go:1452 msg="Server listening on 127.0.0.1:58499" time=2026-04-05T22:42:33.694+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:61[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:61(0..60)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2026-04-05T22:42:33.764+01:00 level=INFO source=ggml.go:136 msg="" architecture=gemma4 file_type=Q4_K_M name="" description="" num_tensors=1189 num_key_values=49 load_backend: 
loaded CPU backend from C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, ID: GPU-84f27e06-0203-3d52-a176-80d1f45cd22e
load_backend: loaded CUDA backend from C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama\cuda_v13\ggml-cuda.dll
time=2026-04-05T22:42:33.869+01:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2026-04-05T22:42:33.877+01:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
time=2026-04-05T22:42:33.894+01:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=4.2428ms bounds=(0,0)-(2048,2048)
time=2026-04-05T22:42:34.023+01:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=129.4807ms size="[768 768]"
time=2026-04-05T22:42:34.023+01:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
time=2026-04-05T22:42:34.023+01:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
time=2026-04-05T22:42:34.029+01:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=140.0853ms shape="[5376 256]"
time=2026-04-05T22:42:34.359+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:58[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:58(2..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-05T22:42:34.434+01:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
time=2026-04-05T22:42:34.444+01:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=1.0878ms bounds=(0,0)-(2048,2048)
time=2026-04-05T22:42:34.578+01:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=133.7935ms size="[768 768]"
time=2026-04-05T22:42:34.578+01:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
time=2026-04-05T22:42:34.578+01:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
time=2026-04-05T22:42:34.580+01:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=137.5556ms shape="[5376 256]"
time=2026-04-05T22:42:34.655+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:58[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:58(2..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-05T22:42:34.758+01:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
time=2026-04-05T22:42:34.769+01:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=533.2µs bounds=(0,0)-(2048,2048)
time=2026-04-05T22:42:34.904+01:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=134.9941ms size="[768 768]"
time=2026-04-05T22:42:34.905+01:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
time=2026-04-05T22:42:34.905+01:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
time=2026-04-05T22:42:34.906+01:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=137.5916ms shape="[5376 256]"
ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 33554432
time=2026-04-05T22:42:44.469+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:60[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:60(0..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-05T22:42:44.677+01:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
time=2026-04-05T22:42:44.689+01:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=1.0587ms bounds=(0,0)-(2048,2048)
time=2026-04-05T22:42:44.826+01:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=137.0745ms size="[768 768]"
time=2026-04-05T22:42:44.832+01:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
time=2026-04-05T22:42:44.832+01:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
time=2026-04-05T22:42:44.833+01:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=144.7476ms shape="[5376 256]"
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 64.00 MiB on device 0: cudaMalloc failed: out of memory
time=2026-04-05T22:42:57.790+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:59[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:59(1..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-05T22:42:57.958+01:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
time=2026-04-05T22:42:57.970+01:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=578.4µs bounds=(0,0)-(2048,2048)
time=2026-04-05T22:42:58.104+01:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=134.4148ms size="[768 768]"
time=2026-04-05T22:42:58.109+01:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
time=2026-04-05T22:42:58.109+01:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
time=2026-04-05T22:42:58.110+01:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=140.4811ms shape="[5376 256]"
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 64.00 MiB on device 0: cudaMalloc failed: out of memory
time=2026-04-05T22:43:08.867+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:58[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:58(2..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-05T22:43:09.034+01:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
time=2026-04-05T22:43:09.046+01:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=1.0825ms bounds=(0,0)-(2048,2048)
time=2026-04-05T22:43:09.176+01:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=129.1483ms size="[768 768]"
time=2026-04-05T22:43:09.184+01:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
time=2026-04-05T22:43:09.184+01:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
time=2026-04-05T22:43:09.185+01:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=139.2676ms shape="[5376 256]"
ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 33554432
time=2026-04-05T22:43:09.752+01:00 level=INFO source=server.go:881 msg="model layout did not fit, applying backoff" backoff=0.10
time=2026-04-05T22:43:09.754+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:60[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:60(0..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-06T09:43:09.920+01:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
time=2026-04-05T22:43:09.932+01:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=531.5µs bounds=(0,0)-(2048,2048)
time=2026-04-05T22:43:10.070+01:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=137.1554ms size="[768 768]"
time=2026-04-05T22:43:10.072+01:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
time=2026-04-05T22:43:10.072+01:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
time=2026-04-05T22:43:10.073+01:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=141.806ms shape="[5376 256]"
ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 33554432
time=2026-04-05T22:43:12.500+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:59[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:59(1..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-05T22:43:12.664+01:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
time=2026-04-05T22:43:12.675+01:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=1.0483ms bounds=(0,0)-(2048,2048)
time=2026-04-05T22:43:12.807+01:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=131.9837ms size="[768 768]"
time=2026-04-05T22:43:12.809+01:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
time=2026-04-05T22:43:12.809+01:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
time=2026-04-05T22:43:12.811+01:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=136.6992ms shape="[5376 256]"
ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 33554432
time=2026-04-05T22:43:15.418+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:58[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:58(2..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-05T22:43:15.760+01:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
```
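
In the dump above, the scheduler cycles through candidate layouts (GPULayers 60, 59, 58), re-running the vision-encoder probe each time. Every candidate fails on either a 64 MiB cudaMalloc or the same 33554432-byte (32 MiB) CPU buffer, so the server logs "model layout did not fit, applying backoff" and repeats until it gives up and falls back to CPU. A minimal Go sketch of that retry shape, with hypothetical names and numbers (not Ollama's actual implementation):

```go
// Illustrative sketch of the layer-backoff loop visible in the log above.
// All names and thresholds here are hypothetical.
package main

import (
	"errors"
	"fmt"
)

// tryLayout stands in for the real fit/alloc pass. In this bug report the
// CUDA pool runs out of memory above 58 layers, and even at 58 layers a
// ~32 MiB (33554432-byte) CPU staging buffer cannot be allocated, so no
// layer count ever fits.
func tryLayout(gpuLayers int) error {
	if gpuLayers > 58 {
		return errors.New("cudaMalloc failed: out of memory")
	}
	return errors.New("failed to allocate buffer of size 33554432")
}

func main() {
	const rounds = 2 // "model layout did not fit, applying backoff"
	for r := 0; r < rounds; r++ {
		for layers := 60; layers >= 58; layers-- {
			if err := tryLayout(layers); err != nil {
				fmt.Printf("GPULayers:%d -> %v\n", layers, err)
				continue // back off and try fewer GPU layers
			}
			fmt.Printf("fit with %d GPU layers\n", layers)
			return
		}
	}
	// Every candidate layout failed: the runner abandons the GPU and
	// falls back to CPU-only for the rest of the session.
	fmt.Println("memory layout cannot be allocated")
}
```

The point of the sketch: once the CPU-side allocation fails unconditionally, no amount of reducing GPU layers can succeed, which matches the reported behaviour of the GPU being abandoned entirely.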

@rick-github commented on GitHub (Apr 6, 2026):

Set `OLLAMA_DEBUG=2` in the server environment and post the complete log, from the `server config` line through to the GIN line that returns a 500 to the client.
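
(Usage note for anyone reproducing this on Windows: the variable can be set persistently with `setx OLLAMA_DEBUG 2` in a terminal, after which the Ollama app or service must be fully restarted to pick it up; `server.log` is typically found under `%LOCALAPPDATA%\Ollama`.)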


@Issueposter commented on GitHub (Apr 6, 2026):

Complete `server.log` with `OLLAMA_DEBUG=2` (part 1/2).
(TRACE-level tensor enumeration lines omitted: 23,731 `created tensor`/`found tensor`/`layer to assign` entries at DEBUG-4. Available on request.)

time=2026-04-06T09:21:39.572+01:00 level=INFO source=routes.go:1744 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GGML_VK_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:16384 OLLAMA_DEBUG:DEBUG-4 OLLAMA_DEBUG_LOG_REQUESTS:false OLLAMA_EDITOR: OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_KEEP_ALIVE:30m0s OLLAMA_KV_CACHE_TYPE:q8_0 OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\\Users\\patri\\.ollama\\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NO_CLOUD:true OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false OLLAMA_VULKAN:false ROCR_VISIBLE_DEVICES:]"
time=2026-04-06T09:21:39.572+01:00 level=INFO source=routes.go:1746 msg="Ollama cloud disabled: true"
time=2026-04-06T09:21:39.575+01:00 level=INFO source=images.go:499 msg="total blobs: 18"
time=2026-04-06T09:21:39.577+01:00 level=INFO source=images.go:506 msg="total unused blobs removed: 0"
time=2026-04-06T09:21:39.579+01:00 level=INFO source=routes.go:1802 msg="Listening on [::]:11434 (version 0.20.2)"
time=2026-04-06T09:21:39.579+01:00 level=DEBUG source=sched.go:145 msg="starting llm scheduler"
time=2026-04-06T09:21:39.581+01:00 level=INFO source=runner.go:67 msg="discovering available GPUs..."
time=2026-04-06T09:21:39.594+01:00 level=INFO source=server.go:432 msg="starting runner" cmd="C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 64242"
time=2026-04-06T09:21:39.594+01:00 level=DEBUG source=server.go:433 msg=subprocess OLLAMA_NO_CLOUD=1 OLLAMA_KV_CACHE_TYPE=q8_0 OLLAMA_NUM_PARALLEL=1 OLLAMA_KEEP_ALIVE=30m PATH="C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\lib\\ollama;C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v13;C:\\WINDOWS\\system32;C:\\WINDOWS;C:\\WINDOWS\\System32\\Wbem;C:\\WINDOWS\\System32\\WindowsPowerShell\\v1.0\\;C:\\WINDOWS\\System32\\OpenSSH\\;C:\\Program Files\\Git\\cmd;C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\Docker\\Docker\\resources\\bin;C:\\Program Files\\dotnet\\;C:\\Program Files\\nodejs\\;C:\\Program Files\\CMake\\bin;C:\\Users\\patri\\AppData\\Local\\Programs\\Python\\Launcher\\;C:\\Users\\patri\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\patri\\AppData\\Local\\Programs\\Microsoft VS Code\\bin;C:\\Users\\patri\\.local\\bin;C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama;C:\\Users\\patri\\AppData\\Local\\Python\\bin;C:\\Users\\patri\\AppData\\Roaming\\npm" OLLAMA_CONTEXT_LENGTH=16384 OLLAMA_DEBUG=2 OLLAMA_MODELS=C:\Users\patri\.ollama\models OLLAMA_NUM_CTX=8192 OLLAMA_FLASH_ATTENTION=0 OLLAMA_HOST=0.0.0.0 OLLAMA_LIBRARY_PATH=C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama;C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama\cuda_v13
time=2026-04-06T09:21:39.728+01:00 level=INFO source=runner.go:1417 msg="starting ollama engine"
time=2026-04-06T09:21:39.729+01:00 level=INFO source=runner.go:1452 msg="Server listening on 127.0.0.1:64242"
time=2026-04-06T09:21:39.739+01:00 level=DEBUG source=gguf.go:604 msg=general.architecture type=string
time=2026-04-06T09:21:39.739+01:00 level=DEBUG source=gguf.go:604 msg=tokenizer.ggml.model type=string
time=2026-04-06T09:21:39.740+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
time=2026-04-06T09:21:39.740+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
time=2026-04-06T09:21:39.740+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.file_type default=0
time=2026-04-06T09:21:39.740+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.name default=""
time=2026-04-06T09:21:39.740+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.description default=""
time=2026-04-06T09:21:39.740+01:00 level=INFO source=ggml.go:136 msg="" architecture=llama file_type=unknown name="" description="" num_tensors=0 num_key_values=3
time=2026-04-06T09:21:39.740+01:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama
load_backend: loaded CPU backend from C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll
time=2026-04-06T09:21:39.757+01:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama\cuda_v13
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, ID: GPU-84f27e06-0203-3d52-a176-80d1f45cd22e
load_backend: loaded CUDA backend from C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama\cuda_v13\ggml-cuda.dll
time=2026-04-06T09:21:39.845+01:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2026-04-06T09:21:39.846+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.block_count default=0
time=2026-04-06T09:21:39.846+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.pooling_type default=0
time=2026-04-06T09:21:39.846+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.expert_count default=0
time=2026-04-06T09:21:39.846+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.tokens default="&{size:0 values:[]}"
time=2026-04-06T09:21:39.846+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.scores default="&{size:0 values:[]}"
time=2026-04-06T09:21:39.846+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.token_type default="&{size:0 values:[]}"
time=2026-04-06T09:21:39.846+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.merges default="&{size:0 values:[]}"
time=2026-04-06T09:21:39.846+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.add_bos_token default=true
time=2026-04-06T09:21:39.846+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.bos_token_id default=0
time=2026-04-06T09:21:39.846+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.add_eos_token default=false
time=2026-04-06T09:21:39.846+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.eos_token_id default=0
time=2026-04-06T09:21:39.846+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2026-04-06T09:21:39.846+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.pre default=""
time=2026-04-06T09:21:39.846+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.block_count default=0
time=2026-04-06T09:21:39.846+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.embedding_length default=0
time=2026-04-06T09:21:39.846+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.attention.head_count default=0
time=2026-04-06T09:21:39.846+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.attention.head_count_kv default=0
time=2026-04-06T09:21:39.846+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.attention.key_length default=0
time=2026-04-06T09:21:39.846+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.rope.dimension_count default=0
time=2026-04-06T09:21:39.846+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.attention.layer_norm_rms_epsilon default=0
time=2026-04-06T09:21:39.846+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.rope.freq_base default=100000
time=2026-04-06T09:21:39.846+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.rope.scaling.factor default=1
time=2026-04-06T09:21:39.846+01:00 level=DEBUG source=runner.go:1392 msg="dummy model load took" duration=107.6371ms
ggml_backend_cuda_device_get_memory device GPU-84f27e06-0203-3d52-a176-80d1f45cd22e utilizing NVML memory reporting free: 24246181888 total: 25769803776
time=2026-04-06T09:21:39.874+01:00 level=DEBUG source=runner.go:1397 msg="gathering device infos took" duration=27.0661ms
time=2026-04-06T09:21:39.875+01:00 level=DEBUG source=runner.go:437 msg="bootstrap discovery took" duration=289.5856ms OLLAMA_LIBRARY_PATH="[C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\lib\\ollama C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v13]" extra_envs=map[]
time=2026-04-06T09:21:39.876+01:00 level=INFO source=server.go:432 msg="starting runner" cmd="C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 64257"
time=2026-04-06T09:21:39.876+01:00 level=DEBUG source=server.go:433 msg=subprocess OLLAMA_NO_CLOUD=1 OLLAMA_KV_CACHE_TYPE=q8_0 OLLAMA_NUM_PARALLEL=1 OLLAMA_KEEP_ALIVE=30m PATH="C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\lib\\ollama;C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\rocm;C:\\WINDOWS\\system32;C:\\WINDOWS;C:\\WINDOWS\\System32\\Wbem;C:\\WINDOWS\\System32\\WindowsPowerShell\\v1.0\\;C:\\WINDOWS\\System32\\OpenSSH\\;C:\\Program Files\\Git\\cmd;C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\Docker\\Docker\\resources\\bin;C:\\Program Files\\dotnet\\;C:\\Program Files\\nodejs\\;C:\\Program Files\\CMake\\bin;C:\\Users\\patri\\AppData\\Local\\Programs\\Python\\Launcher\\;C:\\Users\\patri\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\patri\\AppData\\Local\\Programs\\Microsoft VS Code\\bin;C:\\Users\\patri\\.local\\bin;C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama;C:\\Users\\patri\\AppData\\Local\\Python\\bin;C:\\Users\\patri\\AppData\\Roaming\\npm" OLLAMA_CONTEXT_LENGTH=16384 OLLAMA_DEBUG=2 OLLAMA_MODELS=C:\Users\patri\.ollama\models OLLAMA_NUM_CTX=8192 OLLAMA_FLASH_ATTENTION=0 OLLAMA_HOST=0.0.0.0 OLLAMA_LIBRARY_PATH=C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama;C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama\rocm
time=2026-04-06T09:21:40.017+01:00 level=INFO source=runner.go:1417 msg="starting ollama engine"
time=2026-04-06T09:21:40.018+01:00 level=INFO source=runner.go:1452 msg="Server listening on 127.0.0.1:64257"
time=2026-04-06T09:21:40.021+01:00 level=DEBUG source=gguf.go:604 msg=general.architecture type=string
time=2026-04-06T09:21:40.021+01:00 level=DEBUG source=gguf.go:604 msg=tokenizer.ggml.model type=string
time=2026-04-06T09:21:40.021+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
time=2026-04-06T09:21:40.021+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
time=2026-04-06T09:21:40.021+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.file_type default=0
time=2026-04-06T09:21:40.021+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.name default=""
time=2026-04-06T09:21:40.021+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.description default=""
time=2026-04-06T09:21:40.021+01:00 level=INFO source=ggml.go:136 msg="" architecture=llama file_type=unknown name="" description="" num_tensors=0 num_key_values=3
time=2026-04-06T09:21:40.021+01:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama
load_backend: loaded CPU backend from C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll
time=2026-04-06T09:21:40.037+01:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama\rocm
ggml_cuda_init: failed to initialize ROCm: no ROCm-capable device is detected
load_backend: loaded ROCm backend from C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama\rocm\ggml-hip.dll
time=2026-04-06T09:21:40.204+01:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 compiler=cgo(clang)
time=2026-04-06T09:21:40.204+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.block_count default=0
time=2026-04-06T09:21:40.204+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.pooling_type default=0
time=2026-04-06T09:21:40.204+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.expert_count default=0
time=2026-04-06T09:21:40.204+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.tokens default="&{size:0 values:[]}"
time=2026-04-06T09:21:40.204+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.scores default="&{size:0 values:[]}"
time=2026-04-06T09:21:40.204+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.token_type default="&{size:0 values:[]}"
time=2026-04-06T09:21:40.204+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.merges default="&{size:0 values:[]}"
time=2026-04-06T09:21:40.204+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.add_bos_token default=true
time=2026-04-06T09:21:40.204+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.bos_token_id default=0
time=2026-04-06T09:21:40.204+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.add_eos_token default=false
time=2026-04-06T09:21:40.204+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.eos_token_id default=0
time=2026-04-06T09:21:40.204+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2026-04-06T09:21:40.204+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.pre default=""
time=2026-04-06T09:21:40.205+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.block_count default=0
time=2026-04-06T09:21:40.205+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.embedding_length default=0
time=2026-04-06T09:21:40.205+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.attention.head_count default=0
time=2026-04-06T09:21:40.205+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.attention.head_count_kv default=0
time=2026-04-06T09:21:40.205+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.attention.key_length default=0
time=2026-04-06T09:21:40.205+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.rope.dimension_count default=0
time=2026-04-06T09:21:40.205+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.attention.layer_norm_rms_epsilon default=0
time=2026-04-06T09:21:40.205+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.rope.freq_base default=100000
time=2026-04-06T09:21:40.205+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.rope.scaling.factor default=1
time=2026-04-06T09:21:40.205+01:00 level=DEBUG source=runner.go:1392 msg="dummy model load took" duration=185.457ms
time=2026-04-06T09:21:40.205+01:00 level=DEBUG source=runner.go:1397 msg="gathering device infos took" duration=0s
time=2026-04-06T09:21:40.207+01:00 level=DEBUG source=runner.go:437 msg="bootstrap discovery took" duration=331.8067ms OLLAMA_LIBRARY_PATH="[C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\lib\\ollama C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\rocm]" extra_envs=map[]
time=2026-04-06T09:21:40.207+01:00 level=INFO source=runner.go:106 msg="experimental Vulkan support disabled.  To enable, set OLLAMA_VULKAN=1"
time=2026-04-06T09:21:40.208+01:00 level=INFO source=server.go:432 msg="starting runner" cmd="C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 64271"
time=2026-04-06T09:21:40.208+01:00 level=DEBUG source=server.go:433 msg=subprocess OLLAMA_NO_CLOUD=1 OLLAMA_KV_CACHE_TYPE=q8_0 OLLAMA_NUM_PARALLEL=1 OLLAMA_KEEP_ALIVE=30m PATH="C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\lib\\ollama;C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v12;C:\\WINDOWS\\system32;C:\\WINDOWS;C:\\WINDOWS\\System32\\Wbem;C:\\WINDOWS\\System32\\WindowsPowerShell\\v1.0\\;C:\\WINDOWS\\System32\\OpenSSH\\;C:\\Program Files\\Git\\cmd;C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\Docker\\Docker\\resources\\bin;C:\\Program Files\\dotnet\\;C:\\Program Files\\nodejs\\;C:\\Program Files\\CMake\\bin;C:\\Users\\patri\\AppData\\Local\\Programs\\Python\\Launcher\\;C:\\Users\\patri\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\patri\\AppData\\Local\\Programs\\Microsoft VS Code\\bin;C:\\Users\\patri\\.local\\bin;C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama;C:\\Users\\patri\\AppData\\Local\\Python\\bin;C:\\Users\\patri\\AppData\\Roaming\\npm" OLLAMA_CONTEXT_LENGTH=16384 OLLAMA_DEBUG=2 OLLAMA_MODELS=C:\Users\patri\.ollama\models OLLAMA_NUM_CTX=8192 OLLAMA_FLASH_ATTENTION=0 OLLAMA_HOST=0.0.0.0 OLLAMA_LIBRARY_PATH=C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama;C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12
time=2026-04-06T09:21:40.347+01:00 level=INFO source=runner.go:1417 msg="starting ollama engine"
time=2026-04-06T09:21:40.348+01:00 level=INFO source=runner.go:1452 msg="Server listening on 127.0.0.1:64271"
time=2026-04-06T09:21:40.352+01:00 level=DEBUG source=gguf.go:604 msg=general.architecture type=string
time=2026-04-06T09:21:40.352+01:00 level=DEBUG source=gguf.go:604 msg=tokenizer.ggml.model type=string
time=2026-04-06T09:21:40.352+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
time=2026-04-06T09:21:40.352+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
time=2026-04-06T09:21:40.352+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.file_type default=0
time=2026-04-06T09:21:40.352+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.name default=""
time=2026-04-06T09:21:40.352+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.description default=""
time=2026-04-06T09:21:40.352+01:00 level=INFO source=ggml.go:136 msg="" architecture=llama file_type=unknown name="" description="" num_tensors=0 num_key_values=3
time=2026-04-06T09:21:40.352+01:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama
load_backend: loaded CPU backend from C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll
time=2026-04-06T09:21:40.369+01:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, ID: GPU-84f27e06-0203-3d52-a176-80d1f45cd22e
load_backend: loaded CUDA backend from C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12\ggml-cuda.dll
time=2026-04-06T09:21:41.100+01:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2026-04-06T09:21:41.100+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.block_count default=0
time=2026-04-06T09:21:41.101+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.pooling_type default=0
time=2026-04-06T09:21:41.101+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.expert_count default=0
time=2026-04-06T09:21:41.101+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.tokens default="&{size:0 values:[]}"
time=2026-04-06T09:21:41.101+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.scores default="&{size:0 values:[]}"
time=2026-04-06T09:21:41.101+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.token_type default="&{size:0 values:[]}"
time=2026-04-06T09:21:41.101+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.merges default="&{size:0 values:[]}"
time=2026-04-06T09:21:41.101+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.add_bos_token default=true
time=2026-04-06T09:21:41.101+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.bos_token_id default=0
time=2026-04-06T09:21:41.101+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.add_eos_token default=false
time=2026-04-06T09:21:41.101+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.eos_token_id default=0
time=2026-04-06T09:21:41.101+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2026-04-06T09:21:41.101+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.pre default=""
time=2026-04-06T09:21:41.101+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.block_count default=0
time=2026-04-06T09:21:41.101+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.embedding_length default=0
time=2026-04-06T09:21:41.101+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.attention.head_count default=0
time=2026-04-06T09:21:41.101+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.attention.head_count_kv default=0
time=2026-04-06T09:21:41.101+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.attention.key_length default=0
time=2026-04-06T09:21:41.101+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.rope.dimension_count default=0
time=2026-04-06T09:21:41.101+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.attention.layer_norm_rms_epsilon default=0
time=2026-04-06T09:21:41.101+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.rope.freq_base default=100000
time=2026-04-06T09:21:41.101+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.rope.scaling.factor default=1
time=2026-04-06T09:21:41.101+01:00 level=DEBUG source=runner.go:1392 msg="dummy model load took" duration=749.1178ms
ggml_backend_cuda_device_get_memory device GPU-84f27e06-0203-3d52-a176-80d1f45cd22e utilizing NVML memory reporting free: 24246181888 total: 25769803776
time=2026-04-06T09:21:41.127+01:00 level=DEBUG source=runner.go:1397 msg="gathering device infos took" duration=25.2978ms
time=2026-04-06T09:21:41.127+01:00 level=DEBUG source=runner.go:437 msg="bootstrap discovery took" duration=920.4301ms OLLAMA_LIBRARY_PATH="[C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\lib\\ollama C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v12]" extra_envs=map[]
time=2026-04-06T09:21:41.127+01:00 level=DEBUG source=runner.go:124 msg="evaluating which, if any, devices to filter out" initial_count=2
time=2026-04-06T09:21:41.127+01:00 level=DEBUG source=runner.go:146 msg="verifying if device is supported" library=C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama\cuda_v13 description="NVIDIA GeForce RTX 3090" compute=8.6 id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e pci_id=0000:09:00.0
time=2026-04-06T09:21:41.127+01:00 level=DEBUG source=runner.go:146 msg="verifying if device is supported" library=C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12 description="NVIDIA GeForce RTX 3090" compute=8.6 id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e pci_id=0000:09:00.0
time=2026-04-06T09:21:41.129+01:00 level=INFO source=server.go:432 msg="starting runner" cmd="C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 64287"
time=2026-04-06T09:21:41.129+01:00 level=DEBUG source=server.go:433 msg=subprocess OLLAMA_NO_CLOUD=1 OLLAMA_KV_CACHE_TYPE=q8_0 OLLAMA_NUM_PARALLEL=1 OLLAMA_KEEP_ALIVE=30m PATH="C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\lib\\ollama;C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v13;C:\\WINDOWS\\system32;C:\\WINDOWS;C:\\WINDOWS\\System32\\Wbem;C:\\WINDOWS\\System32\\WindowsPowerShell\\v1.0\\;C:\\WINDOWS\\System32\\OpenSSH\\;C:\\Program Files\\Git\\cmd;C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\Docker\\Docker\\resources\\bin;C:\\Program Files\\dotnet\\;C:\\Program Files\\nodejs\\;C:\\Program Files\\CMake\\bin;C:\\Users\\patri\\AppData\\Local\\Programs\\Python\\Launcher\\;C:\\Users\\patri\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\patri\\AppData\\Local\\Programs\\Microsoft VS Code\\bin;C:\\Users\\patri\\.local\\bin;C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama;C:\\Users\\patri\\AppData\\Local\\Python\\bin;C:\\Users\\patri\\AppData\\Roaming\\npm" OLLAMA_CONTEXT_LENGTH=16384 OLLAMA_DEBUG=2 OLLAMA_MODELS=C:\Users\patri\.ollama\models OLLAMA_NUM_CTX=8192 OLLAMA_FLASH_ATTENTION=0 OLLAMA_HOST=0.0.0.0 OLLAMA_LIBRARY_PATH=C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama;C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama\cuda_v13 CUDA_VISIBLE_DEVICES=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e GGML_CUDA_INIT=1
time=2026-04-06T09:21:41.129+01:00 level=INFO source=server.go:432 msg="starting runner" cmd="C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 64286"
time=2026-04-06T09:21:41.129+01:00 level=DEBUG source=server.go:433 msg=subprocess OLLAMA_NO_CLOUD=1 OLLAMA_KV_CACHE_TYPE=q8_0 OLLAMA_NUM_PARALLEL=1 OLLAMA_KEEP_ALIVE=30m PATH="C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\lib\\ollama;C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v12;C:\\WINDOWS\\system32;C:\\WINDOWS;C:\\WINDOWS\\System32\\Wbem;C:\\WINDOWS\\System32\\WindowsPowerShell\\v1.0\\;C:\\WINDOWS\\System32\\OpenSSH\\;C:\\Program Files\\Git\\cmd;C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\Docker\\Docker\\resources\\bin;C:\\Program Files\\dotnet\\;C:\\Program Files\\nodejs\\;C:\\Program Files\\CMake\\bin;C:\\Users\\patri\\AppData\\Local\\Programs\\Python\\Launcher\\;C:\\Users\\patri\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\patri\\AppData\\Local\\Programs\\Microsoft VS Code\\bin;C:\\Users\\patri\\.local\\bin;C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama;C:\\Users\\patri\\AppData\\Local\\Python\\bin;C:\\Users\\patri\\AppData\\Roaming\\npm" OLLAMA_CONTEXT_LENGTH=16384 OLLAMA_DEBUG=2 OLLAMA_MODELS=C:\Users\patri\.ollama\models OLLAMA_NUM_CTX=8192 OLLAMA_FLASH_ATTENTION=0 OLLAMA_HOST=0.0.0.0 OLLAMA_LIBRARY_PATH=C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama;C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12 GGML_CUDA_INIT=1 CUDA_VISIBLE_DEVICES=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e
time=2026-04-06T09:21:41.268+01:00 level=INFO source=runner.go:1417 msg="starting ollama engine"
time=2026-04-06T09:21:41.269+01:00 level=INFO source=runner.go:1452 msg="Server listening on 127.0.0.1:64287"
time=2026-04-06T09:21:41.274+01:00 level=DEBUG source=gguf.go:604 msg=general.architecture type=string
time=2026-04-06T09:21:41.274+01:00 level=DEBUG source=gguf.go:604 msg=tokenizer.ggml.model type=string
time=2026-04-06T09:21:41.274+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
time=2026-04-06T09:21:41.275+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
time=2026-04-06T09:21:41.275+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.file_type default=0
time=2026-04-06T09:21:41.275+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.name default=""
time=2026-04-06T09:21:41.275+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.description default=""
time=2026-04-06T09:21:41.275+01:00 level=INFO source=ggml.go:136 msg="" architecture=llama file_type=unknown name="" description="" num_tensors=0 num_key_values=3
time=2026-04-06T09:21:41.275+01:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama
load_backend: loaded CPU backend from C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll
time=2026-04-06T09:21:41.292+01:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama\cuda_v13
time=2026-04-06T09:21:41.347+01:00 level=INFO source=runner.go:1417 msg="starting ollama engine"
time=2026-04-06T09:21:41.348+01:00 level=INFO source=runner.go:1452 msg="Server listening on 127.0.0.1:64286"
time=2026-04-06T09:21:41.355+01:00 level=DEBUG source=gguf.go:604 msg=general.architecture type=string
time=2026-04-06T09:21:41.355+01:00 level=DEBUG source=gguf.go:604 msg=tokenizer.ggml.model type=string
time=2026-04-06T09:21:41.355+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
time=2026-04-06T09:21:41.355+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
time=2026-04-06T09:21:41.355+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.file_type default=0
time=2026-04-06T09:21:41.355+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.name default=""
time=2026-04-06T09:21:41.355+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.description default=""
time=2026-04-06T09:21:41.355+01:00 level=INFO source=ggml.go:136 msg="" architecture=llama file_type=unknown name="" description="" num_tensors=0 num_key_values=3
time=2026-04-06T09:21:41.355+01:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama
load_backend: loaded CPU backend from C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll
time=2026-04-06T09:21:41.372+01:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, ID: GPU-84f27e06-0203-3d52-a176-80d1f45cd22e
load_backend: loaded CUDA backend from C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama\cuda_v13\ggml-cuda.dll
time=2026-04-06T09:21:41.387+01:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2026-04-06T09:21:41.387+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.block_count default=0
time=2026-04-06T09:21:41.387+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.pooling_type default=0
time=2026-04-06T09:21:41.387+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.expert_count default=0
time=2026-04-06T09:21:41.387+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.tokens default="&{size:0 values:[]}"
time=2026-04-06T09:21:41.387+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.scores default="&{size:0 values:[]}"
time=2026-04-06T09:21:41.387+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.token_type default="&{size:0 values:[]}"
time=2026-04-06T09:21:41.387+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.merges default="&{size:0 values:[]}"
time=2026-04-06T09:21:41.387+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.add_bos_token default=true
time=2026-04-06T09:21:41.387+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.bos_token_id default=0
time=2026-04-06T09:21:41.387+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.add_eos_token default=false
time=2026-04-06T09:21:41.387+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.eos_token_id default=0
time=2026-04-06T09:21:41.387+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2026-04-06T09:21:41.387+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.pre default=""
time=2026-04-06T09:21:41.387+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.block_count default=0
time=2026-04-06T09:21:41.387+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.embedding_length default=0
time=2026-04-06T09:21:41.387+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.attention.head_count default=0
time=2026-04-06T09:21:41.387+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.attention.head_count_kv default=0
time=2026-04-06T09:21:41.387+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.attention.key_length default=0
time=2026-04-06T09:21:41.387+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.rope.dimension_count default=0
time=2026-04-06T09:21:41.387+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.attention.layer_norm_rms_epsilon default=0
time=2026-04-06T09:21:41.387+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.rope.freq_base default=100000
time=2026-04-06T09:21:41.387+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.rope.scaling.factor default=1
time=2026-04-06T09:21:41.387+01:00 level=DEBUG source=runner.go:1392 msg="dummy model load took" duration=113.6651ms
ggml_backend_cuda_device_get_memory device GPU-84f27e06-0203-3d52-a176-80d1f45cd22e utilizing NVML memory reporting free: 23978442752 total: 25769803776
time=2026-04-06T09:21:41.407+01:00 level=DEBUG source=runner.go:1397 msg="gathering device infos took" duration=19.8969ms
time=2026-04-06T09:21:41.409+01:00 level=DEBUG source=runner.go:437 msg="bootstrap discovery took" duration=281.7852ms OLLAMA_LIBRARY_PATH="[C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\lib\\ollama C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v13]" extra_envs="map[CUDA_VISIBLE_DEVICES:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e GGML_CUDA_INIT:1]"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, ID: GPU-84f27e06-0203-3d52-a176-80d1f45cd22e
load_backend: loaded CUDA backend from C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12\ggml-cuda.dll
time=2026-04-06T09:21:41.475+01:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2026-04-06T09:21:41.475+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.block_count default=0
time=2026-04-06T09:21:41.476+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.pooling_type default=0
time=2026-04-06T09:21:41.476+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.expert_count default=0
time=2026-04-06T09:21:41.476+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.tokens default="&{size:0 values:[]}"
time=2026-04-06T09:21:41.476+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.scores default="&{size:0 values:[]}"
time=2026-04-06T09:21:41.476+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.token_type default="&{size:0 values:[]}"
time=2026-04-06T09:21:41.476+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.merges default="&{size:0 values:[]}"
time=2026-04-06T09:21:41.476+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.add_bos_token default=true
time=2026-04-06T09:21:41.476+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.bos_token_id default=0
time=2026-04-06T09:21:41.476+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.add_eos_token default=false
time=2026-04-06T09:21:41.476+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.eos_token_id default=0
time=2026-04-06T09:21:41.476+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2026-04-06T09:21:41.477+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.pre default=""
time=2026-04-06T09:21:41.477+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.block_count default=0
time=2026-04-06T09:21:41.477+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.embedding_length default=0
time=2026-04-06T09:21:41.477+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.attention.head_count default=0
time=2026-04-06T09:21:41.477+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.attention.head_count_kv default=0
time=2026-04-06T09:21:41.477+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.attention.key_length default=0
time=2026-04-06T09:21:41.477+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.rope.dimension_count default=0
time=2026-04-06T09:21:41.477+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.attention.layer_norm_rms_epsilon default=0
time=2026-04-06T09:21:41.477+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.rope.freq_base default=100000
time=2026-04-06T09:21:41.477+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.rope.scaling.factor default=1
time=2026-04-06T09:21:41.477+01:00 level=DEBUG source=runner.go:1392 msg="dummy model load took" duration=122.4227ms
ggml_backend_cuda_device_get_memory device GPU-84f27e06-0203-3d52-a176-80d1f45cd22e utilizing NVML memory reporting free: 24246181888 total: 25769803776
time=2026-04-06T09:21:41.495+01:00 level=DEBUG source=runner.go:1397 msg="gathering device infos took" duration=18.8453ms
time=2026-04-06T09:21:41.496+01:00 level=DEBUG source=runner.go:437 msg="bootstrap discovery took" duration=369.5165ms OLLAMA_LIBRARY_PATH="[C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\lib\\ollama C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v12]" extra_envs="map[CUDA_VISIBLE_DEVICES:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e GGML_CUDA_INIT:1]"
time=2026-04-06T09:21:41.496+01:00 level=DEBUG source=runner.go:401 msg="filtering device with overlapping libraries" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12 delete_index=1 kept_library=C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama\cuda_v13
time=2026-04-06T09:21:41.496+01:00 level=DEBUG source=runner.go:40 msg="GPU bootstrap discovery took" duration=1.917119s
time=2026-04-06T09:21:41.496+01:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e filter_id="" library=CUDA compute=8.6 name=CUDA0 description="NVIDIA GeForce RTX 3090" libdirs=ollama,cuda_v13 driver=13.1 pci_id=0000:09:00.0 type=discrete total="24.0 GiB" available="22.6 GiB"
time=2026-04-06T09:21:41.496+01:00 level=INFO source=routes.go:1852 msg="vram-based default context" total_vram="24.0 GiB" default_num_ctx=32768
[GIN] 2026/04/06 - 09:21:41 | 200 |            0s |       127.0.0.1 | GET      "/api/version"
[GIN] 2026/04/06 - 09:21:41 | 200 |            0s |       127.0.0.1 | GET      "/api/version"
[GIN] 2026/04/06 - 09:21:41 | 200 |            0s |       127.0.0.1 | GET      "/api/version"
time=2026-04-06T09:21:41.798+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
[GIN] 2026/04/06 - 09:21:41 | 200 |    308.0272ms |       127.0.0.1 | POST     "/api/show"
time=2026-04-06T09:25:55.225+01:00 level=DEBUG source=runner.go:264 msg="refreshing free memory"
time=2026-04-06T09:25:55.225+01:00 level=DEBUG source=runner.go:328 msg="unable to refresh all GPUs with existing runners, performing bootstrap discovery"
time=2026-04-06T09:25:55.234+01:00 level=INFO source=server.go:432 msg="starting runner" cmd="C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 51196"
time=2026-04-06T09:25:55.234+01:00 level=DEBUG source=server.go:433 msg=subprocess OLLAMA_NO_CLOUD=1 OLLAMA_KV_CACHE_TYPE=q8_0 OLLAMA_NUM_PARALLEL=1 OLLAMA_KEEP_ALIVE=30m PATH="C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\lib\\ollama;C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v13;C:\\WINDOWS\\system32;C:\\WINDOWS;C:\\WINDOWS\\System32\\Wbem;C:\\WINDOWS\\System32\\WindowsPowerShell\\v1.0\\;C:\\WINDOWS\\System32\\OpenSSH\\;C:\\Program Files\\Git\\cmd;C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\Docker\\Docker\\resources\\bin;C:\\Program Files\\dotnet\\;C:\\Program Files\\nodejs\\;C:\\Program Files\\CMake\\bin;C:\\Users\\patri\\AppData\\Local\\Programs\\Python\\Launcher\\;C:\\Users\\patri\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\patri\\AppData\\Local\\Programs\\Microsoft VS Code\\bin;C:\\Users\\patri\\.local\\bin;C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama;C:\\Users\\patri\\AppData\\Local\\Python\\bin;C:\\Users\\patri\\AppData\\Roaming\\npm" OLLAMA_CONTEXT_LENGTH=16384 OLLAMA_DEBUG=2 OLLAMA_MODELS=C:\Users\patri\.ollama\models OLLAMA_NUM_CTX=8192 OLLAMA_FLASH_ATTENTION=0 OLLAMA_HOST=0.0.0.0 OLLAMA_LIBRARY_PATH=C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama;C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama\cuda_v13
time=2026-04-06T09:25:55.388+01:00 level=INFO source=runner.go:1417 msg="starting ollama engine"
time=2026-04-06T09:25:55.389+01:00 level=INFO source=runner.go:1452 msg="Server listening on 127.0.0.1:51196"
time=2026-04-06T09:25:55.399+01:00 level=DEBUG source=gguf.go:604 msg=general.architecture type=string
time=2026-04-06T09:25:55.399+01:00 level=DEBUG source=gguf.go:604 msg=tokenizer.ggml.model type=string
time=2026-04-06T09:25:55.400+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
time=2026-04-06T09:25:55.400+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
time=2026-04-06T09:25:55.400+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.file_type default=0
time=2026-04-06T09:25:55.400+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.name default=""
time=2026-04-06T09:25:55.400+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.description default=""
time=2026-04-06T09:25:55.400+01:00 level=INFO source=ggml.go:136 msg="" architecture=llama file_type=unknown name="" description="" num_tensors=0 num_key_values=3
time=2026-04-06T09:25:55.400+01:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama
load_backend: loaded CPU backend from C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll
time=2026-04-06T09:25:55.416+01:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama\cuda_v13
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, ID: GPU-84f27e06-0203-3d52-a176-80d1f45cd22e
load_backend: loaded CUDA backend from C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama\cuda_v13\ggml-cuda.dll
time=2026-04-06T09:25:55.505+01:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2026-04-06T09:25:55.505+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.block_count default=0
time=2026-04-06T09:25:55.505+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.pooling_type default=0
time=2026-04-06T09:25:55.505+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.expert_count default=0
time=2026-04-06T09:25:55.505+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.tokens default="&{size:0 values:[]}"
time=2026-04-06T09:25:55.505+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.scores default="&{size:0 values:[]}"
time=2026-04-06T09:25:55.505+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.token_type default="&{size:0 values:[]}"
time=2026-04-06T09:25:55.505+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.merges default="&{size:0 values:[]}"
time=2026-04-06T09:25:55.505+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.add_bos_token default=true
time=2026-04-06T09:25:55.505+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.bos_token_id default=0
time=2026-04-06T09:25:55.505+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.add_eos_token default=false
time=2026-04-06T09:25:55.505+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.eos_token_id default=0
time=2026-04-06T09:25:55.505+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2026-04-06T09:25:55.505+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.pre default=""
time=2026-04-06T09:25:55.505+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.block_count default=0
time=2026-04-06T09:25:55.505+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.embedding_length default=0
time=2026-04-06T09:25:55.505+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.attention.head_count default=0
time=2026-04-06T09:25:55.505+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.attention.head_count_kv default=0
time=2026-04-06T09:25:55.505+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.attention.key_length default=0
time=2026-04-06T09:25:55.505+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.rope.dimension_count default=0
time=2026-04-06T09:25:55.505+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.attention.layer_norm_rms_epsilon default=0
time=2026-04-06T09:25:55.505+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.rope.freq_base default=100000
time=2026-04-06T09:25:55.505+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=llama.rope.scaling.factor default=1
time=2026-04-06T09:25:55.505+01:00 level=DEBUG source=runner.go:1392 msg="dummy model load took" duration=106.2298ms
ggml_backend_cuda_device_get_memory device GPU-84f27e06-0203-3d52-a176-80d1f45cd22e utilizing NVML memory reporting free: 24251162624 total: 25769803776
time=2026-04-06T09:25:55.525+01:00 level=DEBUG source=runner.go:1397 msg="gathering device infos took" duration=20.3965ms
time=2026-04-06T09:25:55.526+01:00 level=DEBUG source=runner.go:437 msg="bootstrap discovery took" duration=301.1449ms OLLAMA_LIBRARY_PATH="[C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\lib\\ollama C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v13]" extra_envs=map[]
time=2026-04-06T09:25:55.526+01:00 level=DEBUG source=runner.go:40 msg="overall device VRAM discovery took" duration=301.1449ms
time=2026-04-06T09:25:55.526+01:00 level=INFO source=cpu_windows.go:148 msg=packages count=1
time=2026-04-06T09:25:55.527+01:00 level=INFO source=cpu_windows.go:195 msg="" package=0 cores=8 efficiency=0 threads=16
time=2026-04-06T09:25:55.527+01:00 level=DEBUG source=sched.go:220 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=3 gpu_count=1
time=2026-04-06T09:25:55.527+01:00 level=DEBUG source=sched.go:229 msg="loading first model" model=C:\Users\patri\.ollama\models\blobs\sha256-280af6832eca23cb322c4dcc65edfea98a21b8f8ab07dc7553bd6f7e6e7a3313
time=2026-04-06T09:25:55.657+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
time=2026-04-06T09:25:55.730+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
time=2026-04-06T09:25:55.730+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.pooling_type default=0
time=2026-04-06T09:25:55.730+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.eot_token_id default=106
time=2026-04-06T09:25:55.730+01:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
time=2026-04-06T09:25:55.730+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.attention.global_head_count_kv default=0
time=2026-04-06T09:25:55.730+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.expert_count default=0
time=2026-04-06T09:25:55.730+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.expert_used_count default=0
time=2026-04-06T09:25:55.730+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.audio.block_count default=0
time=2026-04-06T09:25:55.730+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.audio.embedding_length default=0
time=2026-04-06T09:25:55.730+01:00 level=WARN source=server.go:258 msg="quantized kv cache requested but flash attention disabled" type=q8_0
time=2026-04-06T09:25:55.730+01:00 level=INFO source=server.go:432 msg="starting runner" cmd="C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model C:\\Users\\patri\\.ollama\\models\\blobs\\sha256-280af6832eca23cb322c4dcc65edfea98a21b8f8ab07dc7553bd6f7e6e7a3313 --port 51213"
time=2026-04-06T09:25:55.730+01:00 level=DEBUG source=server.go:433 msg=subprocess OLLAMA_NO_CLOUD=1 OLLAMA_KV_CACHE_TYPE=q8_0 OLLAMA_NUM_PARALLEL=1 OLLAMA_KEEP_ALIVE=30m PATH="C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\lib\\ollama;C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v13;C:\\WINDOWS\\system32;C:\\WINDOWS;C:\\WINDOWS\\System32\\Wbem;C:\\WINDOWS\\System32\\WindowsPowerShell\\v1.0\\;C:\\WINDOWS\\System32\\OpenSSH\\;C:\\Program Files\\Git\\cmd;C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\Docker\\Docker\\resources\\bin;C:\\Program Files\\dotnet\\;C:\\Program Files\\nodejs\\;C:\\Program Files\\CMake\\bin;C:\\Users\\patri\\AppData\\Local\\Programs\\Python\\Launcher\\;C:\\Users\\patri\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\patri\\AppData\\Local\\Programs\\Microsoft VS Code\\bin;C:\\Users\\patri\\.local\\bin;C:\\Users\\patri\\AppData\\Local\\Programs\\Ollama;C:\\Users\\patri\\AppData\\Local\\Python\\bin;C:\\Users\\patri\\AppData\\Roaming\\npm" OLLAMA_CONTEXT_LENGTH=16384 OLLAMA_DEBUG=2 OLLAMA_MODELS=C:\Users\patri\.ollama\models OLLAMA_NUM_CTX=8192 OLLAMA_FLASH_ATTENTION=0 OLLAMA_HOST=0.0.0.0 OLLAMA_LIBRARY_PATH=C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama;C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama\cuda_v13
time=2026-04-06T09:25:55.739+01:00 level=INFO source=sched.go:484 msg="system memory" total="15.9 GiB" free="10.2 GiB" free_swap="19.7 GiB"
time=2026-04-06T09:25:55.739+01:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA available="22.1 GiB" free="22.6 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-04-06T09:25:55.739+01:00 level=INFO source=server.go:759 msg="loading model" "model layers"=61 requested=-1
time=2026-04-06T09:25:55.863+01:00 level=INFO source=runner.go:1417 msg="starting ollama engine"
time=2026-04-06T09:25:55.864+01:00 level=INFO source=runner.go:1452 msg="Server listening on 127.0.0.1:51213"
time=2026-04-06T09:25:55.875+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:61[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:61(0..60)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-06T09:25:55.950+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
time=2026-04-06T09:25:55.955+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.name default=""
time=2026-04-06T09:25:55.955+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.description default=""
time=2026-04-06T09:25:55.955+01:00 level=INFO source=ggml.go:136 msg="" architecture=gemma4 file_type=Q4_K_M name="" description="" num_tensors=1189 num_key_values=49
time=2026-04-06T09:25:55.955+01:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama
load_backend: loaded CPU backend from C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll
time=2026-04-06T09:25:55.972+01:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama\cuda_v13
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, ID: GPU-84f27e06-0203-3d52-a176-80d1f45cd22e
load_backend: loaded CUDA backend from C:\Users\patri\AppData\Local\Programs\Ollama\lib\ollama\cuda_v13\ggml-cuda.dll
time=2026-04-06T09:25:56.075+01:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2026-04-06T09:25:56.095+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.pooling_type default=0
time=2026-04-06T09:25:56.095+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.eot_token_id default=106
time=2026-04-06T09:25:56.095+01:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
time=2026-04-06T09:25:56.095+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.attention.global_head_count_kv default=0
time=2026-04-06T09:25:56.095+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.expert_count default=0
time=2026-04-06T09:25:56.095+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.expert_used_count default=0
time=2026-04-06T09:25:56.095+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.audio.block_count default=0
time=2026-04-06T09:25:56.095+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.audio.embedding_length default=0
time=2026-04-06T09:25:56.125+01:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=3.1229ms bounds=(0,0)-(2048,2048)
time=2026-04-06T09:25:56.269+01:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=144.0074ms size="[768 768]"
time=2026-04-06T09:25:56.269+01:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
time=2026-04-06T09:25:56.269+01:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
time=2026-04-06T09:25:56.274+01:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=152.1374ms shape="[5376 256]"
time=2026-04-06T09:25:56.381+01:00 level=DEBUG source=ggml.go:852 msg="compute graph" nodes=1272 splits=1
time=2026-04-06T09:25:56.536+01:00 level=DEBUG source=ggml.go:852 msg="compute graph" nodes=2752 splits=2
time=2026-04-06T09:25:56.583+01:00 level=DEBUG source=ggml.go:852 msg="compute graph" nodes=2750 splits=2
time=2026-04-06T09:25:56.588+01:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="18.4 GiB"
time=2026-04-06T09:25:56.588+01:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="1.2 GiB"
time=2026-04-06T09:25:56.588+01:00 level=DEBUG source=device.go:251 msg="kv cache" device=CUDA0 size="4.8 GiB"
time=2026-04-06T09:25:56.588+01:00 level=DEBUG source=device.go:262 msg="compute graph" device=CUDA0 size="1.7 GiB"
time=2026-04-06T09:25:56.588+01:00 level=DEBUG source=device.go:267 msg="compute graph" device=CPU size="10.5 MiB"
time=2026-04-06T09:25:56.588+01:00 level=DEBUG source=device.go:272 msg="total memory" size="26.0 GiB"
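
For what it's worth, the `device.go` lines at the end of this log already show why the fit fails: the planner wants 18.4 GiB (weights) + 4.8 GiB (KV cache) + 1.7 GiB (compute graph) on CUDA0, i.e. 24.9 GiB, against only 22.1 GiB reported available, before counting the 1.2 GiB of CPU-side weights. Below is a minimal sketch of that arithmetic (Go; the sizes are hard-coded from the log above, and this is only the naive per-device sum, not Ollama's actual planner):

```go
package main

import "fmt"

func main() {
	// Sizes in GiB, copied from the device.go lines above (log values are rounded).
	gpuWeights := 18.4      // "model weights" device=CUDA0
	gpuKV := 4.8            // "kv cache"      device=CUDA0
	gpuGraph := 1.7         // "compute graph" device=CUDA0
	cpuWeights := 1.2       // "model weights" device=CPU
	cpuGraph := 10.5 / 1024 // "compute graph" device=CPU (10.5 MiB)

	gpuAvailable := 22.1 // "gpu memory ... available" from sched.go:491

	gpuDemand := gpuWeights + gpuKV + gpuGraph
	fmt.Printf("GPU demand: %.1f GiB vs %.1f GiB available (short by %.1f GiB)\n",
		gpuDemand, gpuAvailable, gpuDemand-gpuAvailable)
	fmt.Printf("planned total: %.1f GiB (log reports 26.0 GiB; difference is rounding)\n",
		gpuDemand+cpuWeights+cpuGraph)
}
```

So at this context length the plan is roughly 2.8 GiB over the GPU budget even before any ~20 MB CPU staging buffer is requested, which would be consistent with the planner spilling layers to a host that only has 15.9 GiB of RAM.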

preprocess" elapsed=144.0074ms size="[768 768]" time=2026-04-06T09:25:56.269+01:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3 time=2026-04-06T09:25:56.269+01:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16 time=2026-04-06T09:25:56.274+01:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=152.1374ms shape="[5376 256]" time=2026-04-06T09:25:56.381+01:00 level=DEBUG source=ggml.go:852 msg="compute graph" nodes=1272 splits=1 time=2026-04-06T09:25:56.536+01:00 level=DEBUG source=ggml.go:852 msg="compute graph" nodes=2752 splits=2 time=2026-04-06T09:25:56.583+01:00 level=DEBUG source=ggml.go:852 msg="compute graph" nodes=2750 splits=2 time=2026-04-06T09:25:56.588+01:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="18.4 GiB" time=2026-04-06T09:25:56.588+01:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="1.2 GiB" time=2026-04-06T09:25:56.588+01:00 level=DEBUG source=device.go:251 msg="kv cache" device=CUDA0 size="4.8 GiB" time=2026-04-06T09:25:56.588+01:00 level=DEBUG source=device.go:262 msg="compute graph" device=CUDA0 size="1.7 GiB" time=2026-04-06T09:25:56.588+01:00 level=DEBUG source=device.go:267 msg="compute graph" device=CPU size="10.5 MiB" time=2026-04-06T09:25:56.588+01:00 level=DEBUG source=device.go:272 msg="total memory" size="26.0 GiB" ```
Author
Owner

@Issueposter commented on GitHub (Apr 6, 2026):

Complete server.log with OLLAMA_DEBUG=2 — part 2/2

time=2026-04-06T09:25:56.588+01:00 level=DEBUG source=server.go:784 msg=memory success=true required.InputWeights=1250426880 required.CPU.Graph=11010048 required.CUDA0.ID=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e required.CUDA0.Weights="[304974208 304974208 304974208 304974208 304974208 330264704 304974208 275169664 275169664 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 304974208 330264704 299297152 299297152 304974208 299297152 299297152 330264704 2260644352]" required.CUDA0.Cache="[75497472 75497472 75497472 75497472 75497472 134217728 75497472 75497472 75497472 75497472 75497472 134217728 75497472 75497472 75497472 75497472 75497472 134217728 75497472 75497472 75497472 75497472 75497472 134217728 75497472 75497472 75497472 75497472 75497472 134217728 75497472 75497472 75497472 75497472 75497472 134217728 75497472 75497472 75497472 75497472 75497472 134217728 75497472 75497472 75497472 75497472 75497472 134217728 75497472 75497472 75497472 75497472 75497472 134217728 75497472 75497472 75497472 75497472 75497472 134217728 0]" required.CUDA0.Graph=1778129024
time=2026-04-06T09:25:56.589+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="20.5 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="1.7 GiB"
time=2026-04-06T09:25:56.589+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers="58[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:58(2..59)]"
time=2026-04-06T09:25:56.589+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:58[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:58(2..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-06T09:25:56.659+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
time=2026-04-06T09:25:56.674+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.pooling_type default=0
time=2026-04-06T09:25:56.674+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.eot_token_id default=106
time=2026-04-06T09:25:56.674+01:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
time=2026-04-06T09:25:56.674+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.attention.global_head_count_kv default=0
time=2026-04-06T09:25:56.674+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.expert_count default=0
time=2026-04-06T09:25:56.674+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.expert_used_count default=0
time=2026-04-06T09:25:56.674+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.audio.block_count default=0
time=2026-04-06T09:25:56.674+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.audio.embedding_length default=0
time=2026-04-06T09:25:56.693+01:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=526µs bounds=(0,0)-(2048,2048)
time=2026-04-06T09:25:56.818+01:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=125.1303ms size="[768 768]"
time=2026-04-06T09:25:56.818+01:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
time=2026-04-06T09:25:56.818+01:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
time=2026-04-06T09:25:56.818+01:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=125.6563ms shape="[5376 256]"
time=2026-04-06T09:25:56.818+01:00 level=DEBUG source=ggml.go:852 msg="compute graph" nodes=1272 splits=355
time=2026-04-06T09:25:56.900+01:00 level=DEBUG source=ggml.go:852 msg="compute graph" nodes=2752 splits=35
time=2026-04-06T09:25:56.908+01:00 level=DEBUG source=ggml.go:852 msg="compute graph" nodes=2750 splits=3
time=2026-04-06T09:25:56.909+01:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="15.7 GiB"
time=2026-04-06T09:25:56.909+01:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="3.8 GiB"
time=2026-04-06T09:25:56.909+01:00 level=DEBUG source=device.go:251 msg="kv cache" device=CUDA0 size="4.6 GiB"
time=2026-04-06T09:25:56.909+01:00 level=DEBUG source=device.go:256 msg="kv cache" device=CPU size="144.0 MiB"
time=2026-04-06T09:25:56.909+01:00 level=DEBUG source=device.go:262 msg="compute graph" device=CUDA0 size="1.7 GiB"
time=2026-04-06T09:25:56.909+01:00 level=DEBUG source=device.go:267 msg="compute graph" device=CPU size="16.0 MiB"
time=2026-04-06T09:25:56.909+01:00 level=DEBUG source=device.go:272 msg="total memory" size="26.0 GiB"
time=2026-04-06T09:25:56.909+01:00 level=DEBUG source=server.go:784 msg=memory success=true required.InputWeights=1250426880 required.CPU.Weights="[304972832 304972832 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2260638912]" required.CPU.Cache="[75497472 75497472 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]" required.CPU.Graph=16777216 required.CUDA0.ID=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e required.CUDA0.Weights="[0 0 304974208 304974208 304974208 330264704 304974208 275169664 275169664 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 304974208 330264704 299297152 299297152 304974208 299297152 299297152 330264704 0]" required.CUDA0.Cache="[0 0 75497472 75497472 75497472 134217728 75497472 75497472 75497472 75497472 75497472 134217728 75497472 75497472 75497472 75497472 75497472 134217728 75497472 75497472 75497472 75497472 75497472 134217728 75497472 75497472 75497472 75497472 75497472 134217728 75497472 75497472 75497472 75497472 75497472 134217728 75497472 75497472 75497472 75497472 75497472 134217728 75497472 75497472 75497472 75497472 75497472 134217728 75497472 75497472 75497472 75497472 75497472 134217728 75497472 75497472 75497472 75497472 75497472 134217728 0]" required.CUDA0.Graph=1779960832
time=2026-04-06T09:25:56.910+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="20.5 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="1.7 GiB"
time=2026-04-06T09:25:56.910+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers="58[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:58(2..59)]"
time=2026-04-06T09:25:56.910+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:58[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:58(2..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-06T09:25:56.979+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 16105.95 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 16888315136
time=2026-04-06T09:26:11.721+01:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="15.7 GiB"
time=2026-04-06T09:26:11.721+01:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="3.8 GiB"
time=2026-04-06T09:26:11.721+01:00 level=DEBUG source=device.go:272 msg="total memory" size="19.6 GiB"
time=2026-04-06T09:26:11.721+01:00 level=DEBUG source=server.go:784 msg=memory success=false required.InputWeights=1250426880 required.CPU.Weights="[304972832 304972832 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2260638912]" required.CUDA0.ID=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e required.CUDA0.Weights="[0 0 304974208 304974208 304974208 330264704 304974208 275169664 275169664 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 304974208 330264704 299297152 299297152 304974208 299297152 299297152 330264704 0]"
time=2026-04-06T09:26:11.722+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="22.1 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="0 B"
time=2026-04-06T09:26:11.722+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers="61[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:61(0..60)]"
time=2026-04-06T09:26:11.722+01:00 level=DEBUG source=server.go:820 msg="exploring intermediate layers" layer=60
time=2026-04-06T09:26:11.722+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="22.1 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="0 B"
time=2026-04-06T09:26:11.723+01:00 level=DEBUG source=server.go:828 msg="new layout created" layers="60[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:60(0..59)]"
time=2026-04-06T09:26:11.723+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:60[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:60(0..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-06T09:26:11.865+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 16687.64 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 17498263552
time=2026-04-06T09:26:22.179+01:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="16.3 GiB"
time=2026-04-06T09:26:22.179+01:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="3.3 GiB"
time=2026-04-06T09:26:22.179+01:00 level=DEBUG source=device.go:272 msg="total memory" size="19.6 GiB"
time=2026-04-06T09:26:22.179+01:00 level=DEBUG source=server.go:837 msg=memory success=false required.InputWeights=1250426880 required.CPU.Weights="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2260638912]" required.CUDA0.ID=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e required.CUDA0.Weights="[304974208 304974208 304974208 304974208 304974208 330264704 304974208 275169664 275169664 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 304974208 330264704 299297152 299297152 304974208 299297152 299297152 330264704 0]"
time=2026-04-06T09:26:22.179+01:00 level=DEBUG source=server.go:820 msg="exploring intermediate layers" layer=59
time=2026-04-06T09:26:22.179+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="22.1 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="0 B"
time=2026-04-06T09:26:22.180+01:00 level=DEBUG source=server.go:828 msg="new layout created" layers="59[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:59(1..59)]"
time=2026-04-06T09:26:22.180+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:59[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:59(1..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-06T09:26:22.315+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 16396.80 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 17193289344
time=2026-04-06T09:26:33.602+01:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="16.0 GiB"
time=2026-04-06T09:26:33.603+01:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="3.6 GiB"
time=2026-04-06T09:26:33.603+01:00 level=DEBUG source=device.go:272 msg="total memory" size="19.6 GiB"
time=2026-04-06T09:26:33.603+01:00 level=DEBUG source=server.go:837 msg=memory success=false required.InputWeights=1250426880 required.CPU.Weights="[304972832 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2260638912]" required.CUDA0.ID=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e required.CUDA0.Weights="[0 304974208 304974208 304974208 304974208 330264704 304974208 275169664 275169664 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 304974208 330264704 299297152 299297152 304974208 299297152 299297152 330264704 0]"
time=2026-04-06T09:26:33.603+01:00 level=DEBUG source=server.go:820 msg="exploring intermediate layers" layer=58
time=2026-04-06T09:26:33.603+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="22.1 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="0 B"
time=2026-04-06T09:26:33.603+01:00 level=DEBUG source=server.go:828 msg="new layout created" layers="58[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:58(2..59)]"
time=2026-04-06T09:26:33.604+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:58[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:58(2..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-06T09:26:33.735+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 16105.95 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 16888315136
time=2026-04-06T09:26:46.033+01:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="15.7 GiB"
time=2026-04-06T09:26:46.033+01:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="3.8 GiB"
time=2026-04-06T09:26:46.033+01:00 level=DEBUG source=device.go:272 msg="total memory" size="19.6 GiB"
time=2026-04-06T09:26:46.034+01:00 level=DEBUG source=server.go:837 msg=memory success=false required.InputWeights=1250426880 required.CPU.Weights="[304972832 304972832 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2260638912]" required.CUDA0.ID=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e required.CUDA0.Weights="[0 0 304974208 304974208 304974208 330264704 304974208 275169664 275169664 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 304974208 330264704 299297152 299297152 304974208 299297152 299297152 330264704 0]"
time=2026-04-06T09:26:46.034+01:00 level=INFO source=server.go:881 msg="model layout did not fit, applying backoff" backoff=0.10
time=2026-04-06T09:26:46.034+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="19.9 GiB" backoff=0.10 minimum="457.0 MiB" overhead="0 B" graph="0 B"
time=2026-04-06T09:26:46.034+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers="61[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:61(0..60)]"
time=2026-04-06T09:26:46.034+01:00 level=DEBUG source=server.go:820 msg="exploring intermediate layers" layer=60
time=2026-04-06T09:26:46.034+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="19.9 GiB" backoff=0.10 minimum="457.0 MiB" overhead="0 B" graph="0 B"
time=2026-04-06T09:26:46.035+01:00 level=DEBUG source=server.go:828 msg="new layout created" layers="60[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:60(0..59)]"
time=2026-04-06T09:26:46.035+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:60[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:60(0..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-06T09:26:46.172+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 16687.64 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 17498263552
time=2026-04-06T09:26:56.511+01:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="16.3 GiB"
time=2026-04-06T09:26:56.511+01:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="3.3 GiB"
time=2026-04-06T09:26:56.511+01:00 level=DEBUG source=device.go:272 msg="total memory" size="19.6 GiB"
time=2026-04-06T09:26:56.511+01:00 level=DEBUG source=server.go:837 msg=memory success=false required.InputWeights=1250426880 required.CPU.Weights="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2260638912]" required.CUDA0.ID=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e required.CUDA0.Weights="[304974208 304974208 304974208 304974208 304974208 330264704 304974208 275169664 275169664 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 304974208 330264704 299297152 299297152 304974208 299297152 299297152 330264704 0]"
time=2026-04-06T09:26:56.526+01:00 level=DEBUG source=server.go:820 msg="exploring intermediate layers" layer=59
time=2026-04-06T09:26:56.527+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="19.9 GiB" backoff=0.10 minimum="457.0 MiB" overhead="0 B" graph="0 B"
time=2026-04-06T09:26:56.527+01:00 level=DEBUG source=server.go:828 msg="new layout created" layers="59[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:59(1..59)]"
time=2026-04-06T09:26:56.527+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:59[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:59(1..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-06T09:26:56.660+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 16396.80 MiB on device 0: cudaMalloc failed: out of memory
time=2026-04-06T09:27:06.935+01:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="16.0 GiB"
time=2026-04-06T09:27:06.935+01:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="3.6 GiB"
alloc_tensor_range: failed to allocate CUDA0 buffer of size 17193289344
time=2026-04-06T09:27:06.935+01:00 level=DEBUG source=device.go:272 msg="total memory" size="19.6 GiB"
time=2026-04-06T09:27:06.935+01:00 level=DEBUG source=server.go:837 msg=memory success=false required.InputWeights=1250426880 required.CPU.Weights="[304972832 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2260638912]" required.CUDA0.ID=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e required.CUDA0.Weights="[0 304974208 304974208 304974208 304974208 330264704 304974208 275169664 275169664 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 304974208 330264704 299297152 299297152 304974208 299297152 299297152 330264704 0]"
time=2026-04-06T09:27:06.935+01:00 level=DEBUG source=server.go:820 msg="exploring intermediate layers" layer=58
time=2026-04-06T09:27:06.936+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="19.9 GiB" backoff=0.10 minimum="457.0 MiB" overhead="0 B" graph="0 B"
time=2026-04-06T09:27:06.936+01:00 level=DEBUG source=server.go:828 msg="new layout created" layers="58[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:58(2..59)]"
time=2026-04-06T09:27:06.936+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:58[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:58(2..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-06T09:27:07.073+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 16105.95 MiB on device 0: cudaMalloc failed: out of memory
time=2026-04-06T09:27:17.295+01:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="15.7 GiB"
time=2026-04-06T09:27:17.295+01:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="3.8 GiB"
alloc_tensor_range: failed to allocate CUDA0 buffer of size 16888315136
time=2026-04-06T09:27:17.295+01:00 level=DEBUG source=device.go:272 msg="total memory" size="19.6 GiB"
time=2026-04-06T09:27:17.295+01:00 level=DEBUG source=server.go:837 msg=memory success=false required.InputWeights=1250426880 required.CPU.Weights="[304972832 304972832 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2260638912]" required.CUDA0.ID=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e required.CUDA0.Weights="[0 0 304974208 304974208 304974208 330264704 304974208 275169664 275169664 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 304974208 330264704 299297152 299297152 304974208 299297152 299297152 330264704 0]"
time=2026-04-06T09:27:17.296+01:00 level=INFO source=server.go:881 msg="model layout did not fit, applying backoff" backoff=0.20
time=2026-04-06T09:27:17.296+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="17.6 GiB" backoff=0.20 minimum="457.0 MiB" overhead="0 B" graph="0 B"
time=2026-04-06T09:27:17.297+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers="60[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:60(0..59)]"
time=2026-04-06T09:27:17.297+01:00 level=DEBUG source=server.go:820 msg="exploring intermediate layers" layer=59
time=2026-04-06T09:27:17.297+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="17.6 GiB" backoff=0.20 minimum="457.0 MiB" overhead="0 B" graph="0 B"
time=2026-04-06T09:27:17.297+01:00 level=DEBUG source=server.go:828 msg="new layout created" layers="59[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:59(1..59)]"
time=2026-04-06T09:27:17.297+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:59[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:59(1..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-06T09:27:17.431+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 16396.80 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 17193289344
time=2026-04-06T09:27:27.655+01:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="16.0 GiB"
time=2026-04-06T09:27:27.655+01:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="3.6 GiB"
time=2026-04-06T09:27:27.655+01:00 level=DEBUG source=device.go:272 msg="total memory" size="19.6 GiB"
time=2026-04-06T09:27:27.655+01:00 level=DEBUG source=server.go:837 msg=memory success=false required.InputWeights=1250426880 required.CPU.Weights="[304972832 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2260638912]" required.CUDA0.ID=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e required.CUDA0.Weights="[0 304974208 304974208 304974208 304974208 330264704 304974208 275169664 275169664 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 304974208 330264704 299297152 299297152 304974208 299297152 299297152 330264704 0]"
time=2026-04-06T09:27:27.655+01:00 level=DEBUG source=server.go:820 msg="exploring intermediate layers" layer=58
time=2026-04-06T09:27:27.655+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="17.6 GiB" backoff=0.20 minimum="457.0 MiB" overhead="0 B" graph="0 B"
time=2026-04-06T09:27:27.656+01:00 level=DEBUG source=server.go:828 msg="new layout created" layers="58[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:58(2..59)]"
time=2026-04-06T09:27:27.656+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:58[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:58(2..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-06T09:27:27.787+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 16105.95 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 16888315136
time=2026-04-06T09:27:37.993+01:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="15.7 GiB"
time=2026-04-06T09:27:37.993+01:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="3.8 GiB"
time=2026-04-06T09:27:37.993+01:00 level=DEBUG source=device.go:272 msg="total memory" size="19.6 GiB"
time=2026-04-06T09:27:37.993+01:00 level=DEBUG source=server.go:837 msg=memory success=false required.InputWeights=1250426880 required.CPU.Weights="[304972832 304972832 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2260638912]" required.CUDA0.ID=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e required.CUDA0.Weights="[0 0 304974208 304974208 304974208 330264704 304974208 275169664 275169664 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 304974208 330264704 299297152 299297152 304974208 299297152 299297152 330264704 0]"
time=2026-04-06T09:27:37.993+01:00 level=INFO source=server.go:881 msg="model layout did not fit, applying backoff" backoff=0.30
time=2026-04-06T09:27:37.993+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="15.4 GiB" backoff=0.30 minimum="457.0 MiB" overhead="0 B" graph="0 B"
time=2026-04-06T09:27:37.993+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers="56[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:56(4..59)]"
time=2026-04-06T09:27:37.993+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:56[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:56(4..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-06T09:27:38.139+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 15524.26 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 16278366720
time=2026-04-06T09:27:48.331+01:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="15.2 GiB"
time=2026-04-06T09:27:48.331+01:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="4.4 GiB"
time=2026-04-06T09:27:48.331+01:00 level=DEBUG source=device.go:272 msg="total memory" size="19.6 GiB"
time=2026-04-06T09:27:48.331+01:00 level=DEBUG source=server.go:784 msg=memory success=false required.InputWeights=1250426880 required.CPU.Weights="[304972832 304972832 304972832 304972832 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2260638912]" required.CUDA0.ID=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e required.CUDA0.Weights="[0 0 0 0 304974208 330264704 304974208 275169664 275169664 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 304974208 330264704 299297152 299297152 304974208 299297152 299297152 330264704 0]"
time=2026-04-06T09:27:48.331+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="15.4 GiB" backoff=0.30 minimum="457.0 MiB" overhead="0 B" graph="0 B"
time=2026-04-06T09:27:48.331+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers="56[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:56(4..59)]"
time=2026-04-06T09:27:48.332+01:00 level=INFO source=server.go:881 msg="model layout did not fit, applying backoff" backoff=0.40
time=2026-04-06T09:27:48.332+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="13.1 GiB" backoff=0.40 minimum="457.0 MiB" overhead="0 B" graph="0 B"
time=2026-04-06T09:27:48.332+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers="48[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:48(12..59)]"
time=2026-04-06T09:27:48.332+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:48[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:48(12..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-06T09:27:48.464+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 13268.36 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 13912887296
time=2026-04-06T09:27:58.682+01:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="13.0 GiB"
time=2026-04-06T09:27:58.682+01:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="6.6 GiB"
time=2026-04-06T09:27:58.682+01:00 level=DEBUG source=device.go:272 msg="total memory" size="19.6 GiB"
time=2026-04-06T09:27:58.682+01:00 level=DEBUG source=server.go:784 msg=memory success=false required.InputWeights=1250426880 required.CPU.Weights="[304972832 304972832 304972832 304972832 304972832 330263584 304972832 275168288 275168288 304972832 269491232 300459040 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2260638912]" required.CUDA0.ID=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e required.CUDA0.Weights="[0 0 0 0 0 0 0 0 0 0 0 0 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 304974208 330264704 299297152 299297152 304974208 299297152 299297152 330264704 0]"
time=2026-04-06T09:27:58.683+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="13.1 GiB" backoff=0.40 minimum="457.0 MiB" overhead="0 B" graph="0 B"
time=2026-04-06T09:27:58.683+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers="48[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:48(12..59)]"
time=2026-04-06T09:27:58.684+01:00 level=INFO source=server.go:881 msg="model layout did not fit, applying backoff" backoff=0.50
time=2026-04-06T09:27:58.685+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="10.8 GiB" backoff=0.50 minimum="457.0 MiB" overhead="0 B" graph="0 B"
time=2026-04-06T09:27:58.685+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers="40[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:40(20..59)]"
time=2026-04-06T09:27:58.685+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:40[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:40(20..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-06T09:27:59.076+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 11086.67 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 11625211136
time=2026-04-06T09:28:09.236+01:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="10.8 GiB"
time=2026-04-06T09:28:09.237+01:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="8.7 GiB"
time=2026-04-06T09:28:09.237+01:00 level=DEBUG source=device.go:272 msg="total memory" size="19.6 GiB"
time=2026-04-06T09:28:09.237+01:00 level=DEBUG source=server.go:784 msg=memory success=false required.InputWeights=1250426880 required.CPU.Weights="[304972832 304972832 304972832 304972832 304972832 330263584 304972832 275168288 275168288 304972832 269491232 300459040 299295776 275168288 269491232 299295776 275168288 300459040 299295776 269491232 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2260638912]" required.CUDA0.ID=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e required.CUDA0.Weights="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 304974208 330264704 299297152 299297152 304974208 299297152 299297152 330264704 0]"
time=2026-04-06T09:28:09.237+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="10.8 GiB" backoff=0.50 minimum="457.0 MiB" overhead="0 B" graph="0 B"
time=2026-04-06T09:28:09.238+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers="40[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:40(20..59)]"
time=2026-04-06T09:28:09.238+01:00 level=INFO source=server.go:881 msg="model layout did not fit, applying backoff" backoff=0.60
time=2026-04-06T09:28:09.238+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="8.6 GiB" backoff=0.60 minimum="457.0 MiB" overhead="0 B" graph="0 B"
time=2026-04-06T09:28:09.238+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers="31[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:31(29..59)]"
time=2026-04-06T09:28:09.238+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:31[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:31(29..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-06T09:28:09.411+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 11946925152
alloc_tensor_range: failed to allocate CPU buffer of size 11946925152
time=2026-04-06T09:28:09.445+01:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="8.4 GiB"
time=2026-04-06T09:28:09.445+01:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="11.1 GiB"
time=2026-04-06T09:28:09.445+01:00 level=DEBUG source=device.go:272 msg="total memory" size="19.6 GiB"
time=2026-04-06T09:28:09.445+01:00 level=DEBUG source=server.go:784 msg=memory success=false required.InputWeights=1250426880 required.CPU.Weights="[304972832 304972832 304972832 304972832 304972832 330263584 304972832 275168288 275168288 304972832 269491232 300459040 299295776 275168288 269491232 299295776 275168288 300459040 299295776 269491232 275168288 299295776 269491232 300459040 304972832 269491232 269491232 304972832 269491232 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2260638912]" required.CUDA0.ID=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e required.CUDA0.Weights="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 304974208 330264704 299297152 299297152 304974208 299297152 299297152 330264704 0]"
time=2026-04-06T09:28:09.445+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="8.6 GiB" backoff=0.60 minimum="457.0 MiB" overhead="0 B" graph="0 B"
time=2026-04-06T09:28:09.446+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers="31[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:31(29..59)]"
time=2026-04-06T09:28:09.446+01:00 level=INFO source=server.go:881 msg="model layout did not fit, applying backoff" backoff=0.70
time=2026-04-06T09:28:09.446+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="6.3 GiB" backoff=0.70 minimum="457.0 MiB" overhead="0 B" graph="0 B"
time=2026-04-06T09:28:09.446+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers="23[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:23(37..59)]"
time=2026-04-06T09:28:09.446+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:23[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:23(37..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-06T09:28:09.618+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 14265558368
alloc_tensor_range: failed to allocate CPU buffer of size 14265558368
time=2026-04-06T09:28:10.648+01:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="6.3 GiB"
time=2026-04-06T09:28:10.648+01:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="13.3 GiB"
time=2026-04-06T09:28:10.648+01:00 level=DEBUG source=device.go:272 msg="total memory" size="19.6 GiB"
time=2026-04-06T09:28:10.648+01:00 level=DEBUG source=server.go:784 msg=memory success=false required.InputWeights=1250426880 required.CPU.Weights="[304972832 304972832 304972832 304972832 304972832 330263584 304972832 275168288 275168288 304972832 269491232 300459040 299295776 275168288 269491232 299295776 275168288 300459040 299295776 269491232 275168288 299295776 269491232 300459040 304972832 269491232 269491232 304972832 269491232 300459040 299295776 275168288 269491232 299295776 275168288 300459040 299295776 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2260638912]" required.CUDA0.ID=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e required.CUDA0.Weights="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 304974208 330264704 299297152 299297152 304974208 299297152 299297152 330264704 0]"
time=2026-04-06T09:28:10.649+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="6.3 GiB" backoff=0.70 minimum="457.0 MiB" overhead="0 B" graph="0 B"
time=2026-04-06T09:28:10.649+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers="23[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:23(37..59)]"
time=2026-04-06T09:28:10.649+01:00 level=INFO source=server.go:881 msg="model layout did not fit, applying backoff" backoff=0.80
time=2026-04-06T09:28:10.650+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="4.1 GiB" backoff=0.80 minimum="457.0 MiB" overhead="0 B" graph="0 B"
time=2026-04-06T09:28:10.650+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers="14[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:14(46..59)]"
time=2026-04-06T09:28:10.650+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:14[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:14(46..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-06T09:28:10.712+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 16828392064
alloc_tensor_range: failed to allocate CPU buffer of size 16828392064
time=2026-04-06T09:28:11.749+01:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="3.9 GiB"
time=2026-04-06T09:28:11.749+01:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="15.7 GiB"
time=2026-04-06T09:28:11.749+01:00 level=DEBUG source=device.go:272 msg="total memory" size="19.6 GiB"
time=2026-04-06T09:28:11.749+01:00 level=DEBUG source=server.go:784 msg=memory success=false required.InputWeights=1250426880 required.CPU.Weights="[304972832 304972832 304972832 304972832 304972832 330263584 304972832 275168288 275168288 304972832 269491232 300459040 299295776 275168288 269491232 299295776 275168288 300459040 299295776 269491232 275168288 299295776 269491232 300459040 304972832 269491232 269491232 304972832 269491232 300459040 299295776 275168288 269491232 299295776 275168288 300459040 299295776 269491232 275168288 299295776 269491232 300459040 304972832 269491232 269491232 304972832 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2260638912]" required.CUDA0.ID=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e required.CUDA0.Weights="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 269492608 300460160 299297152 275169664 269492608 299297152 304974208 330264704 299297152 299297152 304974208 299297152 299297152 330264704 0]"
time=2026-04-06T09:28:11.750+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="4.1 GiB" backoff=0.80 minimum="457.0 MiB" overhead="0 B" graph="0 B"
time=2026-04-06T09:28:11.750+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers="14[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:14(46..59)]"
time=2026-04-06T09:28:11.750+01:00 level=INFO source=server.go:881 msg="model layout did not fit, applying backoff" backoff=0.90
time=2026-04-06T09:28:11.750+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="1.8 GiB" backoff=0.90 minimum="457.0 MiB" overhead="0 B" graph="0 B"
time=2026-04-06T09:28:11.750+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers="6[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:6(54..59)]"
time=2026-04-06T09:28:11.750+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:6[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:6(54..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-06T09:28:11.819+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 19176829824
alloc_tensor_range: failed to allocate CPU buffer of size 19176829824
time=2026-04-06T09:28:12.849+01:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="1.7 GiB"
time=2026-04-06T09:28:12.849+01:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="17.9 GiB"
time=2026-04-06T09:28:12.849+01:00 level=DEBUG source=device.go:272 msg="total memory" size="19.6 GiB"
time=2026-04-06T09:28:12.849+01:00 level=DEBUG source=server.go:784 msg=memory success=false required.InputWeights=1250426880 required.CPU.Weights="[304972832 304972832 304972832 304972832 304972832 330263584 304972832 275168288 275168288 304972832 269491232 300459040 299295776 275168288 269491232 299295776 275168288 300459040 299295776 269491232 275168288 299295776 269491232 300459040 304972832 269491232 269491232 304972832 269491232 300459040 299295776 275168288 269491232 299295776 275168288 300459040 299295776 269491232 275168288 299295776 269491232 300459040 304972832 269491232 269491232 304972832 269491232 300459040 299295776 275168288 269491232 299295776 304972832 330263584 0 0 0 0 0 0 2260638912]" required.CUDA0.ID=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e required.CUDA0.Weights="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 299297152 299297152 304974208 299297152 299297152 330264704 0]"
time=2026-04-06T09:28:12.850+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="1.8 GiB" backoff=0.90 minimum="457.0 MiB" overhead="0 B" graph="0 B"
time=2026-04-06T09:28:12.850+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers="6[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:6(54..59)]"
time=2026-04-06T09:28:12.850+01:00 level=INFO source=server.go:881 msg="model layout did not fit, applying backoff" backoff=1.00
time=2026-04-06T09:28:12.850+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="0 B" backoff=1.00 minimum="457.0 MiB" overhead="0 B" graph="0 B"
time=2026-04-06T09:28:12.851+01:00 level=DEBUG source=server.go:1059 msg="insufficient VRAM to load any model layers"
time=2026-04-06T09:28:12.851+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers=[]
time=2026-04-06T09:28:12.851+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-06T09:28:12.914+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 21009249344
alloc_tensor_range: failed to allocate CPU buffer of size 21009249344
time=2026-04-06T09:28:13.942+01:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="19.6 GiB"
time=2026-04-06T09:28:13.942+01:00 level=DEBUG source=device.go:272 msg="total memory" size="19.6 GiB"
time=2026-04-06T09:28:13.942+01:00 level=DEBUG source=server.go:784 msg=memory success=false required.InputWeights=1250426880 required.CPU.Weights="[304972832 304972832 304972832 304972832 304972832 330263584 304972832 275168288 275168288 304972832 269491232 300459040 299295776 275168288 269491232 299295776 275168288 300459040 299295776 269491232 275168288 299295776 269491232 300459040 304972832 269491232 269491232 304972832 269491232 300459040 299295776 275168288 269491232 299295776 275168288 300459040 299295776 269491232 275168288 299295776 269491232 300459040 304972832 269491232 269491232 304972832 269491232 300459040 299295776 275168288 269491232 299295776 304972832 330263584 299295776 299295776 304972832 299295776 299295776 330263584 2260638912]"
time=2026-04-06T09:28:13.943+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="0 B" backoff=1.00 minimum="457.0 MiB" overhead="0 B" graph="0 B"
time=2026-04-06T09:28:13.943+01:00 level=DEBUG source=server.go:1059 msg="insufficient VRAM to load any model layers"
time=2026-04-06T09:28:13.943+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers=[]
time=2026-04-06T09:28:13.943+01:00 level=WARN source=server.go:875 msg="memory layout cannot be allocated" memory.InputWeights=1250426880 memory.CPU.Weights="[304972832 304972832 304972832 304972832 304972832 330263584 304972832 275168288 275168288 304972832 269491232 300459040 299295776 275168288 269491232 299295776 275168288 300459040 299295776 269491232 275168288 299295776 269491232 300459040 304972832 269491232 269491232 304972832 269491232 300459040 299295776 275168288 269491232 299295776 275168288 300459040 299295776 269491232 275168288 299295776 269491232 300459040 304972832 269491232 269491232 304972832 269491232 300459040 299295776 275168288 269491232 299295776 304972832 330263584 299295776 299295776 304972832 299295776 299295776 330263584 2260638912]"
time=2026-04-06T09:28:13.943+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:close LoraPath:[] Parallel:0 BatchSize:0 FlashAttention:Disabled KvSize:0 KvCacheType: NumThreads:0 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-06T09:28:13.943+01:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="19.6 GiB"
time=2026-04-06T09:28:13.943+01:00 level=INFO source=device.go:272 msg="total memory" size="19.6 GiB"
time=2026-04-06T09:28:13.943+01:00 level=INFO source=sched.go:511 msg="Load failed" model=C:\Users\patri\.ollama\models\blobs\sha256-280af6832eca23cb322c4dcc65edfea98a21b8f8ab07dc7553bd6f7e6e7a3313 error="memory layout cannot be allocated"
time=2026-04-06T09:28:13.943+01:00 level=DEBUG source=server.go:1832 msg="stopping llama server" pid=11348
time=2026-04-06T09:28:13.944+01:00 level=DEBUG source=server.go:1838 msg="waiting for llama server to exit" pid=11348
time=2026-04-06T09:28:14.024+01:00 level=ERROR source=server.go:304 msg="llama runner terminated" error="exit status 1"
time=2026-04-06T09:28:14.024+01:00 level=DEBUG source=server.go:1842 msg="llama server stopped" pid=11348
[GIN] 2026/04/06 - 09:28:14 | 500 |         2m19s |             ::1 | POST     "/api/generate"
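
Reading the tail of this log: the scheduler starts at backoff 0.00 and raises it by 0.10 after each failed allocation, shrinking the per-layer VRAM budget and the GPU layout (58 layers at backoff 0.00 down to 0 at backoff 1.00). The early attempts fail on the GPU side (`cudaMalloc failed: out of memory`); from backoff 0.60 onward they fail on the host side instead, with the failed CPU buffer growing from ~11.9 GB to 21,009,249,344 bytes — i.e. the entire 19.6 GiB model as a single host allocation. Every load request carries `UseMmap:false`, so that CPU share must be resident RAM; on a machine with less free RAM than the model, the loop can only end in `memory layout cannot be allocated`. Below is a toy Go sketch of this loop — purely illustrative, not Ollama's actual `server.go` logic; the linear VRAM formula and the 16 GiB host-RAM figure are assumptions:

```go
// Toy reconstruction of the backoff loop traced in the log above — NOT
// Ollama's real implementation. Constants marked "assumed" are illustrative.
package main

import "fmt"

func main() {
	const (
		modelGiB = 19.6 // "total memory" from the log
		vram0GiB = 22.1 // "available layer vram" at backoff=0.00
		hostGiB  = 16.0 // assumed free system RAM (the log never prints it)
	)
	for i := 0; i <= 10; i++ {
		backoff := float64(i) / 10
		vram := vram0GiB * (1 - backoff) // approximates the logged budgets
		cpu := modelGiB - vram           // share pushed to the CPU
		if cpu < 0 {
			cpu = 0
		}
		// UseMmap:false means the CPU share needs one resident allocation.
		fits := cpu <= hostGiB
		fmt.Printf("backoff=%.2f  vram=%5.1f GiB  cpu=%5.1f GiB  fits=%v\n",
			backoff, vram, cpu, fits)
	}
}
```

The sketch makes the dead end visible: once the CPU share exceeds free host RAM, no further backoff step can succeed, which matches the backoff 0.60 → 1.00 failures above.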
<!-- gh-comment-id:4191148247 --> @Issueposter commented on GitHub (Apr 6, 2026): Complete `server.log` with `OLLAMA_DEBUG=2` — part 2/2 ``` time=2026-04-06T09:25:56.588+01:00 level=DEBUG source=server.go:784 msg=memory success=true required.InputWeights=1250426880 required.CPU.Graph=11010048 required.CUDA0.ID=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e required.CUDA0.Weights="[304974208 304974208 304974208 304974208 304974208 330264704 304974208 275169664 275169664 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 304974208 330264704 299297152 299297152 304974208 299297152 299297152 330264704 2260644352]" required.CUDA0.Cache="[75497472 75497472 75497472 75497472 75497472 134217728 75497472 75497472 75497472 75497472 75497472 134217728 75497472 75497472 75497472 75497472 75497472 134217728 75497472 75497472 75497472 75497472 75497472 134217728 75497472 75497472 75497472 75497472 75497472 134217728 75497472 75497472 75497472 75497472 75497472 134217728 75497472 75497472 75497472 75497472 75497472 134217728 75497472 75497472 75497472 75497472 75497472 134217728 75497472 75497472 75497472 75497472 75497472 134217728 75497472 75497472 75497472 75497472 75497472 134217728 0]" required.CUDA0.Graph=1778129024 time=2026-04-06T09:25:56.589+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="20.5 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="1.7 GiB" time=2026-04-06T09:25:56.589+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers="58[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:58(2..59)]" time=2026-04-06T09:25:56.589+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:58[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:58(2..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2026-04-06T09:25:56.659+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32 time=2026-04-06T09:25:56.674+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.pooling_type default=0 time=2026-04-06T09:25:56.674+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.eot_token_id default=106 time=2026-04-06T09:25:56.674+01:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883 time=2026-04-06T09:25:56.674+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.attention.global_head_count_kv default=0 time=2026-04-06T09:25:56.674+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.expert_count default=0 time=2026-04-06T09:25:56.674+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.expert_used_count default=0 time=2026-04-06T09:25:56.674+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.audio.block_count default=0 time=2026-04-06T09:25:56.674+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" 
key=gemma4.audio.embedding_length default=0 time=2026-04-06T09:25:56.693+01:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=526A�s bounds=(0,0)-(2048,2048) time=2026-04-06T09:25:56.818+01:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=125.1303ms size="[768 768]" time=2026-04-06T09:25:56.818+01:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3 time=2026-04-06T09:25:56.818+01:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16 time=2026-04-06T09:25:56.818+01:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=125.6563ms shape="[5376 256]" time=2026-04-06T09:25:56.818+01:00 level=DEBUG source=ggml.go:852 msg="compute graph" nodes=1272 splits=355 time=2026-04-06T09:25:56.900+01:00 level=DEBUG source=ggml.go:852 msg="compute graph" nodes=2752 splits=35 time=2026-04-06T09:25:56.908+01:00 level=DEBUG source=ggml.go:852 msg="compute graph" nodes=2750 splits=3 time=2026-04-06T09:25:56.909+01:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="15.7 GiB" time=2026-04-06T09:25:56.909+01:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="3.8 GiB" time=2026-04-06T09:25:56.909+01:00 level=DEBUG source=device.go:251 msg="kv cache" device=CUDA0 size="4.6 GiB" time=2026-04-06T09:25:56.909+01:00 level=DEBUG source=device.go:256 msg="kv cache" device=CPU size="144.0 MiB" time=2026-04-06T09:25:56.909+01:00 level=DEBUG source=device.go:262 msg="compute graph" device=CUDA0 size="1.7 GiB" time=2026-04-06T09:25:56.909+01:00 level=DEBUG source=device.go:267 msg="compute graph" device=CPU size="16.0 MiB" time=2026-04-06T09:25:56.909+01:00 level=DEBUG source=device.go:272 msg="total memory" size="26.0 GiB" time=2026-04-06T09:25:56.909+01:00 level=DEBUG source=server.go:784 msg=memory success=true required.InputWeights=1250426880 required.CPU.Weights="[304972832 304972832 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2260638912]" required.CPU.Cache="[75497472 75497472 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]" required.CPU.Graph=16777216 required.CUDA0.ID=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e required.CUDA0.Weights="[0 0 304974208 304974208 304974208 330264704 304974208 275169664 275169664 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 304974208 330264704 299297152 299297152 304974208 299297152 299297152 330264704 0]" required.CUDA0.Cache="[0 0 75497472 75497472 75497472 134217728 75497472 75497472 75497472 75497472 75497472 134217728 75497472 75497472 75497472 75497472 75497472 134217728 75497472 75497472 75497472 75497472 75497472 134217728 75497472 75497472 75497472 75497472 75497472 134217728 75497472 75497472 75497472 75497472 75497472 134217728 75497472 75497472 75497472 75497472 75497472 134217728 75497472 75497472 75497472 75497472 75497472 134217728 75497472 75497472 75497472 75497472 75497472 134217728 75497472 75497472 75497472 75497472 75497472 134217728 0]" 
required.CUDA0.Graph=1779960832 time=2026-04-06T09:25:56.910+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="20.5 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="1.7 GiB" time=2026-04-06T09:25:56.910+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers="58[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:58(2..59)]" time=2026-04-06T09:25:56.910+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:58[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:58(2..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2026-04-06T09:25:56.979+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32 ggml_backend_cuda_buffer_type_alloc_buffer: allocating 16105.95 MiB on device 0: cudaMalloc failed: out of memory alloc_tensor_range: failed to allocate CUDA0 buffer of size 16888315136 time=2026-04-06T09:26:11.721+01:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="15.7 GiB" time=2026-04-06T09:26:11.721+01:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="3.8 GiB" time=2026-04-06T09:26:11.721+01:00 level=DEBUG source=device.go:272 msg="total memory" size="19.6 GiB" time=2026-04-06T09:26:11.721+01:00 level=DEBUG source=server.go:784 msg=memory success=false required.InputWeights=1250426880 required.CPU.Weights="[304972832 304972832 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2260638912]" required.CUDA0.ID=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e required.CUDA0.Weights="[0 0 304974208 304974208 304974208 330264704 304974208 275169664 275169664 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 304974208 330264704 299297152 299297152 304974208 299297152 299297152 330264704 0]" time=2026-04-06T09:26:11.722+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="22.1 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="0 B" time=2026-04-06T09:26:11.722+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers="61[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:61(0..60)]" time=2026-04-06T09:26:11.722+01:00 level=DEBUG source=server.go:820 msg="exploring intermediate layers" layer=60 time=2026-04-06T09:26:11.722+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="22.1 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="0 B" time=2026-04-06T09:26:11.723+01:00 level=DEBUG source=server.go:828 msg="new layout created" layers="60[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:60(0..59)]" time=2026-04-06T09:26:11.723+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 
GPULayers:60[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:60(0..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2026-04-06T09:26:11.865+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32 ggml_backend_cuda_buffer_type_alloc_buffer: allocating 16687.64 MiB on device 0: cudaMalloc failed: out of memory alloc_tensor_range: failed to allocate CUDA0 buffer of size 17498263552 time=2026-04-06T09:26:22.179+01:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="16.3 GiB" time=2026-04-06T09:26:22.179+01:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="3.3 GiB" time=2026-04-06T09:26:22.179+01:00 level=DEBUG source=device.go:272 msg="total memory" size="19.6 GiB" time=2026-04-06T09:26:22.179+01:00 level=DEBUG source=server.go:837 msg=memory success=false required.InputWeights=1250426880 required.CPU.Weights="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2260638912]" required.CUDA0.ID=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e required.CUDA0.Weights="[304974208 304974208 304974208 304974208 304974208 330264704 304974208 275169664 275169664 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 304974208 330264704 299297152 299297152 304974208 299297152 299297152 330264704 0]" time=2026-04-06T09:26:22.179+01:00 level=DEBUG source=server.go:820 msg="exploring intermediate layers" layer=59 time=2026-04-06T09:26:22.179+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="22.1 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="0 B" time=2026-04-06T09:26:22.180+01:00 level=DEBUG source=server.go:828 msg="new layout created" layers="59[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:59(1..59)]" time=2026-04-06T09:26:22.180+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:59[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:59(1..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2026-04-06T09:26:22.315+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32 ggml_backend_cuda_buffer_type_alloc_buffer: allocating 16396.80 MiB on device 0: cudaMalloc failed: out of memory alloc_tensor_range: failed to allocate CUDA0 buffer of size 17193289344 time=2026-04-06T09:26:33.602+01:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="16.0 GiB" time=2026-04-06T09:26:33.603+01:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="3.6 GiB" time=2026-04-06T09:26:33.603+01:00 level=DEBUG source=device.go:272 msg="total memory" size="19.6 GiB" time=2026-04-06T09:26:33.603+01:00 level=DEBUG source=server.go:837 msg=memory success=false required.InputWeights=1250426880 required.CPU.Weights="[304972832 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2260638912]" 
required.CUDA0.ID=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e required.CUDA0.Weights="[0 304974208 304974208 304974208 304974208 330264704 304974208 275169664 275169664 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 304974208 330264704 299297152 299297152 304974208 299297152 299297152 330264704 0]" time=2026-04-06T09:26:33.603+01:00 level=DEBUG source=server.go:820 msg="exploring intermediate layers" layer=58 time=2026-04-06T09:26:33.603+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="22.1 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="0 B" time=2026-04-06T09:26:33.603+01:00 level=DEBUG source=server.go:828 msg="new layout created" layers="58[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:58(2..59)]" time=2026-04-06T09:26:33.604+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:58[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:58(2..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2026-04-06T09:26:33.735+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32 ggml_backend_cuda_buffer_type_alloc_buffer: allocating 16105.95 MiB on device 0: cudaMalloc failed: out of memory alloc_tensor_range: failed to allocate CUDA0 buffer of size 16888315136 time=2026-04-06T09:26:46.033+01:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="15.7 GiB" time=2026-04-06T09:26:46.033+01:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="3.8 GiB" time=2026-04-06T09:26:46.033+01:00 level=DEBUG source=device.go:272 msg="total memory" size="19.6 GiB" time=2026-04-06T09:26:46.034+01:00 level=DEBUG source=server.go:837 msg=memory success=false required.InputWeights=1250426880 required.CPU.Weights="[304972832 304972832 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2260638912]" required.CUDA0.ID=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e required.CUDA0.Weights="[0 0 304974208 304974208 304974208 330264704 304974208 275169664 275169664 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 304974208 330264704 299297152 299297152 304974208 299297152 299297152 330264704 0]" time=2026-04-06T09:26:46.034+01:00 level=INFO source=server.go:881 msg="model layout did not fit, applying backoff" backoff=0.10 time=2026-04-06T09:26:46.034+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="19.9 GiB" backoff=0.10 minimum="457.0 MiB" overhead="0 B" graph="0 B" 
time=2026-04-06T09:26:46.034+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers="61[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:61(0..60)]" time=2026-04-06T09:26:46.034+01:00 level=DEBUG source=server.go:820 msg="exploring intermediate layers" layer=60 time=2026-04-06T09:26:46.034+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="19.9 GiB" backoff=0.10 minimum="457.0 MiB" overhead="0 B" graph="0 B" time=2026-04-06T09:26:46.035+01:00 level=DEBUG source=server.go:828 msg="new layout created" layers="60[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:60(0..59)]" time=2026-04-06T09:26:46.035+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:60[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:60(0..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2026-04-06T09:26:46.172+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32 ggml_backend_cuda_buffer_type_alloc_buffer: allocating 16687.64 MiB on device 0: cudaMalloc failed: out of memory alloc_tensor_range: failed to allocate CUDA0 buffer of size 17498263552 time=2026-04-06T09:26:56.511+01:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="16.3 GiB" time=2026-04-06T09:26:56.511+01:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="3.3 GiB" time=2026-04-06T09:26:56.511+01:00 level=DEBUG source=device.go:272 msg="total memory" size="19.6 GiB" time=2026-04-06T09:26:56.511+01:00 level=DEBUG source=server.go:837 msg=memory success=false required.InputWeights=1250426880 required.CPU.Weights="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2260638912]" required.CUDA0.ID=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e required.CUDA0.Weights="[304974208 304974208 304974208 304974208 304974208 330264704 304974208 275169664 275169664 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 304974208 330264704 299297152 299297152 304974208 299297152 299297152 330264704 0]" time=2026-04-06T09:26:56.526+01:00 level=DEBUG source=server.go:820 msg="exploring intermediate layers" layer=59 time=2026-04-06T09:26:56.527+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="19.9 GiB" backoff=0.10 minimum="457.0 MiB" overhead="0 B" graph="0 B" time=2026-04-06T09:26:56.527+01:00 level=DEBUG source=server.go:828 msg="new layout created" layers="59[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:59(1..59)]" time=2026-04-06T09:26:56.527+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:59[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:59(1..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2026-04-06T09:26:56.660+01:00 level=DEBUG 
source=ggml.go:325 msg="key with type not found" key=general.alignment default=32 ggml_backend_cuda_buffer_type_alloc_buffer: allocating 16396.80 MiB on device 0: cudaMalloc failed: out of memory time=2026-04-06T09:27:06.935+01:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="16.0 GiB" time=2026-04-06T09:27:06.935+01:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="3.6 GiB" alloc_tensor_range: failed to allocate CUDA0 buffer of size 17193289344 time=2026-04-06T09:27:06.935+01:00 level=DEBUG source=device.go:272 msg="total memory" size="19.6 GiB" time=2026-04-06T09:27:06.935+01:00 level=DEBUG source=server.go:837 msg=memory success=false required.InputWeights=1250426880 required.CPU.Weights="[304972832 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2260638912]" required.CUDA0.ID=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e required.CUDA0.Weights="[0 304974208 304974208 304974208 304974208 330264704 304974208 275169664 275169664 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 304974208 330264704 299297152 299297152 304974208 299297152 299297152 330264704 0]" time=2026-04-06T09:27:06.935+01:00 level=DEBUG source=server.go:820 msg="exploring intermediate layers" layer=58 time=2026-04-06T09:27:06.936+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="19.9 GiB" backoff=0.10 minimum="457.0 MiB" overhead="0 B" graph="0 B" time=2026-04-06T09:27:06.936+01:00 level=DEBUG source=server.go:828 msg="new layout created" layers="58[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:58(2..59)]" time=2026-04-06T09:27:06.936+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:58[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:58(2..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2026-04-06T09:27:07.073+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32 ggml_backend_cuda_buffer_type_alloc_buffer: allocating 16105.95 MiB on device 0: cudaMalloc failed: out of memory time=2026-04-06T09:27:17.295+01:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="15.7 GiB" time=2026-04-06T09:27:17.295+01:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="3.8 GiB" alloc_tensor_range: failed to allocate CUDA0 buffer of size 16888315136 time=2026-04-06T09:27:17.295+01:00 level=DEBUG source=device.go:272 msg="total memory" size="19.6 GiB" time=2026-04-06T09:27:17.295+01:00 level=DEBUG source=server.go:837 msg=memory success=false required.InputWeights=1250426880 required.CPU.Weights="[304972832 304972832 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2260638912]" required.CUDA0.ID=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e required.CUDA0.Weights="[0 0 304974208 304974208 304974208 330264704 304974208 275169664 275169664 304974208 
269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 304974208 330264704 299297152 299297152 304974208 299297152 299297152 330264704 0]" time=2026-04-06T09:27:17.296+01:00 level=INFO source=server.go:881 msg="model layout did not fit, applying backoff" backoff=0.20 time=2026-04-06T09:27:17.296+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="17.6 GiB" backoff=0.20 minimum="457.0 MiB" overhead="0 B" graph="0 B" time=2026-04-06T09:27:17.297+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers="60[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:60(0..59)]" time=2026-04-06T09:27:17.297+01:00 level=DEBUG source=server.go:820 msg="exploring intermediate layers" layer=59 time=2026-04-06T09:27:17.297+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="17.6 GiB" backoff=0.20 minimum="457.0 MiB" overhead="0 B" graph="0 B" time=2026-04-06T09:27:17.297+01:00 level=DEBUG source=server.go:828 msg="new layout created" layers="59[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:59(1..59)]" time=2026-04-06T09:27:17.297+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:59[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:59(1..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2026-04-06T09:27:17.431+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32 ggml_backend_cuda_buffer_type_alloc_buffer: allocating 16396.80 MiB on device 0: cudaMalloc failed: out of memory alloc_tensor_range: failed to allocate CUDA0 buffer of size 17193289344 time=2026-04-06T09:27:27.655+01:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="16.0 GiB" time=2026-04-06T09:27:27.655+01:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="3.6 GiB" time=2026-04-06T09:27:27.655+01:00 level=DEBUG source=device.go:272 msg="total memory" size="19.6 GiB" time=2026-04-06T09:27:27.655+01:00 level=DEBUG source=server.go:837 msg=memory success=false required.InputWeights=1250426880 required.CPU.Weights="[304972832 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2260638912]" required.CUDA0.ID=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e required.CUDA0.Weights="[0 304974208 304974208 304974208 304974208 330264704 304974208 275169664 275169664 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 304974208 330264704 299297152 299297152 304974208 299297152 299297152 330264704 0]" time=2026-04-06T09:27:27.655+01:00 
level=DEBUG source=server.go:820 msg="exploring intermediate layers" layer=58 time=2026-04-06T09:27:27.655+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="17.6 GiB" backoff=0.20 minimum="457.0 MiB" overhead="0 B" graph="0 B" time=2026-04-06T09:27:27.656+01:00 level=DEBUG source=server.go:828 msg="new layout created" layers="58[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:58(2..59)]" time=2026-04-06T09:27:27.656+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:58[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:58(2..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2026-04-06T09:27:27.787+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32 ggml_backend_cuda_buffer_type_alloc_buffer: allocating 16105.95 MiB on device 0: cudaMalloc failed: out of memory alloc_tensor_range: failed to allocate CUDA0 buffer of size 16888315136 time=2026-04-06T09:27:37.993+01:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="15.7 GiB" time=2026-04-06T09:27:37.993+01:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="3.8 GiB" time=2026-04-06T09:27:37.993+01:00 level=DEBUG source=device.go:272 msg="total memory" size="19.6 GiB" time=2026-04-06T09:27:37.993+01:00 level=DEBUG source=server.go:837 msg=memory success=false required.InputWeights=1250426880 required.CPU.Weights="[304972832 304972832 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2260638912]" required.CUDA0.ID=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e required.CUDA0.Weights="[0 0 304974208 304974208 304974208 330264704 304974208 275169664 275169664 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 304974208 330264704 299297152 299297152 304974208 299297152 299297152 330264704 0]" time=2026-04-06T09:27:37.993+01:00 level=INFO source=server.go:881 msg="model layout did not fit, applying backoff" backoff=0.30 time=2026-04-06T09:27:37.993+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="15.4 GiB" backoff=0.30 minimum="457.0 MiB" overhead="0 B" graph="0 B" time=2026-04-06T09:27:37.993+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers="56[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:56(4..59)]" time=2026-04-06T09:27:37.993+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:56[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:56(4..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2026-04-06T09:27:38.139+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32 ggml_backend_cuda_buffer_type_alloc_buffer: allocating 15524.26 MiB on device 0: cudaMalloc failed: out of 
memory alloc_tensor_range: failed to allocate CUDA0 buffer of size 16278366720 time=2026-04-06T09:27:48.331+01:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="15.2 GiB" time=2026-04-06T09:27:48.331+01:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="4.4 GiB" time=2026-04-06T09:27:48.331+01:00 level=DEBUG source=device.go:272 msg="total memory" size="19.6 GiB" time=2026-04-06T09:27:48.331+01:00 level=DEBUG source=server.go:784 msg=memory success=false required.InputWeights=1250426880 required.CPU.Weights="[304972832 304972832 304972832 304972832 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2260638912]" required.CUDA0.ID=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e required.CUDA0.Weights="[0 0 0 0 304974208 330264704 304974208 275169664 275169664 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 304974208 330264704 299297152 299297152 304974208 299297152 299297152 330264704 0]" time=2026-04-06T09:27:48.331+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="15.4 GiB" backoff=0.30 minimum="457.0 MiB" overhead="0 B" graph="0 B" time=2026-04-06T09:27:48.331+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers="56[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:56(4..59)]" time=2026-04-06T09:27:48.332+01:00 level=INFO source=server.go:881 msg="model layout did not fit, applying backoff" backoff=0.40 time=2026-04-06T09:27:48.332+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="13.1 GiB" backoff=0.40 minimum="457.0 MiB" overhead="0 B" graph="0 B" time=2026-04-06T09:27:48.332+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers="48[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:48(12..59)]" time=2026-04-06T09:27:48.332+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:48[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:48(12..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2026-04-06T09:27:48.464+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32 ggml_backend_cuda_buffer_type_alloc_buffer: allocating 13268.36 MiB on device 0: cudaMalloc failed: out of memory alloc_tensor_range: failed to allocate CUDA0 buffer of size 13912887296 time=2026-04-06T09:27:58.682+01:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="13.0 GiB" time=2026-04-06T09:27:58.682+01:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="6.6 GiB" time=2026-04-06T09:27:58.682+01:00 level=DEBUG source=device.go:272 msg="total memory" size="19.6 GiB" time=2026-04-06T09:27:58.682+01:00 level=DEBUG source=server.go:784 msg=memory success=false required.InputWeights=1250426880 required.CPU.Weights="[304972832 304972832 304972832 304972832 304972832 330263584 304972832 275168288 275168288 
304972832 269491232 300459040 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2260638912]" required.CUDA0.ID=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e required.CUDA0.Weights="[0 0 0 0 0 0 0 0 0 0 0 0 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 304974208 330264704 299297152 299297152 304974208 299297152 299297152 330264704 0]" time=2026-04-06T09:27:58.683+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="13.1 GiB" backoff=0.40 minimum="457.0 MiB" overhead="0 B" graph="0 B" time=2026-04-06T09:27:58.683+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers="48[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:48(12..59)]" time=2026-04-06T09:27:58.684+01:00 level=INFO source=server.go:881 msg="model layout did not fit, applying backoff" backoff=0.50 time=2026-04-06T09:27:58.685+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="10.8 GiB" backoff=0.50 minimum="457.0 MiB" overhead="0 B" graph="0 B" time=2026-04-06T09:27:58.685+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers="40[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:40(20..59)]" time=2026-04-06T09:27:58.685+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:40[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:40(20..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2026-04-06T09:27:59.076+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32 ggml_backend_cuda_buffer_type_alloc_buffer: allocating 11086.67 MiB on device 0: cudaMalloc failed: out of memory alloc_tensor_range: failed to allocate CUDA0 buffer of size 11625211136 time=2026-04-06T09:28:09.236+01:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="10.8 GiB" time=2026-04-06T09:28:09.237+01:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="8.7 GiB" time=2026-04-06T09:28:09.237+01:00 level=DEBUG source=device.go:272 msg="total memory" size="19.6 GiB" time=2026-04-06T09:28:09.237+01:00 level=DEBUG source=server.go:784 msg=memory success=false required.InputWeights=1250426880 required.CPU.Weights="[304972832 304972832 304972832 304972832 304972832 330263584 304972832 275168288 275168288 304972832 269491232 300459040 299295776 275168288 269491232 299295776 275168288 300459040 299295776 269491232 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2260638912]" required.CUDA0.ID=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e required.CUDA0.Weights="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 
304974208 330264704 299297152 299297152 304974208 299297152 299297152 330264704 0]" time=2026-04-06T09:28:09.237+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="10.8 GiB" backoff=0.50 minimum="457.0 MiB" overhead="0 B" graph="0 B" time=2026-04-06T09:28:09.238+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers="40[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:40(20..59)]" time=2026-04-06T09:28:09.238+01:00 level=INFO source=server.go:881 msg="model layout did not fit, applying backoff" backoff=0.60 time=2026-04-06T09:28:09.238+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="8.6 GiB" backoff=0.60 minimum="457.0 MiB" overhead="0 B" graph="0 B" time=2026-04-06T09:28:09.238+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers="31[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:31(29..59)]" time=2026-04-06T09:28:09.238+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:31[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:31(29..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2026-04-06T09:28:09.411+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32 ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 11946925152 alloc_tensor_range: failed to allocate CPU buffer of size 11946925152 time=2026-04-06T09:28:09.445+01:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="8.4 GiB" time=2026-04-06T09:28:09.445+01:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="11.1 GiB" time=2026-04-06T09:28:09.445+01:00 level=DEBUG source=device.go:272 msg="total memory" size="19.6 GiB" time=2026-04-06T09:28:09.445+01:00 level=DEBUG source=server.go:784 msg=memory success=false required.InputWeights=1250426880 required.CPU.Weights="[304972832 304972832 304972832 304972832 304972832 330263584 304972832 275168288 275168288 304972832 269491232 300459040 299295776 275168288 269491232 299295776 275168288 300459040 299295776 269491232 275168288 299295776 269491232 300459040 304972832 269491232 269491232 304972832 269491232 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2260638912]" required.CUDA0.ID=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e required.CUDA0.Weights="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 300460160 299297152 275169664 269492608 299297152 275169664 300460160 299297152 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 304974208 330264704 299297152 299297152 304974208 299297152 299297152 330264704 0]" time=2026-04-06T09:28:09.445+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="8.6 GiB" backoff=0.60 minimum="457.0 MiB" overhead="0 B" graph="0 B" time=2026-04-06T09:28:09.446+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers="31[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:31(29..59)]" time=2026-04-06T09:28:09.446+01:00 level=INFO source=server.go:881 msg="model layout did not fit, applying backoff" backoff=0.70 time=2026-04-06T09:28:09.446+01:00 level=DEBUG 
source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="6.3 GiB" backoff=0.70 minimum="457.0 MiB" overhead="0 B" graph="0 B" time=2026-04-06T09:28:09.446+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers="23[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:23(37..59)]" time=2026-04-06T09:28:09.446+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:23[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:23(37..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2026-04-06T09:28:09.618+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32 ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 14265558368 alloc_tensor_range: failed to allocate CPU buffer of size 14265558368 time=2026-04-06T09:28:10.648+01:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="6.3 GiB" time=2026-04-06T09:28:10.648+01:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="13.3 GiB" time=2026-04-06T09:28:10.648+01:00 level=DEBUG source=device.go:272 msg="total memory" size="19.6 GiB" time=2026-04-06T09:28:10.648+01:00 level=DEBUG source=server.go:784 msg=memory success=false required.InputWeights=1250426880 required.CPU.Weights="[304972832 304972832 304972832 304972832 304972832 330263584 304972832 275168288 275168288 304972832 269491232 300459040 299295776 275168288 269491232 299295776 275168288 300459040 299295776 269491232 275168288 299295776 269491232 300459040 304972832 269491232 269491232 304972832 269491232 300459040 299295776 275168288 269491232 299295776 275168288 300459040 299295776 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2260638912]" required.CUDA0.ID=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e required.CUDA0.Weights="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 269492608 275169664 299297152 269492608 300460160 304974208 269492608 269492608 304974208 269492608 300460160 299297152 275169664 269492608 299297152 304974208 330264704 299297152 299297152 304974208 299297152 299297152 330264704 0]" time=2026-04-06T09:28:10.649+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="6.3 GiB" backoff=0.70 minimum="457.0 MiB" overhead="0 B" graph="0 B" time=2026-04-06T09:28:10.649+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers="23[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:23(37..59)]" time=2026-04-06T09:28:10.649+01:00 level=INFO source=server.go:881 msg="model layout did not fit, applying backoff" backoff=0.80 time=2026-04-06T09:28:10.650+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="4.1 GiB" backoff=0.80 minimum="457.0 MiB" overhead="0 B" graph="0 B" time=2026-04-06T09:28:10.650+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers="14[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:14(46..59)]" time=2026-04-06T09:28:10.650+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:14[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:14(46..59)] MultiUserCache:false ProjectorPath: 
MainGPU:0 UseMmap:false}"
time=2026-04-06T09:28:10.712+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 16828392064
alloc_tensor_range: failed to allocate CPU buffer of size 16828392064
time=2026-04-06T09:28:11.749+01:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="3.9 GiB"
time=2026-04-06T09:28:11.749+01:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="15.7 GiB"
time=2026-04-06T09:28:11.749+01:00 level=DEBUG source=device.go:272 msg="total memory" size="19.6 GiB"
time=2026-04-06T09:28:11.749+01:00 level=DEBUG source=server.go:784 msg=memory success=false required.InputWeights=1250426880 required.CPU.Weights="[304972832 304972832 304972832 304972832 304972832 330263584 304972832 275168288 275168288 304972832 269491232 300459040 299295776 275168288 269491232 299295776 275168288 300459040 299295776 269491232 275168288 299295776 269491232 300459040 304972832 269491232 269491232 304972832 269491232 300459040 299295776 275168288 269491232 299295776 275168288 300459040 299295776 269491232 275168288 299295776 269491232 300459040 304972832 269491232 269491232 304972832 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2260638912]" required.CUDA0.ID=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e required.CUDA0.Weights="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 269492608 300460160 299297152 275169664 269492608 299297152 304974208 330264704 299297152 299297152 304974208 299297152 299297152 330264704 0]"
time=2026-04-06T09:28:11.750+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="4.1 GiB" backoff=0.80 minimum="457.0 MiB" overhead="0 B" graph="0 B"
time=2026-04-06T09:28:11.750+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers="14[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:14(46..59)]"
time=2026-04-06T09:28:11.750+01:00 level=INFO source=server.go:881 msg="model layout did not fit, applying backoff" backoff=0.90
time=2026-04-06T09:28:11.750+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="1.8 GiB" backoff=0.90 minimum="457.0 MiB" overhead="0 B" graph="0 B"
time=2026-04-06T09:28:11.750+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers="6[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:6(54..59)]"
time=2026-04-06T09:28:11.750+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:6[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:6(54..59)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-06T09:28:11.819+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 19176829824
alloc_tensor_range: failed to allocate CPU buffer of size 19176829824
time=2026-04-06T09:28:12.849+01:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="1.7 GiB"
time=2026-04-06T09:28:12.849+01:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="17.9 GiB"
time=2026-04-06T09:28:12.849+01:00 level=DEBUG source=device.go:272 msg="total memory" size="19.6 GiB"
time=2026-04-06T09:28:12.849+01:00 level=DEBUG source=server.go:784 msg=memory success=false required.InputWeights=1250426880 required.CPU.Weights="[304972832 304972832 304972832 304972832 304972832 330263584 304972832 275168288 275168288 304972832 269491232 300459040 299295776 275168288 269491232 299295776 275168288 300459040 299295776 269491232 275168288 299295776 269491232 300459040 304972832 269491232 269491232 304972832 269491232 300459040 299295776 275168288 269491232 299295776 275168288 300459040 299295776 269491232 275168288 299295776 269491232 300459040 304972832 269491232 269491232 304972832 269491232 300459040 299295776 275168288 269491232 299295776 304972832 330263584 0 0 0 0 0 0 2260638912]" required.CUDA0.ID=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e required.CUDA0.Weights="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 299297152 299297152 304974208 299297152 299297152 330264704 0]"
time=2026-04-06T09:28:12.850+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="1.8 GiB" backoff=0.90 minimum="457.0 MiB" overhead="0 B" graph="0 B"
time=2026-04-06T09:28:12.850+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers="6[ID:GPU-84f27e06-0203-3d52-a176-80d1f45cd22e Layers:6(54..59)]"
time=2026-04-06T09:28:12.850+01:00 level=INFO source=server.go:881 msg="model layout did not fit, applying backoff" backoff=1.00
time=2026-04-06T09:28:12.850+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="0 B" backoff=1.00 minimum="457.0 MiB" overhead="0 B" graph="0 B"
time=2026-04-06T09:28:12.851+01:00 level=DEBUG source=server.go:1059 msg="insufficient VRAM to load any model layers"
time=2026-04-06T09:28:12.851+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers=[]
time=2026-04-06T09:28:12.851+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:16384 KvCacheType: NumThreads:8 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-06T09:28:12.914+01:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 21009249344
alloc_tensor_range: failed to allocate CPU buffer of size 21009249344
time=2026-04-06T09:28:13.942+01:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="19.6 GiB"
time=2026-04-06T09:28:13.942+01:00 level=DEBUG source=device.go:272 msg="total memory" size="19.6 GiB"
time=2026-04-06T09:28:13.942+01:00 level=DEBUG source=server.go:784 msg=memory success=false required.InputWeights=1250426880 required.CPU.Weights="[304972832 304972832 304972832 304972832 304972832 330263584 304972832 275168288 275168288 304972832 269491232 300459040 299295776 275168288 269491232 299295776 275168288 300459040 299295776 269491232 275168288 299295776 269491232 300459040 304972832 269491232 269491232 304972832 269491232 300459040 299295776 275168288 269491232 299295776 275168288 300459040 299295776 269491232 275168288 299295776 269491232 300459040 304972832 269491232 269491232 304972832 269491232 300459040 299295776 275168288 269491232 299295776 304972832 330263584 299295776 299295776 304972832 299295776 299295776 330263584 2260638912]"
time=2026-04-06T09:28:13.943+01:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-84f27e06-0203-3d52-a176-80d1f45cd22e library=CUDA "available layer vram"="0 B" backoff=1.00 minimum="457.0 MiB" overhead="0 B" graph="0 B"
time=2026-04-06T09:28:13.943+01:00 level=DEBUG source=server.go:1059 msg="insufficient VRAM to load any model layers"
time=2026-04-06T09:28:13.943+01:00 level=DEBUG source=server.go:795 msg="new layout created" layers=[]
time=2026-04-06T09:28:13.943+01:00 level=WARN source=server.go:875 msg="memory layout cannot be allocated" memory.InputWeights=1250426880 memory.CPU.Weights="[304972832 304972832 304972832 304972832 304972832 330263584 304972832 275168288 275168288 304972832 269491232 300459040 299295776 275168288 269491232 299295776 275168288 300459040 299295776 269491232 275168288 299295776 269491232 300459040 304972832 269491232 269491232 304972832 269491232 300459040 299295776 275168288 269491232 299295776 275168288 300459040 299295776 269491232 275168288 299295776 269491232 300459040 304972832 269491232 269491232 304972832 269491232 300459040 299295776 275168288 269491232 299295776 304972832 330263584 299295776 299295776 304972832 299295776 299295776 330263584 2260638912]"
time=2026-04-06T09:28:13.943+01:00 level=INFO source=runner.go:1290 msg=load request="{Operation:close LoraPath:[] Parallel:0 BatchSize:0 FlashAttention:Disabled KvSize:0 KvCacheType: NumThreads:0 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-06T09:28:13.943+01:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="19.6 GiB"
time=2026-04-06T09:28:13.943+01:00 level=INFO source=device.go:272 msg="total memory" size="19.6 GiB"
time=2026-04-06T09:28:13.943+01:00 level=INFO source=sched.go:511 msg="Load failed" model=C:\Users\patri\.ollama\models\blobs\sha256-280af6832eca23cb322c4dcc65edfea98a21b8f8ab07dc7553bd6f7e6e7a3313 error="memory layout cannot be allocated"
time=2026-04-06T09:28:13.943+01:00 level=DEBUG source=server.go:1832 msg="stopping llama server" pid=11348
time=2026-04-06T09:28:13.944+01:00 level=DEBUG source=server.go:1838 msg="waiting for llama server to exit" pid=11348
time=2026-04-06T09:28:14.024+01:00 level=ERROR source=server.go:304 msg="llama runner terminated" error="exit status 1"
time=2026-04-06T09:28:14.024+01:00 level=DEBUG source=server.go:1842 msg="llama server stopped" pid=11348
[GIN] 2026/04/06 - 09:28:14 | 500 | 2m19s | ::1 | POST "/api/generate"
```
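One thing worth noting when reading the trace: each backoff step only shrinks the GPU share of the layout (14, then 6, then 0 layers), but the allocation that actually fails is the CPU-side buffer, and that buffer grows on every retry (16.8 GB, 19.2 GB, 21.0 GB) as more weights are pushed off the GPU. Below is a toy Go sketch of the loop the log implies; every name in it is hypothetical rather than Ollama's real API, and the pairing of backoff steps to buffer sizes is approximate:

```go
// Toy reconstruction of the layout/backoff loop visible in the log above.
// All names are made up for illustration; the real logic lives in Ollama's
// server.go and sched.go, and this sketch only models the observed behaviour.
package main

import "fmt"

// tryAlloc stands in for the runner's "Operation:alloc" load request. In the
// failing trace it errors on every attempt, because the failure is a CPU
// staging buffer that cannot be allocated; shrinking the GPU share never
// fixes that, it only moves more weights onto the CPU.
func tryAlloc(gpuLayers int, cpuBufBytes int64) error {
	return fmt.Errorf("alloc_tensor_range: failed to allocate CPU buffer of size %d", cpuBufBytes)
}

func main() {
	// Backoff values, GPU layer counts, and failing CPU buffer sizes
	// taken from the log above.
	attempts := []struct {
		backoff   float64
		gpuLayers int
		cpuBuf    int64
	}{
		{0.80, 14, 16828392064},
		{0.90, 6, 19176829824},
		{1.00, 0, 21009249344},
	}

	for _, a := range attempts {
		fmt.Printf("backoff=%.2f gpuLayers=%d\n", a.backoff, a.gpuLayers)
		if err := tryAlloc(a.gpuLayers, a.cpuBuf); err == nil {
			fmt.Println("layout fit, loading model")
			return
		}
	}
	// With zero GPU layers left and the CPU-only attempt also failing,
	// the scheduler gives up, matching sched.go:511 in the trace.
	fmt.Println(`Load failed: error="memory layout cannot be allocated"`)
}
```

Because backoff only relaxes the VRAM constraint while the resource that is actually exhausted is host memory, the loop can only terminate in `GPULayers:[]` and the `memory layout cannot be allocated` failure seen above.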

@KarimGeiger commented on GitHub (Apr 9, 2026):

I seem to be running into the same issue. Disabling flash attention does not appear to help.

Task Manager shows multiple attempts to load the model into GPU and CPU memory, with GPU memory dropping and CPU memory climbing on each successive attempt. I killed the server after 10 minutes without a response.

Server log with debug=2:

[server.log.zip](https://github.com/user-attachments/files/26596728/server.log.zip)

![Task Manager screenshot (1 of 2)](https://github.com/user-attachments/assets/d29af797-0e23-4869-8fd2-9eade9ad4b2c)
![Task Manager screenshot (2 of 2)](https://github.com/user-attachments/assets/763379d2-ddc0-4e56-8794-3f46a113ff02)
Reference: github-starred/ollama#35581