[GH-ISSUE #10233] GGML_ASSERT Crash with Parallel Requests and Shared Memory #6715

Open
opened 2026-04-12 18:27:37 -05:00 by GiteaMirror · 8 comments
Owner

Originally created by @forReason on GitHub (Apr 11, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10233

What is the issue?

The Ollama runner intermittently crashes with a low-level GGML assertion error when using OLLAMA_NUM_PARALLEL=6 and OLLAMA_SHAREDMEM=1. logs show repeated assertion failures, GPU memory not releasing, and forcibly closed socket connections.


✅ Environment

| Key | Value |
| --- | --- |
| Model | gemma3:27b-it-fp16 (also seen with gemma:7b or custom models) |
| Backend | CUDA (2× NVIDIA A40-48Q, 48 GB each) |
| OLLAMA_NUM_PARALLEL | 6 |
| OLLAMA_SHAREDMEM | 1 |

❗ Error Messages

💥 GGML Assertion:

ggml.c:1584: GGML_ASSERT(view_src == NULL || data_size == 0 || data_size + view_offs <= ggml_nbytes(view_src)) failed
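
This assertion in `ggml.c` guards tensor views: a view's byte range (offset plus size) must lie entirely within its source tensor's buffer. As a minimal illustration of the invariant (a Python sketch, not the actual C code; names mirror the assert's variables):

```python
# Sketch of the bounds invariant behind the GGML_ASSERT above.
# view_offs:        byte offset of the view into the source tensor's data
# data_size:        number of bytes the view spans
# view_src_nbytes:  total size of the source tensor, i.e. ggml_nbytes(view_src)
def view_in_bounds(view_src_nbytes: int, view_offs: int, data_size: int) -> bool:
    return data_size == 0 or view_offs + data_size <= view_src_nbytes

# The crash means some view's offset + size reached past the end of its source
# buffer; with --parallel 6 this plausibly involves per-sequence cache views,
# though the logs alone do not identify the tensor.
```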

❌ Server Termination:

source=server.go:449 msg="llama runner terminated" error="exit status 0xc0000409"

(`0xc0000409` is Windows `STATUS_STACK_BUFFER_OVERRUN`; modern Windows runtimes also raise it for fail-fast aborts such as a failed assertion.)

🔄 Repeated API Failures:

"error": "an error was encountered while running the model: GGML_ASSERT(view_src == NULL || data_size == 0 || data_size + view_offs <= ggml_nbytes(view_src)) failed"
"error": "POST predict: ... wsarecv: An existing connection was forcibly closed by the remote host."

📉 Observed Behavior
The error does not happen immediately, but appears after several generations, especially under load.

GPU usage per card hovers around 32–34 GB out of 48 GB, but crashes still occur.

Shared memory usage appears to grow from roughly 0.1 GB to 2 GB before the crash (a minimal repro sketch follows).
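
A minimal load sketch that matches this setup, assuming a local server with the environment above: six concurrent `/api/chat` requests, looped so the crash has time to appear (it typically takes several generations). Host, prompts, and request count are placeholders.

```python
# Hedged repro sketch: hammer a local Ollama server with 6 concurrent chats
# (matching OLLAMA_NUM_PARALLEL=6). Uses only the standard library.
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

HOST = "http://localhost:11434"   # assumed default endpoint; adjust as needed
MODEL = "gemma3:27b-it-fp16"      # model from this report

def chat(i: int) -> int:
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": f"Write a short story #{i}."}],
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        f"{HOST}/api/chat", data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# 60 requests across 6 workers; the assertion usually fires only under
# sustained load, so increase the count if it does not reproduce.
with ThreadPoolExecutor(max_workers=6) as pool:
    for status in pool.map(chat, range(60)):
        print(status)
```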

Relevant log output

2025/04/11 12:11:05 routes.go:1231: INFO server config env="map[CUDA_VISIBLE_DEVICES:0,1 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\\Users\\devops01\\.ollama\\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:6 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-04-11T12:11:05.582+02:00 level=INFO source=images.go:458 msg="total blobs: 17"
time=2025-04-11T12:11:05.584+02:00 level=INFO source=images.go:465 msg="total unused blobs removed: 0"
time=2025-04-11T12:11:05.596+02:00 level=INFO source=routes.go:1298 msg="Listening on [::]:11434 (version 0.6.4)"
time=2025-04-11T12:11:05.600+02:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-04-11T12:11:05.603+02:00 level=INFO source=gpu_windows.go:167 msg=packages count=4
time=2025-04-11T12:11:05.603+02:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=2 efficiency=0 threads=2
time=2025-04-11T12:11:05.603+02:00 level=INFO source=gpu_windows.go:214 msg="" package=1 cores=2 efficiency=0 threads=2
time=2025-04-11T12:11:05.603+02:00 level=INFO source=gpu_windows.go:214 msg="" package=2 cores=2 efficiency=0 threads=2
time=2025-04-11T12:11:05.603+02:00 level=INFO source=gpu_windows.go:214 msg="" package=3 cores=2 efficiency=0 threads=2
time=2025-04-11T12:11:06.304+02:00 level=INFO source=gpu.go:319 msg="detected OS VRAM overhead" id=GPU-cb794310-3913-11b2-bd79-c22e9e69f4a7 library=cuda compute=8.6 driver=11.4 name="NVIDIA A40-48Q" overhead="774.6 MiB"
time=2025-04-11T12:11:06.573+02:00 level=INFO source=gpu.go:319 msg="detected OS VRAM overhead" id=GPU-ce366d7e-3913-11b2-9e88-48d7ce79eeca library=cuda compute=8.6 driver=11.4 name="NVIDIA A40-48Q" overhead="1000.6 MiB"
time=2025-04-11T12:11:06.577+02:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-cb794310-3913-11b2-bd79-c22e9e69f4a7 library=cuda variant=v11 compute=8.6 driver=11.4 name="NVIDIA A40-48Q" total="48.0 GiB" available="42.9 GiB"
time=2025-04-11T12:11:06.577+02:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-ce366d7e-3913-11b2-9e88-48d7ce79eeca library=cuda variant=v11 compute=8.6 driver=11.4 name="NVIDIA A40-48Q" total="48.0 GiB" available="42.9 GiB"
time=2025-04-11T12:12:45.120+02:00 level=INFO source=server.go:105 msg="system memory" total="256.0 GiB" free="250.6 GiB" free_swap="285.1 GiB"
time=2025-04-11T12:12:45.123+02:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=999 layers.model=63 layers.offload=63 layers.split=32,31 memory.available="[43.0 GiB 42.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="78.0 GiB" memory.required.partial="78.0 GiB" memory.required.kv="10.1 GiB" memory.required.allocations="[37.7 GiB 40.3 GiB]" memory.weights.total="50.3 GiB" memory.weights.repeating="47.7 GiB" memory.weights.nonrepeating="2.6 GiB" memory.graph.full="6.6 GiB" memory.graph.partial="6.6 GiB" projector.weights="795.9 MiB" projector.graph="1.0 GiB"
time=2025-04-11T12:12:45.279+02:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
time=2025-04-11T12:12:45.290+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
time=2025-04-11T12:12:45.291+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
time=2025-04-11T12:12:45.291+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
time=2025-04-11T12:12:45.291+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
time=2025-04-11T12:12:45.291+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
time=2025-04-11T12:12:45.303+02:00 level=INFO source=server.go:405 msg="starting llama server" cmd="C:\\Users\\devops01\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model C:\\Users\\devops01\\.ollama\\models\\blobs\\sha256-07ca3450446e07c4e3dfd55d34e3f426963a15f1db00c3093d9214c202d12e25 --ctx-size 98304 --batch-size 512 --n-gpu-layers 999 --threads 8 --no-mmap --parallel 6 --tensor-split 32,31 --port 49840"
time=2025-04-11T12:12:45.364+02:00 level=INFO source=sched.go:451 msg="loaded runners" count=1
time=2025-04-11T12:12:45.365+02:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
time=2025-04-11T12:12:45.369+02:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
time=2025-04-11T12:12:45.397+02:00 level=INFO source=runner.go:821 msg="starting ollama engine"
time=2025-04-11T12:12:45.403+02:00 level=INFO source=runner.go:884 msg="Server listening on 127.0.0.1:49840"
time=2025-04-11T12:12:45.554+02:00 level=WARN source=ggml.go:149 msg="key not found" key=general.name default=""
time=2025-04-11T12:12:45.554+02:00 level=WARN source=ggml.go:149 msg="key not found" key=general.description default=""
time=2025-04-11T12:12:45.555+02:00 level=INFO source=ggml.go:66 msg="" architecture=gemma3 file_type=F16 name="" description="" num_tensors=1247 num_key_values=37
time=2025-04-11T12:12:45.624+02:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA A40-48Q, compute capability 8.6, VMM: no
  Device 1: NVIDIA A40-48Q, compute capability 8.6, VMM: no
load_backend: loaded CUDA backend from C:\Users\devops01\AppData\Local\Programs\Ollama\lib\ollama\cuda_v11\ggml-cuda.dll
load_backend: loaded CPU backend from C:\Users\devops01\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
time=2025-04-11T12:12:46.925+02:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-04-11T12:12:52.226+02:00 level=INFO source=ggml.go:288 msg="model weights" buffer=CPU size="2.6 GiB"
time=2025-04-11T12:12:52.227+02:00 level=INFO source=ggml.go:288 msg="model weights" buffer=CUDA0 size="24.6 GiB"
time=2025-04-11T12:12:52.227+02:00 level=INFO source=ggml.go:288 msg="model weights" buffer=CUDA1 size="26.5 GiB"
time=2025-04-11T12:23:56.744+02:00 level=INFO source=ggml.go:380 msg="compute graph" backend=CUDA0 buffer_type=CUDA0
time=2025-04-11T12:23:56.744+02:00 level=INFO source=ggml.go:380 msg="compute graph" backend=CUDA1 buffer_type=CUDA1
time=2025-04-11T12:23:56.744+02:00 level=INFO source=ggml.go:380 msg="compute graph" backend=CPU buffer_type=CUDA_Host
time=2025-04-11T12:23:56.756+02:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
time=2025-04-11T12:23:56.777+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
time=2025-04-11T12:23:56.777+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
time=2025-04-11T12:23:56.777+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
time=2025-04-11T12:23:56.777+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
time=2025-04-11T12:23:56.777+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
time=2025-04-11T12:23:56.859+02:00 level=INFO source=server.go:619 msg="llama runner started in 671.49 seconds"
[GIN] 2025/04/11 - 12:24:16 | 200 |        11m31s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:24:17 | 200 |        11m33s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:24:20 | 200 |        11m35s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:24:24 | 200 |        11m39s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:24:27 | 200 |        11m42s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:24:34 | 200 |   10.4845703s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:24:35 | 200 |    7.6424863s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:24:40 | 200 |    23.947212s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:24:45 | 200 |         12m0s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:24:47 | 200 |    12.391939s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:24:49 | 200 |   14.5998317s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:24:49 | 200 |   29.4635917s |    10.112.4.104 | POST     "/api/chat"
ggml.c:1584: GGML_ASSERT(view_src == NULL || data_size == 0 || data_size + view_offs <= ggml_nbytes(view_src)) failed
[GIN] 2025/04/11 - 12:24:51 | 200 |    3.7318963s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:24:51 | 200 |    6.3232277s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:24:51 | 200 |    1.8785323s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:24:51 | 200 |    2.0270088s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:24:51 | 200 |   33.7110512s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:24:51 | 200 |   11.4037942s |    10.112.4.104 | POST     "/api/chat"
time=2025-04-11T12:24:51.763+02:00 level=ERROR source=server.go:449 msg="llama runner terminated" error="exit status 0xc0000409"
time=2025-04-11T12:24:56.583+02:00 level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.0324498 model=C:\Users\devops01\.ollama\models\blobs\sha256-07ca3450446e07c4e3dfd55d34e3f426963a15f1db00c3093d9214c202d12e25
time=2025-04-11T12:24:56.797+02:00 level=INFO source=server.go:105 msg="system memory" total="256.0 GiB" free="250.2 GiB" free_swap="284.6 GiB"
time=2025-04-11T12:24:56.800+02:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=999 layers.model=63 layers.offload=63 layers.split=32,31 memory.available="[43.0 GiB 42.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="78.0 GiB" memory.required.partial="78.0 GiB" memory.required.kv="10.1 GiB" memory.required.allocations="[37.7 GiB 40.3 GiB]" memory.weights.total="50.3 GiB" memory.weights.repeating="47.7 GiB" memory.weights.nonrepeating="2.6 GiB" memory.graph.full="6.6 GiB" memory.graph.partial="6.6 GiB" projector.weights="795.9 MiB" projector.graph="1.0 GiB"
time=2025-04-11T12:24:56.833+02:00 level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.2824301 model=C:\Users\devops01\.ollama\models\blobs\sha256-07ca3450446e07c4e3dfd55d34e3f426963a15f1db00c3093d9214c202d12e25
time=2025-04-11T12:24:56.956+02:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
time=2025-04-11T12:24:56.965+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
time=2025-04-11T12:24:56.965+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
time=2025-04-11T12:24:56.965+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
time=2025-04-11T12:24:56.965+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
time=2025-04-11T12:24:56.965+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
time=2025-04-11T12:24:56.967+02:00 level=INFO source=server.go:405 msg="starting llama server" cmd="C:\\Users\\devops01\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model C:\\Users\\devops01\\.ollama\\models\\blobs\\sha256-07ca3450446e07c4e3dfd55d34e3f426963a15f1db00c3093d9214c202d12e25 --ctx-size 98304 --batch-size 512 --n-gpu-layers 999 --threads 8 --no-mmap --parallel 6 --tensor-split 32,31 --port 49854"
time=2025-04-11T12:24:56.973+02:00 level=INFO source=sched.go:451 msg="loaded runners" count=1
time=2025-04-11T12:24:56.973+02:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
time=2025-04-11T12:24:56.974+02:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
time=2025-04-11T12:24:57.007+02:00 level=INFO source=runner.go:821 msg="starting ollama engine"
time=2025-04-11T12:24:57.008+02:00 level=INFO source=runner.go:884 msg="Server listening on 127.0.0.1:49854"
time=2025-04-11T12:24:57.085+02:00 level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.5345044 model=C:\Users\devops01\.ollama\models\blobs\sha256-07ca3450446e07c4e3dfd55d34e3f426963a15f1db00c3093d9214c202d12e25
time=2025-04-11T12:24:57.169+02:00 level=WARN source=ggml.go:149 msg="key not found" key=general.name default=""
time=2025-04-11T12:24:57.169+02:00 level=WARN source=ggml.go:149 msg="key not found" key=general.description default=""
time=2025-04-11T12:24:57.169+02:00 level=INFO source=ggml.go:66 msg="" architecture=gemma3 file_type=F16 name="" description="" num_tensors=1247 num_key_values=37
time=2025-04-11T12:24:57.226+02:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA A40-48Q, compute capability 8.6, VMM: no
  Device 1: NVIDIA A40-48Q, compute capability 8.6, VMM: no
load_backend: loaded CUDA backend from C:\Users\devops01\AppData\Local\Programs\Ollama\lib\ollama\cuda_v11\ggml-cuda.dll
load_backend: loaded CPU backend from C:\Users\devops01\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
time=2025-04-11T12:24:57.307+02:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-04-11T12:24:58.354+02:00 level=INFO source=ggml.go:288 msg="model weights" buffer=CUDA0 size="24.6 GiB"
time=2025-04-11T12:24:58.354+02:00 level=INFO source=ggml.go:288 msg="model weights" buffer=CUDA1 size="26.5 GiB"
time=2025-04-11T12:24:58.354+02:00 level=INFO source=ggml.go:288 msg="model weights" buffer=CPU size="2.6 GiB"
time=2025-04-11T12:25:31.071+02:00 level=INFO source=ggml.go:380 msg="compute graph" backend=CUDA0 buffer_type=CUDA0
time=2025-04-11T12:25:31.071+02:00 level=INFO source=ggml.go:380 msg="compute graph" backend=CUDA1 buffer_type=CUDA1
time=2025-04-11T12:25:31.071+02:00 level=INFO source=ggml.go:380 msg="compute graph" backend=CPU buffer_type=CUDA_Host
time=2025-04-11T12:25:31.074+02:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
time=2025-04-11T12:25:31.081+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
time=2025-04-11T12:25:31.081+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
time=2025-04-11T12:25:31.081+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
time=2025-04-11T12:25:31.081+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
time=2025-04-11T12:25:31.081+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
time=2025-04-11T12:25:31.184+02:00 level=INFO source=server.go:619 msg="llama runner started in 34.21 seconds"
[GIN] 2025/04/11 - 12:25:44 | 200 |   53.1332559s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:25:50 | 200 |   59.3116781s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:25:53 | 200 |    8.8200168s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:25:55 | 200 |          1m4s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:26:00 | 200 |    9.4612141s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:26:00 | 200 |          1m9s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:26:03 | 200 |   10.3084955s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:26:08 | 200 |         1m17s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:26:08 | 200 |   13.2767995s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:26:11 | 200 |         1m20s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:26:19 | 200 |    7.4841954s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:26:26 | 200 |   22.2060527s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:26:31 | 200 |   30.6599233s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:26:33 | 200 |   24.7328874s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:26:35 | 200 |    9.8252355s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:26:46 | 200 |   27.5309845s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:26:49 | 200 |   49.3753811s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:26:52 | 200 |     19.02995s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:27:00 | 200 |   51.2895701s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:27:06 | 200 |   21.9946095s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:27:09 | 200 |   19.9151862s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:27:20 | 200 |   44.4700658s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:27:29 | 200 |   28.6696906s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:27:34 | 200 |          1m2s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:27:37 | 200 |   44.7015548s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:27:45 | 200 |   35.6019499s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:27:50 | 200 |   43.4674596s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:28:04 | 200 |   19.3394946s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:28:05 | 200 |   30.9092248s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:28:11 | 200 |   42.2922658s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:28:11 | 200 |   34.3488465s |    10.112.4.104 | POST     "/api/chat"
ggml.c:1584: GGML_ASSERT(view_src == NULL || data_size == 0 || data_size + view_offs <= ggml_nbytes(view_src)) failed
[GIN] 2025/04/11 - 12:28:13 | 200 |    1.7373487s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:28:13 | 200 |    8.1296582s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:28:13 | 200 |    8.5396397s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:28:13 | 200 |   54.7406108s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:28:13 | 200 |    1.8438771s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:28:13 | 200 |    22.835188s |    10.112.4.104 | POST     "/api/chat"
time=2025-04-11T12:28:13.637+02:00 level=ERROR source=server.go:449 msg="llama runner terminated" error="exit status 0xc0000409"
time=2025-04-11T12:28:18.324+02:00 level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.0308059 model=C:\Users\devops01\.ollama\models\blobs\sha256-07ca3450446e07c4e3dfd55d34e3f426963a15f1db00c3093d9214c202d12e25
time=2025-04-11T12:28:18.532+02:00 level=INFO source=server.go:105 msg="system memory" total="256.0 GiB" free="250.1 GiB" free_swap="284.7 GiB"
time=2025-04-11T12:28:18.535+02:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=999 layers.model=63 layers.offload=63 layers.split=32,31 memory.available="[43.0 GiB 42.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="78.0 GiB" memory.required.partial="78.0 GiB" memory.required.kv="10.1 GiB" memory.required.allocations="[37.7 GiB 40.3 GiB]" memory.weights.total="50.3 GiB" memory.weights.repeating="47.7 GiB" memory.weights.nonrepeating="2.6 GiB" memory.graph.full="6.6 GiB" memory.graph.partial="6.6 GiB" projector.weights="795.9 MiB" projector.graph="1.0 GiB"
time=2025-04-11T12:28:18.574+02:00 level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.2812684 model=C:\Users\devops01\.ollama\models\blobs\sha256-07ca3450446e07c4e3dfd55d34e3f426963a15f1db00c3093d9214c202d12e25
time=2025-04-11T12:28:18.683+02:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
time=2025-04-11T12:28:18.696+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
time=2025-04-11T12:28:18.696+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
time=2025-04-11T12:28:18.696+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
time=2025-04-11T12:28:18.696+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
time=2025-04-11T12:28:18.696+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
time=2025-04-11T12:28:18.698+02:00 level=INFO source=server.go:405 msg="starting llama server" cmd="C:\\Users\\devops01\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model C:\\Users\\devops01\\.ollama\\models\\blobs\\sha256-07ca3450446e07c4e3dfd55d34e3f426963a15f1db00c3093d9214c202d12e25 --ctx-size 98304 --batch-size 512 --n-gpu-layers 999 --threads 8 --no-mmap --parallel 6 --tensor-split 32,31 --port 49866"
time=2025-04-11T12:28:18.706+02:00 level=INFO source=sched.go:451 msg="loaded runners" count=1
time=2025-04-11T12:28:18.706+02:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
time=2025-04-11T12:28:18.706+02:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
time=2025-04-11T12:28:18.736+02:00 level=INFO source=runner.go:821 msg="starting ollama engine"
time=2025-04-11T12:28:18.738+02:00 level=INFO source=runner.go:884 msg="Server listening on 127.0.0.1:49866"
time=2025-04-11T12:28:18.824+02:00 level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.5313371 model=C:\Users\devops01\.ollama\models\blobs\sha256-07ca3450446e07c4e3dfd55d34e3f426963a15f1db00c3093d9214c202d12e25
time=2025-04-11T12:28:18.885+02:00 level=WARN source=ggml.go:149 msg="key not found" key=general.name default=""
time=2025-04-11T12:28:18.885+02:00 level=WARN source=ggml.go:149 msg="key not found" key=general.description default=""
time=2025-04-11T12:28:18.885+02:00 level=INFO source=ggml.go:66 msg="" architecture=gemma3 file_type=F16 name="" description="" num_tensors=1247 num_key_values=37
time=2025-04-11T12:28:18.958+02:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA A40-48Q, compute capability 8.6, VMM: no
  Device 1: NVIDIA A40-48Q, compute capability 8.6, VMM: no
load_backend: loaded CUDA backend from C:\Users\devops01\AppData\Local\Programs\Ollama\lib\ollama\cuda_v11\ggml-cuda.dll
load_backend: loaded CPU backend from C:\Users\devops01\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
time=2025-04-11T12:28:19.012+02:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-04-11T12:28:20.086+02:00 level=INFO source=ggml.go:288 msg="model weights" buffer=CUDA0 size="24.6 GiB"
time=2025-04-11T12:28:20.086+02:00 level=INFO source=ggml.go:288 msg="model weights" buffer=CUDA1 size="26.5 GiB"
time=2025-04-11T12:28:20.086+02:00 level=INFO source=ggml.go:288 msg="model weights" buffer=CPU size="2.6 GiB"
time=2025-04-11T12:28:51.790+02:00 level=INFO source=ggml.go:380 msg="compute graph" backend=CUDA0 buffer_type=CUDA0
time=2025-04-11T12:28:51.790+02:00 level=INFO source=ggml.go:380 msg="compute graph" backend=CUDA1 buffer_type=CUDA1
time=2025-04-11T12:28:51.790+02:00 level=INFO source=ggml.go:380 msg="compute graph" backend=CPU buffer_type=CUDA_Host
time=2025-04-11T12:28:51.794+02:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
time=2025-04-11T12:28:51.804+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
time=2025-04-11T12:28:51.804+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
time=2025-04-11T12:28:51.804+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
time=2025-04-11T12:28:51.804+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
time=2025-04-11T12:28:51.804+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
time=2025-04-11T12:28:51.920+02:00 level=INFO source=server.go:619 msg="llama runner started in 33.21 seconds"
[GIN] 2025/04/11 - 12:29:16 | 200 |          1m3s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:29:17 | 200 |          1m3s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:29:20 | 200 |          1m7s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:29:21 | 200 |          1m8s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:29:31 | 200 |         1m18s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:29:46 | 200 |   29.3217937s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:29:48 | 200 |         1m35s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:29:53 | 200 |    24.305176s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:29:55 | 200 |   38.9044245s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:29:55 | 200 |    35.622749s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:30:07 | 200 |   47.8223514s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:30:10 | 200 |   24.0901157s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:30:15 | 200 |   26.5035104s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:30:19 | 200 |   25.3068109s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:30:28 | 200 |   17.7198365s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:30:36 | 200 |   17.0471386s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:30:42 | 200 |   46.2424021s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:30:46 | 200 |   52.7859945s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:30:50 | 200 |   37.2662657s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:30:53 | 200 |   46.0491668s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:31:07 | 200 |   25.0693838s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:31:12 | 200 |   25.8484002s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:31:15 | 200 |   21.6213194s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:31:20 | 200 |   44.0849389s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:31:24 | 200 |   56.6358737s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:31:30 | 200 |   25.5136271s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:31:34 | 200 |   46.2709158s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:31:42 | 200 |   27.6361706s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:31:52 | 200 |   27.5678475s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:31:58 | 200 |   27.5641811s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:31:59 | 200 |   19.0255138s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:32:14 | 200 |   54.4350914s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:32:21 | 200 |   47.1807956s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:32:24 | 200 |         1m12s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:32:31 | 200 |   40.5839697s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:32:34 | 200 |   36.1990228s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:32:53 | 200 |   53.1846597s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:32:58 | 200 |   33.2789358s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:33:02 | 200 |   48.0004041s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:33:10 | 200 |   48.7998456s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:33:15 | 200 |   22.2831035s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:33:18 | 200 |   47.1468018s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:33:34 | 200 |   18.6788824s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:33:43 | 200 |   40.6046839s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:33:50 | 200 |         1m15s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:33:53 | 200 |   10.0716928s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:33:57 | 200 |            0s |       127.0.0.1 | GET      "/api/version"
[GIN] 2025/04/11 - 12:33:59 | 200 |   25.3622131s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:34:01 | 200 |   11.4095702s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:34:07 | 200 |    5.6965056s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:34:08 | 200 |   50.2529789s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:34:13 | 200 |    6.3515579s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:34:13 | 200 |        18.2µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/04/11 - 12:34:13 | 200 |       549.5µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/04/11 - 12:34:18 | 200 |   24.7910725s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:34:20 | 200 |   11.2859414s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:34:24 | 200 |         1m25s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:34:24 | 200 |         1m13s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:34:29 | 200 |   15.6519589s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:34:30 | 200 |   10.6482381s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:34:34 | 200 |   15.6839368s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:34:37 | 200 |   13.0729204s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:34:40 | 200 |    9.8388866s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:34:46 | 200 |   22.4742719s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:34:47 | 200 |   47.8380304s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:34:56 | 200 |   15.7782414s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:34:56 | 200 |    9.8823511s |    10.112.4.104 | POST     "/api/chat"
ggml.c:1584: GGML_ASSERT(view_src == NULL || data_size == 0 || data_size + view_offs <= ggml_nbytes(view_src)) failed
[GIN] 2025/04/11 - 12:34:58 | 200 |    2.0091783s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:34:58 | 200 |    13.041289s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:34:58 | 200 |    2.0050585s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:34:58 | 200 |   21.0113455s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:34:58 | 200 |    24.364639s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:34:58 | 200 |   31.1420179s |    10.112.4.104 | POST     "/api/chat"
time=2025-04-11T12:34:59.093+02:00 level=ERROR source=server.go:449 msg="llama runner terminated" error="exit status 0xc0000409"
time=2025-04-11T12:35:03.559+02:00 level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.0334434 model=C:\Users\devops01\.ollama\models\blobs\sha256-07ca3450446e07c4e3dfd55d34e3f426963a15f1db00c3093d9214c202d12e25
time=2025-04-11T12:35:03.766+02:00 level=INFO source=server.go:105 msg="system memory" total="256.0 GiB" free="250.1 GiB" free_swap="284.7 GiB"
time=2025-04-11T12:35:03.773+02:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=999 layers.model=63 layers.offload=63 layers.split=32,31 memory.available="[43.0 GiB 42.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="78.0 GiB" memory.required.partial="78.0 GiB" memory.required.kv="10.1 GiB" memory.required.allocations="[37.7 GiB 40.3 GiB]" memory.weights.total="50.3 GiB" memory.weights.repeating="47.7 GiB" memory.weights.nonrepeating="2.6 GiB" memory.graph.full="6.6 GiB" memory.graph.partial="6.6 GiB" projector.weights="795.9 MiB" projector.graph="1.0 GiB"
time=2025-04-11T12:35:03.809+02:00 level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.2836285 model=C:\Users\devops01\.ollama\models\blobs\sha256-07ca3450446e07c4e3dfd55d34e3f426963a15f1db00c3093d9214c202d12e25
time=2025-04-11T12:35:03.903+02:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
time=2025-04-11T12:35:03.914+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
time=2025-04-11T12:35:03.914+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
time=2025-04-11T12:35:03.914+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
time=2025-04-11T12:35:03.914+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
time=2025-04-11T12:35:03.914+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
time=2025-04-11T12:35:03.916+02:00 level=INFO source=server.go:405 msg="starting llama server" cmd="C:\\Users\\devops01\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model C:\\Users\\devops01\\.ollama\\models\\blobs\\sha256-07ca3450446e07c4e3dfd55d34e3f426963a15f1db00c3093d9214c202d12e25 --ctx-size 98304 --batch-size 512 --n-gpu-layers 999 --threads 8 --no-mmap --parallel 6 --tensor-split 32,31 --port 49887"
time=2025-04-11T12:35:03.921+02:00 level=INFO source=sched.go:451 msg="loaded runners" count=1
time=2025-04-11T12:35:03.922+02:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
time=2025-04-11T12:35:03.922+02:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
time=2025-04-11T12:35:03.949+02:00 level=INFO source=runner.go:821 msg="starting ollama engine"
time=2025-04-11T12:35:03.951+02:00 level=INFO source=runner.go:884 msg="Server listening on 127.0.0.1:49887"
time=2025-04-11T12:35:04.059+02:00 level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.5331634 model=C:\Users\devops01\.ollama\models\blobs\sha256-07ca3450446e07c4e3dfd55d34e3f426963a15f1db00c3093d9214c202d12e25
time=2025-04-11T12:35:04.106+02:00 level=WARN source=ggml.go:149 msg="key not found" key=general.name default=""
time=2025-04-11T12:35:04.106+02:00 level=WARN source=ggml.go:149 msg="key not found" key=general.description default=""
time=2025-04-11T12:35:04.106+02:00 level=INFO source=ggml.go:66 msg="" architecture=gemma3 file_type=F16 name="" description="" num_tensors=1247 num_key_values=37
time=2025-04-11T12:35:04.175+02:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA A40-48Q, compute capability 8.6, VMM: no
  Device 1: NVIDIA A40-48Q, compute capability 8.6, VMM: no
load_backend: loaded CUDA backend from C:\Users\devops01\AppData\Local\Programs\Ollama\lib\ollama\cuda_v11\ggml-cuda.dll
load_backend: loaded CPU backend from C:\Users\devops01\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
time=2025-04-11T12:35:04.254+02:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-04-11T12:35:05.336+02:00 level=INFO source=ggml.go:288 msg="model weights" buffer=CUDA0 size="24.6 GiB"
time=2025-04-11T12:35:05.336+02:00 level=INFO source=ggml.go:288 msg="model weights" buffer=CUDA1 size="26.5 GiB"
time=2025-04-11T12:35:05.336+02:00 level=INFO source=ggml.go:288 msg="model weights" buffer=CPU size="2.6 GiB"
time=2025-04-11T12:35:36.278+02:00 level=INFO source=ggml.go:380 msg="compute graph" backend=CUDA0 buffer_type=CUDA0
time=2025-04-11T12:35:36.279+02:00 level=INFO source=ggml.go:380 msg="compute graph" backend=CUDA1 buffer_type=CUDA1
time=2025-04-11T12:35:36.279+02:00 level=INFO source=ggml.go:380 msg="compute graph" backend=CPU buffer_type=CUDA_Host
time=2025-04-11T12:35:36.282+02:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
time=2025-04-11T12:35:36.290+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
time=2025-04-11T12:35:36.290+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
time=2025-04-11T12:35:36.290+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
time=2025-04-11T12:35:36.290+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
time=2025-04-11T12:35:36.290+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
time=2025-04-11T12:35:36.490+02:00 level=INFO source=server.go:619 msg="llama runner started in 32.57 seconds"
[GIN] 2025/04/11 - 12:35:50 | 200 |   51.9087955s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:35:52 | 200 |   53.8742062s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:35:56 | 200 |   58.3981315s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:35:59 | 200 |    9.4255731s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:36:09 | 200 |         1m11s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:36:12 | 200 |         1m13s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:36:15 | 200 |    22.754484s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:36:20 | 200 |         1m21s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:36:21 | 200 |   21.7467929s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:36:21 | 200 |   12.0151343s |    10.112.4.104 | POST     "/api/chat"
ggml.c:1584: GGML_ASSERT(view_src == NULL || data_size == 0 || data_size + view_offs <= ggml_nbytes(view_src)) failed
[GIN] 2025/04/11 - 12:36:23 | 200 |    8.3181135s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:36:23 | 200 |    1.6745798s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:36:23 | 200 |    1.8272246s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:36:23 | 200 |   26.5833756s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:36:23 | 200 |   11.3937013s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 12:36:23 | 200 |    3.4645853s |    10.112.4.104 | POST     "/api/chat"
time=2025-04-11T12:36:23.817+02:00 level=ERROR source=server.go:449 msg="llama runner terminated" error="exit status 0xc0000409"
time=2025-04-11T12:36:28.622+02:00 level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.0296166 model=C:\Users\devops01\.ollama\models\blobs\sha256-07ca3450446e07c4e3dfd55d34e3f426963a15f1db00c3093d9214c202d12e25
time=2025-04-11T12:36:28.835+02:00 level=INFO source=server.go:105 msg="system memory" total="256.0 GiB" free="250.1 GiB" free_swap="284.7 GiB"
time=2025-04-11T12:36:28.837+02:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=999 layers.model=63 layers.offload=63 layers.split=32,31 memory.available="[43.0 GiB 42.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="78.0 GiB" memory.required.partial="78.0 GiB" memory.required.kv="10.1 GiB" memory.required.allocations="[37.7 GiB 40.3 GiB]" memory.weights.total="50.3 GiB" memory.weights.repeating="47.7 GiB" memory.weights.nonrepeating="2.6 GiB" memory.graph.full="6.6 GiB" memory.graph.partial="6.6 GiB" projector.weights="795.9 MiB" projector.graph="1.0 GiB"
time=2025-04-11T12:36:28.873+02:00 level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.2801325 model=C:\Users\devops01\.ollama\models\blobs\sha256-07ca3450446e07c4e3dfd55d34e3f426963a15f1db00c3093d9214c202d12e25
time=2025-04-11T12:36:28.972+02:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
time=2025-04-11T12:36:28.989+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
time=2025-04-11T12:36:28.989+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
time=2025-04-11T12:36:28.989+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
time=2025-04-11T12:36:28.989+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
time=2025-04-11T12:36:28.989+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
time=2025-04-11T12:36:28.990+02:00 level=INFO source=server.go:405 msg="starting llama server" cmd="C:\\Users\\devops01\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model C:\\Users\\devops01\\.ollama\\models\\blobs\\sha256-07ca3450446e07c4e3dfd55d34e3f426963a15f1db00c3093d9214c202d12e25 --ctx-size 98304 --batch-size 512 --n-gpu-layers 999 --threads 8 --no-mmap --parallel 6 --tensor-split 32,31 --port 49899"
time=2025-04-11T12:36:28.998+02:00 level=INFO source=sched.go:451 msg="loaded runners" count=1
time=2025-04-11T12:36:28.998+02:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
time=2025-04-11T12:36:29.000+02:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
time=2025-04-11T12:36:29.024+02:00 level=INFO source=runner.go:821 msg="starting ollama engine"
time=2025-04-11T12:36:29.025+02:00 level=INFO source=runner.go:884 msg="Server listening on 127.0.0.1:49899"
time=2025-04-11T12:36:29.127+02:00 level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.5348366 model=C:\Users\devops01\.ollama\models\blobs\sha256-07ca3450446e07c4e3dfd55d34e3f426963a15f1db00c3093d9214c202d12e25
time=2025-04-11T12:36:29.157+02:00 level=WARN source=ggml.go:149 msg="key not found" key=general.name default=""
time=2025-04-11T12:36:29.157+02:00 level=WARN source=ggml.go:149 msg="key not found" key=general.description default=""
time=2025-04-11T12:36:29.157+02:00 level=INFO source=ggml.go:66 msg="" architecture=gemma3 file_type=F16 name="" description="" num_tensors=1247 num_key_values=37
time=2025-04-11T12:36:29.252+02:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA A40-48Q, compute capability 8.6, VMM: no
  Device 1: NVIDIA A40-48Q, compute capability 8.6, VMM: no
load_backend: loaded CUDA backend from C:\Users\devops01\AppData\Local\Programs\Ollama\lib\ollama\cuda_v11\ggml-cuda.dll
load_backend: loaded CPU backend from C:\Users\devops01\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
time=2025-04-11T12:36:29.303+02:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-04-11T12:36:30.377+02:00 level=INFO source=ggml.go:288 msg="model weights" buffer=CUDA0 size="24.6 GiB"
time=2025-04-11T12:36:30.377+02:00 level=INFO source=ggml.go:288 msg="model weights" buffer=CUDA1 size="26.5 GiB"
time=2025-04-11T12:36:30.377+02:00 level=INFO source=ggml.go:288 msg="model weights" buffer=CPU size="2.6 GiB"

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.6.4

default=1 time=2025-04-11T12:25:31.081+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256 time=2025-04-11T12:25:31.184+02:00 level=INFO source=server.go:619 msg="llama runner started in 34.21 seconds" [GIN] 2025/04/11 - 12:25:44 | 200 | 53.1332559s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:25:50 | 200 | 59.3116781s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:25:53 | 200 | 8.8200168s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:25:55 | 200 | 1m4s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:26:00 | 200 | 9.4612141s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:26:00 | 200 | 1m9s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:26:03 | 200 | 10.3084955s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:26:08 | 200 | 1m17s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:26:08 | 200 | 13.2767995s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:26:11 | 200 | 1m20s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:26:19 | 200 | 7.4841954s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:26:26 | 200 | 22.2060527s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:26:31 | 200 | 30.6599233s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:26:33 | 200 | 24.7328874s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:26:35 | 200 | 9.8252355s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:26:46 | 200 | 27.5309845s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:26:49 | 200 | 49.3753811s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:26:52 | 200 | 19.02995s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:27:00 | 200 | 51.2895701s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:27:06 | 200 | 21.9946095s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:27:09 | 200 | 19.9151862s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:27:20 | 200 | 44.4700658s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:27:29 | 200 | 28.6696906s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:27:34 | 200 | 1m2s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:27:37 | 200 | 44.7015548s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:27:45 | 200 | 35.6019499s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:27:50 | 200 | 43.4674596s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:28:04 | 200 | 19.3394946s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:28:05 | 200 | 30.9092248s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:28:11 | 200 | 42.2922658s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:28:11 | 200 | 34.3488465s | 10.112.4.104 | POST "/api/chat" ggml.c:1584: GGML_ASSERT(view_src == NULL || data_size == 0 || data_size + view_offs <= ggml_nbytes(view_src)) failed [GIN] 2025/04/11 - 12:28:13 | 200 | 1.7373487s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:28:13 | 200 | 8.1296582s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:28:13 | 200 | 8.5396397s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:28:13 | 200 | 54.7406108s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:28:13 | 200 | 1.8438771s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:28:13 | 200 | 22.835188s | 10.112.4.104 | POST "/api/chat" time=2025-04-11T12:28:13.637+02:00 level=ERROR source=server.go:449 msg="llama runner terminated" error="exit status 0xc0000409" time=2025-04-11T12:28:18.324+02:00 level=WARN source=sched.go:648 
msg="gpu VRAM usage didn't recover within timeout" seconds=5.0308059 model=C:\Users\devops01\.ollama\models\blobs\sha256-07ca3450446e07c4e3dfd55d34e3f426963a15f1db00c3093d9214c202d12e25 time=2025-04-11T12:28:18.532+02:00 level=INFO source=server.go:105 msg="system memory" total="256.0 GiB" free="250.1 GiB" free_swap="284.7 GiB" time=2025-04-11T12:28:18.535+02:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=999 layers.model=63 layers.offload=63 layers.split=32,31 memory.available="[43.0 GiB 42.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="78.0 GiB" memory.required.partial="78.0 GiB" memory.required.kv="10.1 GiB" memory.required.allocations="[37.7 GiB 40.3 GiB]" memory.weights.total="50.3 GiB" memory.weights.repeating="47.7 GiB" memory.weights.nonrepeating="2.6 GiB" memory.graph.full="6.6 GiB" memory.graph.partial="6.6 GiB" projector.weights="795.9 MiB" projector.graph="1.0 GiB" time=2025-04-11T12:28:18.574+02:00 level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.2812684 model=C:\Users\devops01\.ollama\models\blobs\sha256-07ca3450446e07c4e3dfd55d34e3f426963a15f1db00c3093d9214c202d12e25 time=2025-04-11T12:28:18.683+02:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false time=2025-04-11T12:28:18.696+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07 time=2025-04-11T12:28:18.696+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000 time=2025-04-11T12:28:18.696+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06 time=2025-04-11T12:28:18.696+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1 time=2025-04-11T12:28:18.696+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256 time=2025-04-11T12:28:18.698+02:00 level=INFO source=server.go:405 msg="starting llama server" cmd="C:\\Users\\devops01\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model C:\\Users\\devops01\\.ollama\\models\\blobs\\sha256-07ca3450446e07c4e3dfd55d34e3f426963a15f1db00c3093d9214c202d12e25 --ctx-size 98304 --batch-size 512 --n-gpu-layers 999 --threads 8 --no-mmap --parallel 6 --tensor-split 32,31 --port 49866" time=2025-04-11T12:28:18.706+02:00 level=INFO source=sched.go:451 msg="loaded runners" count=1 time=2025-04-11T12:28:18.706+02:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding" time=2025-04-11T12:28:18.706+02:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error" time=2025-04-11T12:28:18.736+02:00 level=INFO source=runner.go:821 msg="starting ollama engine" time=2025-04-11T12:28:18.738+02:00 level=INFO source=runner.go:884 msg="Server listening on 127.0.0.1:49866" time=2025-04-11T12:28:18.824+02:00 level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.5313371 model=C:\Users\devops01\.ollama\models\blobs\sha256-07ca3450446e07c4e3dfd55d34e3f426963a15f1db00c3093d9214c202d12e25 time=2025-04-11T12:28:18.885+02:00 level=WARN source=ggml.go:149 msg="key not found" key=general.name default="" time=2025-04-11T12:28:18.885+02:00 level=WARN source=ggml.go:149 msg="key not found" key=general.description default="" time=2025-04-11T12:28:18.885+02:00 level=INFO source=ggml.go:66 msg="" architecture=gemma3 
file_type=F16 name="" description="" num_tensors=1247 num_key_values=37 time=2025-04-11T12:28:18.958+02:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model" ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 2 CUDA devices: Device 0: NVIDIA A40-48Q, compute capability 8.6, VMM: no Device 1: NVIDIA A40-48Q, compute capability 8.6, VMM: no load_backend: loaded CUDA backend from C:\Users\devops01\AppData\Local\Programs\Ollama\lib\ollama\cuda_v11\ggml-cuda.dll load_backend: loaded CPU backend from C:\Users\devops01\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll time=2025-04-11T12:28:19.012+02:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang) time=2025-04-11T12:28:20.086+02:00 level=INFO source=ggml.go:288 msg="model weights" buffer=CUDA0 size="24.6 GiB" time=2025-04-11T12:28:20.086+02:00 level=INFO source=ggml.go:288 msg="model weights" buffer=CUDA1 size="26.5 GiB" time=2025-04-11T12:28:20.086+02:00 level=INFO source=ggml.go:288 msg="model weights" buffer=CPU size="2.6 GiB" time=2025-04-11T12:28:51.790+02:00 level=INFO source=ggml.go:380 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 time=2025-04-11T12:28:51.790+02:00 level=INFO source=ggml.go:380 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 time=2025-04-11T12:28:51.790+02:00 level=INFO source=ggml.go:380 msg="compute graph" backend=CPU buffer_type=CUDA_Host time=2025-04-11T12:28:51.794+02:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false time=2025-04-11T12:28:51.804+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07 time=2025-04-11T12:28:51.804+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000 time=2025-04-11T12:28:51.804+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06 time=2025-04-11T12:28:51.804+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1 time=2025-04-11T12:28:51.804+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256 time=2025-04-11T12:28:51.920+02:00 level=INFO source=server.go:619 msg="llama runner started in 33.21 seconds" [GIN] 2025/04/11 - 12:29:16 | 200 | 1m3s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:29:17 | 200 | 1m3s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:29:20 | 200 | 1m7s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:29:21 | 200 | 1m8s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:29:31 | 200 | 1m18s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:29:46 | 200 | 29.3217937s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:29:48 | 200 | 1m35s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:29:53 | 200 | 24.305176s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:29:55 | 200 | 38.9044245s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:29:55 | 200 | 35.622749s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:30:07 | 200 | 47.8223514s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:30:10 
| 200 | 24.0901157s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:30:15 | 200 | 26.5035104s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:30:19 | 200 | 25.3068109s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:30:28 | 200 | 17.7198365s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:30:36 | 200 | 17.0471386s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:30:42 | 200 | 46.2424021s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:30:46 | 200 | 52.7859945s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:30:50 | 200 | 37.2662657s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:30:53 | 200 | 46.0491668s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:31:07 | 200 | 25.0693838s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:31:12 | 200 | 25.8484002s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:31:15 | 200 | 21.6213194s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:31:20 | 200 | 44.0849389s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:31:24 | 200 | 56.6358737s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:31:30 | 200 | 25.5136271s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:31:34 | 200 | 46.2709158s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:31:42 | 200 | 27.6361706s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:31:52 | 200 | 27.5678475s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:31:58 | 200 | 27.5641811s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:31:59 | 200 | 19.0255138s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:32:14 | 200 | 54.4350914s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:32:21 | 200 | 47.1807956s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:32:24 | 200 | 1m12s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:32:31 | 200 | 40.5839697s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:32:34 | 200 | 36.1990228s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:32:53 | 200 | 53.1846597s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:32:58 | 200 | 33.2789358s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:33:02 | 200 | 48.0004041s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:33:10 | 200 | 48.7998456s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:33:15 | 200 | 22.2831035s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:33:18 | 200 | 47.1468018s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:33:34 | 200 | 18.6788824s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:33:43 | 200 | 40.6046839s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:33:50 | 200 | 1m15s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:33:53 | 200 | 10.0716928s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:33:57 | 200 | 0s | 127.0.0.1 | GET "/api/version" [GIN] 2025/04/11 - 12:33:59 | 200 | 25.3622131s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:34:01 | 200 | 11.4095702s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:34:07 | 200 | 5.6965056s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:34:08 | 200 | 50.2529789s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:34:13 | 200 | 6.3515579s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:34:13 | 200 | 18.2µs | 127.0.0.1 | HEAD "/" [GIN] 2025/04/11 - 12:34:13 | 200 | 549.5µs | 127.0.0.1 | GET "/api/ps" [GIN] 2025/04/11 - 12:34:18 | 200 | 24.7910725s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 
12:34:20 | 200 | 11.2859414s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:34:24 | 200 | 1m25s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:34:24 | 200 | 1m13s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:34:29 | 200 | 15.6519589s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:34:30 | 200 | 10.6482381s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:34:34 | 200 | 15.6839368s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:34:37 | 200 | 13.0729204s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:34:40 | 200 | 9.8388866s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:34:46 | 200 | 22.4742719s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:34:47 | 200 | 47.8380304s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:34:56 | 200 | 15.7782414s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:34:56 | 200 | 9.8823511s | 10.112.4.104 | POST "/api/chat" ggml.c:1584: GGML_ASSERT(view_src == NULL || data_size == 0 || data_size + view_offs <= ggml_nbytes(view_src)) failed [GIN] 2025/04/11 - 12:34:58 | 200 | 2.0091783s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:34:58 | 200 | 13.041289s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:34:58 | 200 | 2.0050585s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:34:58 | 200 | 21.0113455s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:34:58 | 200 | 24.364639s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:34:58 | 200 | 31.1420179s | 10.112.4.104 | POST "/api/chat" time=2025-04-11T12:34:59.093+02:00 level=ERROR source=server.go:449 msg="llama runner terminated" error="exit status 0xc0000409" time=2025-04-11T12:35:03.559+02:00 level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.0334434 model=C:\Users\devops01\.ollama\models\blobs\sha256-07ca3450446e07c4e3dfd55d34e3f426963a15f1db00c3093d9214c202d12e25 time=2025-04-11T12:35:03.766+02:00 level=INFO source=server.go:105 msg="system memory" total="256.0 GiB" free="250.1 GiB" free_swap="284.7 GiB" time=2025-04-11T12:35:03.773+02:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=999 layers.model=63 layers.offload=63 layers.split=32,31 memory.available="[43.0 GiB 42.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="78.0 GiB" memory.required.partial="78.0 GiB" memory.required.kv="10.1 GiB" memory.required.allocations="[37.7 GiB 40.3 GiB]" memory.weights.total="50.3 GiB" memory.weights.repeating="47.7 GiB" memory.weights.nonrepeating="2.6 GiB" memory.graph.full="6.6 GiB" memory.graph.partial="6.6 GiB" projector.weights="795.9 MiB" projector.graph="1.0 GiB" time=2025-04-11T12:35:03.809+02:00 level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.2836285 model=C:\Users\devops01\.ollama\models\blobs\sha256-07ca3450446e07c4e3dfd55d34e3f426963a15f1db00c3093d9214c202d12e25 time=2025-04-11T12:35:03.903+02:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false time=2025-04-11T12:35:03.914+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07 time=2025-04-11T12:35:03.914+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000 time=2025-04-11T12:35:03.914+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06 time=2025-04-11T12:35:03.914+02:00 level=WARN source=ggml.go:149 
msg="key not found" key=gemma3.rope.freq_scale default=1 time=2025-04-11T12:35:03.914+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256 time=2025-04-11T12:35:03.916+02:00 level=INFO source=server.go:405 msg="starting llama server" cmd="C:\\Users\\devops01\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model C:\\Users\\devops01\\.ollama\\models\\blobs\\sha256-07ca3450446e07c4e3dfd55d34e3f426963a15f1db00c3093d9214c202d12e25 --ctx-size 98304 --batch-size 512 --n-gpu-layers 999 --threads 8 --no-mmap --parallel 6 --tensor-split 32,31 --port 49887" time=2025-04-11T12:35:03.921+02:00 level=INFO source=sched.go:451 msg="loaded runners" count=1 time=2025-04-11T12:35:03.922+02:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding" time=2025-04-11T12:35:03.922+02:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error" time=2025-04-11T12:35:03.949+02:00 level=INFO source=runner.go:821 msg="starting ollama engine" time=2025-04-11T12:35:03.951+02:00 level=INFO source=runner.go:884 msg="Server listening on 127.0.0.1:49887" time=2025-04-11T12:35:04.059+02:00 level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.5331634 model=C:\Users\devops01\.ollama\models\blobs\sha256-07ca3450446e07c4e3dfd55d34e3f426963a15f1db00c3093d9214c202d12e25 time=2025-04-11T12:35:04.106+02:00 level=WARN source=ggml.go:149 msg="key not found" key=general.name default="" time=2025-04-11T12:35:04.106+02:00 level=WARN source=ggml.go:149 msg="key not found" key=general.description default="" time=2025-04-11T12:35:04.106+02:00 level=INFO source=ggml.go:66 msg="" architecture=gemma3 file_type=F16 name="" description="" num_tensors=1247 num_key_values=37 time=2025-04-11T12:35:04.175+02:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model" ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 2 CUDA devices: Device 0: NVIDIA A40-48Q, compute capability 8.6, VMM: no Device 1: NVIDIA A40-48Q, compute capability 8.6, VMM: no load_backend: loaded CUDA backend from C:\Users\devops01\AppData\Local\Programs\Ollama\lib\ollama\cuda_v11\ggml-cuda.dll load_backend: loaded CPU backend from C:\Users\devops01\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll time=2025-04-11T12:35:04.254+02:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang) time=2025-04-11T12:35:05.336+02:00 level=INFO source=ggml.go:288 msg="model weights" buffer=CUDA0 size="24.6 GiB" time=2025-04-11T12:35:05.336+02:00 level=INFO source=ggml.go:288 msg="model weights" buffer=CUDA1 size="26.5 GiB" time=2025-04-11T12:35:05.336+02:00 level=INFO source=ggml.go:288 msg="model weights" buffer=CPU size="2.6 GiB" time=2025-04-11T12:35:36.278+02:00 level=INFO source=ggml.go:380 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 time=2025-04-11T12:35:36.279+02:00 level=INFO source=ggml.go:380 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 time=2025-04-11T12:35:36.279+02:00 level=INFO source=ggml.go:380 msg="compute graph" backend=CPU buffer_type=CUDA_Host time=2025-04-11T12:35:36.282+02:00 level=WARN 
source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false time=2025-04-11T12:35:36.290+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07 time=2025-04-11T12:35:36.290+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000 time=2025-04-11T12:35:36.290+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06 time=2025-04-11T12:35:36.290+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1 time=2025-04-11T12:35:36.290+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256 time=2025-04-11T12:35:36.490+02:00 level=INFO source=server.go:619 msg="llama runner started in 32.57 seconds" [GIN] 2025/04/11 - 12:35:50 | 200 | 51.9087955s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:35:52 | 200 | 53.8742062s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:35:56 | 200 | 58.3981315s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:35:59 | 200 | 9.4255731s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:36:09 | 200 | 1m11s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:36:12 | 200 | 1m13s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:36:15 | 200 | 22.754484s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:36:20 | 200 | 1m21s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:36:21 | 200 | 21.7467929s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:36:21 | 200 | 12.0151343s | 10.112.4.104 | POST "/api/chat" ggml.c:1584: GGML_ASSERT(view_src == NULL || data_size == 0 || data_size + view_offs <= ggml_nbytes(view_src)) failed [GIN] 2025/04/11 - 12:36:23 | 200 | 8.3181135s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:36:23 | 200 | 1.6745798s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:36:23 | 200 | 1.8272246s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:36:23 | 200 | 26.5833756s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:36:23 | 200 | 11.3937013s | 10.112.4.104 | POST "/api/chat" [GIN] 2025/04/11 - 12:36:23 | 200 | 3.4645853s | 10.112.4.104 | POST "/api/chat" time=2025-04-11T12:36:23.817+02:00 level=ERROR source=server.go:449 msg="llama runner terminated" error="exit status 0xc0000409" time=2025-04-11T12:36:28.622+02:00 level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.0296166 model=C:\Users\devops01\.ollama\models\blobs\sha256-07ca3450446e07c4e3dfd55d34e3f426963a15f1db00c3093d9214c202d12e25 time=2025-04-11T12:36:28.835+02:00 level=INFO source=server.go:105 msg="system memory" total="256.0 GiB" free="250.1 GiB" free_swap="284.7 GiB" time=2025-04-11T12:36:28.837+02:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=999 layers.model=63 layers.offload=63 layers.split=32,31 memory.available="[43.0 GiB 42.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="78.0 GiB" memory.required.partial="78.0 GiB" memory.required.kv="10.1 GiB" memory.required.allocations="[37.7 GiB 40.3 GiB]" memory.weights.total="50.3 GiB" memory.weights.repeating="47.7 GiB" memory.weights.nonrepeating="2.6 GiB" memory.graph.full="6.6 GiB" memory.graph.partial="6.6 GiB" projector.weights="795.9 MiB" projector.graph="1.0 GiB" time=2025-04-11T12:36:28.873+02:00 level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.2801325 
model=C:\Users\devops01\.ollama\models\blobs\sha256-07ca3450446e07c4e3dfd55d34e3f426963a15f1db00c3093d9214c202d12e25 time=2025-04-11T12:36:28.972+02:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false time=2025-04-11T12:36:28.989+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07 time=2025-04-11T12:36:28.989+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000 time=2025-04-11T12:36:28.989+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06 time=2025-04-11T12:36:28.989+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1 time=2025-04-11T12:36:28.989+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256 time=2025-04-11T12:36:28.990+02:00 level=INFO source=server.go:405 msg="starting llama server" cmd="C:\\Users\\devops01\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model C:\\Users\\devops01\\.ollama\\models\\blobs\\sha256-07ca3450446e07c4e3dfd55d34e3f426963a15f1db00c3093d9214c202d12e25 --ctx-size 98304 --batch-size 512 --n-gpu-layers 999 --threads 8 --no-mmap --parallel 6 --tensor-split 32,31 --port 49899" time=2025-04-11T12:36:28.998+02:00 level=INFO source=sched.go:451 msg="loaded runners" count=1 time=2025-04-11T12:36:28.998+02:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding" time=2025-04-11T12:36:29.000+02:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error" time=2025-04-11T12:36:29.024+02:00 level=INFO source=runner.go:821 msg="starting ollama engine" time=2025-04-11T12:36:29.025+02:00 level=INFO source=runner.go:884 msg="Server listening on 127.0.0.1:49899" time=2025-04-11T12:36:29.127+02:00 level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.5348366 model=C:\Users\devops01\.ollama\models\blobs\sha256-07ca3450446e07c4e3dfd55d34e3f426963a15f1db00c3093d9214c202d12e25 time=2025-04-11T12:36:29.157+02:00 level=WARN source=ggml.go:149 msg="key not found" key=general.name default="" time=2025-04-11T12:36:29.157+02:00 level=WARN source=ggml.go:149 msg="key not found" key=general.description default="" time=2025-04-11T12:36:29.157+02:00 level=INFO source=ggml.go:66 msg="" architecture=gemma3 file_type=F16 name="" description="" num_tensors=1247 num_key_values=37 time=2025-04-11T12:36:29.252+02:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model" ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 2 CUDA devices: Device 0: NVIDIA A40-48Q, compute capability 8.6, VMM: no Device 1: NVIDIA A40-48Q, compute capability 8.6, VMM: no load_backend: loaded CUDA backend from C:\Users\devops01\AppData\Local\Programs\Ollama\lib\ollama\cuda_v11\ggml-cuda.dll load_backend: loaded CPU backend from C:\Users\devops01\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll time=2025-04-11T12:36:29.303+02:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang) 
time=2025-04-11T12:36:30.377+02:00 level=INFO source=ggml.go:288 msg="model weights" buffer=CUDA0 size="24.6 GiB"
time=2025-04-11T12:36:30.377+02:00 level=INFO source=ggml.go:288 msg="model weights" buffer=CUDA1 size="26.5 GiB"
time=2025-04-11T12:36:30.377+02:00 level=INFO source=ggml.go:288 msg="model weights" buffer=CPU size="2.6 GiB"

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.6.4

@forReason commented on GitHub (Apr 11, 2025):

Looking at the error a second time: view_src == NULL || data_size == 0. I think the issue might simply be that the model produced no output?

I have some trouble with the model sometimes producing very sparse output, e.g. a single token such as:

9

I do recognize this, but I have not yet found a solution for the issue; perhaps it might be related:

image output was truncated, trying again for the 1th time/
image output was truncated, trying again for the 2th time/
image output was truncated, trying again for the 4th time/
image output was truncated, trying again for the 2th time/
image output was truncated, trying again for the 3th time/
image output was truncated, trying again for the 5th time/

@rick-github commented on GitHub (Apr 11, 2025):

OLLAMA_SHAREDMEM is not an Ollama configuration variable. The error occurs in ggml_new_tensor_impl(), i.e. when it wants to allocate more memory for a new object. Does the error occur if you don't override num_gpu to 999?
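For reference, the invariant that assertion enforces is "a view tensor's byte range must lie entirely within its source buffer". A minimal sketch in Python (illustrative only: the real check is C code in ggml_new_tensor_impl(), the view_src == NULL clause for non-view tensors is dropped here, and nbytes/make_view are hypothetical names):

```python
# Simplified model of the GGML view-bounds invariant. In ggml, a "view"
# tensor aliases a slice of another tensor's buffer; the assert fires
# when the view's bytes would extend past the end of its source.

def nbytes(shape, itemsize):
    """Total byte size of a dense tensor (rough ggml_nbytes analogue)."""
    n = itemsize
    for dim in shape:
        n *= dim
    return n

def make_view(src_shape, view_shape, view_offs, itemsize=2):  # 2 bytes for F16
    """Create a (shape, offset) view into a source buffer, enforcing bounds."""
    data_size = nbytes(view_shape, itemsize)
    src_size = nbytes(src_shape, itemsize)
    # Same shape as the failing GGML_ASSERT, minus the NULL clause:
    assert data_size == 0 or data_size + view_offs <= src_size, (
        f"view of {data_size} bytes at offset {view_offs} "
        f"does not fit in source of {src_size} bytes"
    )
    return (view_shape, view_offs)

make_view((4, 4), (0,), 0)  # zero-size view: explicitly allowed

try:
    make_view((4, 4), (4, 4), 8)  # 32 bytes at offset 8 overrun a 32-byte source
except AssertionError as e:
    print("assert fired:", e)
```

Note that the data_size == 0 clause means a genuinely empty output is explicitly allowed, so the crash suggests a view whose offset or size is wrong rather than empty output per se.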

@forReason commented on GitHub (Apr 11, 2025):

Edit: according to https://github.com/ollama/ollama/issues/7584#issuecomment-2466715900, unified memory should be on by default, giving a total of ~200 GB of shared GPU memory (~98 GB of GPU-only memory).

There should be plenty of memory free:

 nvidia-smi
Fri Apr 11 13:30:47 2025
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 474.14       Driver Version: 474.14       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40-48Q     WDDM  | 00000000:02:01.0  On |                    0 |
| N/A    0C    P0    N/A /  N/A |  36025MiB / 49152MiB |     34%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A40-48Q     WDDM  | 00000000:02:02.0 Off |                    0 |
| N/A    0C    P0    N/A /  N/A |  38778MiB / 49152MiB |     51%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      6992    C+G   C:\Windows\explorer.exe         N/A      |
|    0   N/A  N/A      8208    C+G   Insufficient Permissions        N/A      |
|    0   N/A  N/A      8380    C+G   ...y\ShellExperienceHost.exe    N/A      |
|    0   N/A  N/A      8788      C   ...rograms\Ollama\ollama.exe    N/A      |
|    0   N/A  N/A      9084    C+G   Insufficient Permissions        N/A      |
|    0   N/A  N/A      9136      C   ...rograms\Ollama\ollama.exe    N/A      |
|    1   N/A  N/A      3048    C+G   ...2txyewy\TextInputHost.exe    N/A      |
|    1   N/A  N/A      7548    C+G   ...5n1h2txyewy\SearchApp.exe    N/A      |
|    1   N/A  N/A      8788      C   ...rograms\Ollama\ollama.exe    N/A      |
|    1   N/A  N/A      9136      C   ...rograms\Ollama\ollama.exe    N/A      |
+-----------------------------------------------------------------------------+

The same issue occurs without the num_gpu override (Ollama chooses the number of layers to put on GPU):

time=2025-04-11T13:34:25.890+02:00 level=INFO source=server.go:405 msg="starting llama server" cmd="C:\\Users\\devops01\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model C:\\Users\\devops01\\.ollama\\models\\blobs\\sha256-07ca3450446e07c4e3dfd55d34e3f426963a15f1db00c3093d9214c202d12e25 --ctx-size 98304 --batch-size 512 --n-gpu-layers 63 --threads 8 --no-mmap --parallel 6 --tensor-split 32,31 --port 59445"
time=2025-04-11T13:34:25.896+02:00 level=INFO source=sched.go:451 msg="loaded runners" count=1
time=2025-04-11T13:34:25.896+02:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
time=2025-04-11T13:34:25.898+02:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
time=2025-04-11T13:34:25.920+02:00 level=INFO source=runner.go:821 msg="starting ollama engine"
time=2025-04-11T13:34:25.922+02:00 level=INFO source=runner.go:884 msg="Server listening on 127.0.0.1:59445"
time=2025-04-11T13:34:25.985+02:00 level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.5307785 model=C:\Users\devops01\.ollama\models\blobs\sha256-07ca3450446e07c4e3dfd55d34e3f426963a15f1db00c3093d9214c202d12e25
time=2025-04-11T13:34:26.072+02:00 level=WARN source=ggml.go:149 msg="key not found" key=general.name default=""
time=2025-04-11T13:34:26.073+02:00 level=WARN source=ggml.go:149 msg="key not found" key=general.description default=""
time=2025-04-11T13:34:26.073+02:00 level=INFO source=ggml.go:66 msg="" architecture=gemma3 file_type=F16 name="" description="" num_tensors=1247 num_key_values=37
time=2025-04-11T13:34:26.150+02:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA A40-48Q, compute capability 8.6, VMM: no
  Device 1: NVIDIA A40-48Q, compute capability 8.6, VMM: no
load_backend: loaded CUDA backend from C:\Users\devops01\AppData\Local\Programs\Ollama\lib\ollama\cuda_v11\ggml-cuda.dll
load_backend: loaded CPU backend from C:\Users\devops01\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
time=2025-04-11T13:34:26.223+02:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-04-11T13:34:27.311+02:00 level=INFO source=ggml.go:288 msg="model weights" buffer=CUDA0 size="24.6 GiB"
time=2025-04-11T13:34:27.311+02:00 level=INFO source=ggml.go:288 msg="model weights" buffer=CUDA1 size="26.5 GiB"
time=2025-04-11T13:34:27.311+02:00 level=INFO source=ggml.go:288 msg="model weights" buffer=CPU size="2.6 GiB"
time=2025-04-11T13:34:59.491+02:00 level=INFO source=ggml.go:380 msg="compute graph" backend=CUDA0 buffer_type=CUDA0
time=2025-04-11T13:34:59.492+02:00 level=INFO source=ggml.go:380 msg="compute graph" backend=CUDA1 buffer_type=CUDA1
time=2025-04-11T13:34:59.492+02:00 level=INFO source=ggml.go:380 msg="compute graph" backend=CPU buffer_type=CUDA_Host
time=2025-04-11T13:34:59.495+02:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
time=2025-04-11T13:34:59.502+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
time=2025-04-11T13:34:59.502+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
time=2025-04-11T13:34:59.502+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
time=2025-04-11T13:34:59.502+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
time=2025-04-11T13:34:59.502+02:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
time=2025-04-11T13:34:59.725+02:00 level=INFO source=server.go:619 msg="llama runner started in 33.83 seconds"
[GIN] 2025/04/11 - 13:35:14 | 200 |   53.8747518s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 13:35:15 | 200 |   55.4299227s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 13:35:21 | 200 |          1m1s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 13:35:28 | 200 |    6.5714205s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 13:35:32 | 200 |         1m11s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 13:35:36 | 200 |   22.5626104s |    10.112.4.104 | POST     "/api/chat"
[GIN] 2025/04/11 - 13:35:36 | 200 |         1m16s |    10.112.4.104 | POST     "/api/chat"
ggml.c:1584: GGML_ASSERT(view_src == NULL || data_size == 0 || data_size + view_offs <= ggml_nbytes(view_src)) failed

@rick-github commented on GitHub (Apr 11, 2025):

Does it happen if the model is loaded on just one GPU? I don't know what your processing requirements are, but if you lower OLLAMA_NUM_PARALLEL and num_ctx you might be able to squeeze it onto a single GPU.

@forReason commented on GitHub (Apr 11, 2025):

I don't know what num_ctx I actually need for full image transcription. I don't require a chat history, but a full page scan should be transcribed completely.

@rick-github commented on GitHub (Apr 11, 2025):

Set OLLAMA_NUM_PARALLEL=1 and then see what's the largest context you can use without spilling the model.

curl localhost:11434/api/generate -d "{\"model\":\"model-name\",\"options\":{\"num_ctx\":4096}}"

Adjust the value of num_ctx until nvidia-smi shows only one GPU in use.
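A scripted version of that search, sketched in Python (untested sketch; the model tag is a placeholder, an empty /api/generate request only loads the model, as in the curl above, and the size/size_vram field names in /api/ps should be verified against your Ollama version):

```python
# Rough helper for finding the largest num_ctx that keeps the whole
# model resident on the GPUs, following rick-github's suggestion.
import json
import urllib.request

BASE = "http://localhost:11434"
MODEL = "gemma3:27b-it-fp16"  # placeholder: use your model tag

def load_with_ctx(num_ctx):
    """POST /api/generate with no prompt just (re)loads the model."""
    body = json.dumps({"model": MODEL, "options": {"num_ctx": num_ctx}}).encode()
    req = urllib.request.Request(f"{BASE}/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req).read()

def vram_fraction():
    """Fraction of the loaded model resident in VRAM, from /api/ps."""
    with urllib.request.urlopen(f"{BASE}/api/ps") as resp:
        ps = json.load(resp)
    m = ps["models"][0]
    return m["size_vram"] / m["size"]

for num_ctx in (4096, 8192, 16384, 32768, 65536, 98304):
    load_with_ctx(num_ctx)
    frac = vram_fraction()
    print(f"num_ctx={num_ctx}: {frac:.0%} of the model in VRAM")
    if frac < 1.0:  # model spilled to CPU; the previous value was the max
        break
```

Keep in mind the effective KV-cache size scales with parallelism: the logs above show --ctx-size 98304 with --parallel 6, which is consistent with a per-request num_ctx of 16384 multiplied by 6 slots.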

@forReason commented on GitHub (Apr 14, 2025):

The issue did not appear with parallel=1 and a single GPU/CPU; the model did not fit entirely on one GPU.

I will now test with parallel=1 and 2 GPUs, but I suspect parallel to be the issue.


I have been unable to reproduce the issue with 2 GPUs and num_parallel=1.

I will now try to gradually increase num_parallel from 2, in order to see if the issue is with parallel in general or if there is a specific threshold where the error arises for me.


The error appears with parallel=4.
I am still not sure if it's RAM-related, as there should be plenty of free memory.
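A small load generator makes that threshold search reproducible; a sketch in Python (model tag and prompt are placeholders; assumes the server from this report on its default port):

```python
# Minimal load generator: fire several concurrent /api/chat requests
# and watch for the runner crash at a given OLLAMA_NUM_PARALLEL level.
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE = "http://localhost:11434"
MODEL = "gemma3:27b-it-fp16"  # placeholder: use your model tag

def chat(i):
    body = json.dumps({
        "model": MODEL,
        "stream": False,
        "messages": [{"role": "user", "content": f"Request {i}: write a paragraph."}],
    }).encode()
    req = urllib.request.Request(f"{BASE}/api/chat", data=body,
                                 headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=600) as resp:
            json.load(resp)
        return f"request {i}: ok"
    except Exception as e:  # a crashed runner surfaces as a 500 or a reset
        return f"request {i}: {e}"

# Step max_workers through 1, 2, 4, 6 against matching OLLAMA_NUM_PARALLEL
# settings and note where the GGML_ASSERT first appears in the server log.
with ThreadPoolExecutor(max_workers=4) as pool:
    for line in pool.map(chat, range(8)):
        print(line)
```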

@forReason commented on GitHub (Apr 22, 2025):

So it seems the error happens with multiple parallel instances, when the memory might no longer fit entirely into the GPU.
It seems the NVIDIA unified memory, or the memory overflow into RAM, is not working properly.

This might be related to vGPUs, which I think don't support unified memory, but Ollama probably tries either way.
