[GH-ISSUE #10517] error="timed out waiting for llama runner to start" #53432

Closed
opened 2026-04-29 03:08:46 -05:00 by GiteaMirror · 17 comments
Owner

Originally created by @neoxeo on GitHub (May 1, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10517

What is the issue?

When I try to run models like:

  • qwen3:30b-a3b 2ee832bc15b5 18 GB
  • MHKetbi/llm4decompile-22b-v2:q6_K 3565eae14497 18 GB
  • gemma3:27b-it-qat 29eb0b9aeda3 18 GB

I get this type of error (the load stalls at progress 0.64, then fails five minutes later, matching the OLLAMA_LOAD_TIMEOUT:5m0s shown in the server config):

time=2025-05-01T14:09:44.863+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.63"
load_all_data: using async uploads for device CUDA1, buffer type CUDA1, backend CUDA1
time=2025-05-01T14:09:45.114+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.64"
time=2025-05-01T14:14:45.292+02:00 level=ERROR source=sched.go:457 msg="error loading llama server" error="timed out waiting for llama runner to start - progress 0.64 - "
time=2025-05-01T14:14:45.292+02:00 level=DEBUG source=sched.go:460 msg="triggering expiration for failed load" model=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
time=2025-05-01T14:14:45.292+02:00 level=DEBUG source=sched.go:362 msg="runner expired event received" modelPath=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
time=2025-05-01T14:14:45.292+02:00 level=DEBUG source=sched.go:377 msg="got lock to unload" modelPath=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
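
One workaround I could try (an assumption on my side, not a confirmed fix): since the failure fires right as the 5-minute load timeout elapses, raising OLLAMA_LOAD_TIMEOUT should at least show whether the load is merely slow or genuinely hung. On Windows, something like:

# Raise the runner load timeout from its 5m default (the 15m value is just an example),
# then start the server in the same session so it picks the variable up.
$env:OLLAMA_LOAD_TIMEOUT = "15m"
ollama serve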

If I run another model like:

  • qwen3:14b
    it works fine:
total duration:       1m13.0113207s
load duration:        73.5138ms
prompt eval count:    22 token(s)
prompt eval duration: 918.692ms
prompt eval rate:     23.95 tokens/s
eval count:           1534 token(s)
eval duration:        1m12.0093668s
eval rate:            21.30 tokens/s
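
For reference, timing output like the above is what the CLI prints when run with the verbose flag (assuming the standard client):

ollama run qwen3:14b --verbose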

Thanks for your work and your help!

Relevant log output

2025/05/01 14:08:44 routes.go:1232: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\\Ollama_Models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-05-01T14:08:45.015+02:00 level=INFO source=images.go:458 msg="total blobs: 50"
time=2025-05-01T14:08:45.030+02:00 level=INFO source=images.go:465 msg="total unused blobs removed: 0"
time=2025-05-01T14:08:45.061+02:00 level=INFO source=routes.go:1299 msg="Listening on 127.0.0.1:11434 (version 0.6.6)"
time=2025-05-01T14:08:45.061+02:00 level=DEBUG source=sched.go:107 msg="starting llm scheduler"
time=2025-05-01T14:08:45.062+02:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-05-01T14:08:45.063+02:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-05-01T14:08:45.063+02:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=14 efficiency=0 threads=28
time=2025-05-01T14:08:45.063+02:00 level=DEBUG source=gpu.go:98 msg="searching for GPU discovery libraries for NVIDIA"
time=2025-05-01T14:08:45.064+02:00 level=DEBUG source=gpu.go:501 msg="Searching for GPU library" name=nvml.dll
time=2025-05-01T14:08:45.064+02:00 level=DEBUG source=gpu.go:525 msg="gpu library search" globs="[C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\nvml.dll C:\\Program Files (x86)\\VMware\\VMware Workstation\\bin\\nvml.dll C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\iCLS\\nvml.dll C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\iCLS\\nvml.dll C:\\Windows\\system32\\nvml.dll C:\\Windows\\nvml.dll C:\\Windows\\System32\\Wbem\\nvml.dll C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\nvml.dll C:\\Windows\\System32\\OpenSSH\\nvml.dll C:\\Program Files\\Git\\cmd\\nvml.dll C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\DAL\\nvml.dll C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\DAL\\nvml.dll C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\IPT\\nvml.dll C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\IPT\\nvml.dll C:\\Program Files\\dotnet\\nvml.dll C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvml.dll C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Eclipse Adoptium\\jdk-21.0.7.6-hotspot\\bin\\nvml.dll C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Python\\Launcher\\nvml.dll C:\\Users\\Testeur1\\AppData\\Local\\Microsoft\\WindowsApps\\nvml.dll C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\nvml.dll C:\\Users\\Testeur1\\AppData\\Local\\GitHubDesktop\\bin\\nvml.dll C:\\Users\\Testeur1\\.dotnet\\tools\\nvml.dll C:\\Python\\miniforge3\\nvml.dll C:\\Python\\miniforge3\\Scripts\\nvml.dll C:\\Users\\Testeur1\\nvml.dll c:\\Windows\\System32\\nvml.dll]"
time=2025-05-01T14:08:45.071+02:00 level=DEBUG source=gpu.go:529 msg="skipping PhysX cuda library path" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvml.dll"
time=2025-05-01T14:08:45.078+02:00 level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths="[C:\\Windows\\system32\\nvml.dll c:\\Windows\\System32\\nvml.dll]"
time=2025-05-01T14:08:45.125+02:00 level=DEBUG source=gpu.go:111 msg="nvidia-ml loaded" library=C:\Windows\system32\nvml.dll
time=2025-05-01T14:08:45.125+02:00 level=DEBUG source=gpu.go:501 msg="Searching for GPU library" name=nvcuda.dll
time=2025-05-01T14:08:45.135+02:00 level=DEBUG source=gpu.go:525 msg="gpu library search" globs="[C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\nvcuda.dll C:\\Program Files (x86)\\VMware\\VMware Workstation\\bin\\nvcuda.dll C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\iCLS\\nvcuda.dll C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\iCLS\\nvcuda.dll C:\\Windows\\system32\\nvcuda.dll C:\\Windows\\nvcuda.dll C:\\Windows\\System32\\Wbem\\nvcuda.dll C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\nvcuda.dll C:\\Windows\\System32\\OpenSSH\\nvcuda.dll C:\\Program Files\\Git\\cmd\\nvcuda.dll C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\DAL\\nvcuda.dll C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\DAL\\nvcuda.dll C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\IPT\\nvcuda.dll C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\IPT\\nvcuda.dll C:\\Program Files\\dotnet\\nvcuda.dll C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvcuda.dll C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Eclipse Adoptium\\jdk-21.0.7.6-hotspot\\bin\\nvcuda.dll C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Python\\Launcher\\nvcuda.dll C:\\Users\\Testeur1\\AppData\\Local\\Microsoft\\WindowsApps\\nvcuda.dll C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\nvcuda.dll C:\\Users\\Testeur1\\AppData\\Local\\GitHubDesktop\\bin\\nvcuda.dll C:\\Users\\Testeur1\\.dotnet\\tools\\nvcuda.dll C:\\Python\\miniforge3\\nvcuda.dll C:\\Python\\miniforge3\\Scripts\\nvcuda.dll C:\\Users\\Testeur1\\nvcuda.dll c:\\windows\\system*\\nvcuda.dll]"
time=2025-05-01T14:08:45.143+02:00 level=DEBUG source=gpu.go:529 msg="skipping PhysX cuda library path" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvcuda.dll"
time=2025-05-01T14:08:45.148+02:00 level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths=[C:\Windows\system32\nvcuda.dll]
initializing C:\Windows\system32\nvcuda.dll
dlsym: cuInit - 00007FFEC0FF1F80
dlsym: cuDriverGetVersion - 00007FFEC0FF2020
dlsym: cuDeviceGetCount - 00007FFEC0FF2816
dlsym: cuDeviceGet - 00007FFEC0FF2810
dlsym: cuDeviceGetAttribute - 00007FFEC0FF2170
dlsym: cuDeviceGetUuid - 00007FFEC0FF2822
dlsym: cuDeviceGetName - 00007FFEC0FF281C
dlsym: cuCtxCreate_v3 - 00007FFEC0FF2894
dlsym: cuMemGetInfo_v2 - 00007FFEC0FF2996
dlsym: cuCtxDestroy - 00007FFEC0FF28A6
calling cuInit
calling cuDriverGetVersion
raw version 0x2f3a
CUDA driver version: 12.9
calling cuDeviceGetCount
device count 2
time=2025-05-01T14:08:45.250+02:00 level=DEBUG source=gpu.go:125 msg="detected GPUs" count=2 library=C:\Windows\system32\nvcuda.dll
[GPU-962a842b-b382-6457-65a1-3cffec62ba6f] CUDA totalMem 12281 mb
[GPU-962a842b-b382-6457-65a1-3cffec62ba6f] CUDA freeMem 11248 mb
[GPU-962a842b-b382-6457-65a1-3cffec62ba6f] Compute Capability 8.6
time=2025-05-01T14:08:45.488+02:00 level=INFO source=gpu.go:319 msg="detected OS VRAM overhead" id=GPU-962a842b-b382-6457-65a1-3cffec62ba6f library=cuda compute=8.6 driver=12.9 name="NVIDIA RTX A2000 12GB" overhead="862.0 MiB"
[GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385] CUDA totalMem 8191 mb
[GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385] CUDA freeMem 7296 mb
[GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385] Compute Capability 5.2
time=2025-05-01T14:08:45.612+02:00 level=INFO source=gpu.go:319 msg="detected OS VRAM overhead" id=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 library=cuda compute=5.2 driver=12.9 name="Quadro M5000" overhead="578.1 MiB"
time=2025-05-01T14:08:45.622+02:00 level=DEBUG source=amd_windows.go:34 msg="unable to load amdhip64_6.dll, please make sure to upgrade to the latest amd driver: Le module spécifié est introuvable."
releasing cuda driver library
releasing nvml library
time=2025-05-01T14:08:45.624+02:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-962a842b-b382-6457-65a1-3cffec62ba6f library=cuda variant=v12 compute=8.6 driver=12.9 name="NVIDIA RTX A2000 12GB" total="12.0 GiB" available="11.0 GiB"
time=2025-05-01T14:08:45.624+02:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 library=cuda variant=v12 compute=5.2 driver=12.9 name="Quadro M5000" total="8.0 GiB" available="7.1 GiB"
[GIN] 2025/05/01 - 14:09:31 | 200 |            0s |       127.0.0.1 | HEAD     "/"
time=2025-05-01T14:09:31.863+02:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-01T14:09:31.910+02:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
[GIN] 2025/05/01 - 14:09:31 | 200 |    140.0029ms |       127.0.0.1 | POST     "/api/show"
time=2025-05-01T14:09:32.179+02:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-01T14:09:32.189+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.5 GiB" before.free_swap="127.9 GiB" now.total="127.9 GiB" now.free="120.5 GiB" now.free_swap="127.9 GiB"
time=2025-05-01T14:09:32.237+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="862.0 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="172.0 MiB"
time=2025-05-01T14:09:32.237+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="578.1 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="318.3 MiB"
releasing nvml library
time=2025-05-01T14:09:32.239+02:00 level=DEBUG source=sched.go:183 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=6 gpu_count=2
time=2025-05-01T14:09:32.270+02:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-01T14:09:32.306+02:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-01T14:09:32.325+02:00 level=DEBUG source=sched.go:226 msg="loading first model" model=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
time=2025-05-01T14:09:32.325+02:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[11.0 GiB]"
time=2025-05-01T14:09:32.325+02:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3moe.vision.block_count default=0
time=2025-05-01T14:09:32.327+02:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[7.1 GiB]"
time=2025-05-01T14:09:32.327+02:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3moe.vision.block_count default=0
time=2025-05-01T14:09:32.327+02:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[11.0 GiB]"
time=2025-05-01T14:09:32.327+02:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3moe.vision.block_count default=0
time=2025-05-01T14:09:32.328+02:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[7.1 GiB]"
time=2025-05-01T14:09:32.328+02:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3moe.vision.block_count default=0
time=2025-05-01T14:09:32.334+02:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=2 available="[11.0 GiB 7.1 GiB]"
time=2025-05-01T14:09:32.334+02:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3moe.vision.block_count default=0
time=2025-05-01T14:09:32.336+02:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=2 available="[11.0 GiB 7.1 GiB]"
time=2025-05-01T14:09:32.336+02:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3moe.vision.block_count default=0
time=2025-05-01T14:09:32.339+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.5 GiB" before.free_swap="127.9 GiB" now.total="127.9 GiB" now.free="120.5 GiB" now.free_swap="127.9 GiB"
time=2025-05-01T14:09:32.379+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="862.0 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="172.0 MiB"
time=2025-05-01T14:09:32.379+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="578.1 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="318.3 MiB"
releasing nvml library
time=2025-05-01T14:09:32.381+02:00 level=INFO source=server.go:105 msg="system memory" total="127.9 GiB" free="120.5 GiB" free_swap="127.9 GiB"
time=2025-05-01T14:09:32.381+02:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=2 available="[11.0 GiB 7.1 GiB]"
time=2025-05-01T14:09:32.381+02:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3moe.vision.block_count default=0
time=2025-05-01T14:09:32.382+02:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=49 layers.offload=44 layers.split=27,17 memory.available="[11.0 GiB 7.1 GiB]" memory.gpu_overhead="0 B" memory.required.full="19.5 GiB" memory.required.partial="17.8 GiB" memory.required.kv="192.0 MiB" memory.required.allocations="[10.7 GiB 7.1 GiB]" memory.weights.total="17.2 GiB" memory.weights.repeating="16.9 GiB" memory.weights.nonrepeating="243.4 MiB" memory.graph.full="256.0 MiB" memory.graph.partial="256.0 MiB"
time=2025-05-01T14:09:32.383+02:00 level=DEBUG source=server.go:262 msg="compatible gpu libraries" compatible="[cuda_v12 cuda_v11]"
llama_model_loader: loaded meta data with 31 key-value pairs and 579 tensors from C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 30B A3B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3
llama_model_loader: - kv   4:                         general.size_label str              = 30B-A3B
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                       qwen3moe.block_count u32              = 48
llama_model_loader: - kv   7:                    qwen3moe.context_length u32              = 40960
llama_model_loader: - kv   8:                  qwen3moe.embedding_length u32              = 2048
llama_model_loader: - kv   9:               qwen3moe.feed_forward_length u32              = 6144
llama_model_loader: - kv  10:              qwen3moe.attention.head_count u32              = 32
llama_model_loader: - kv  11:           qwen3moe.attention.head_count_kv u32              = 4
llama_model_loader: - kv  12:                    qwen3moe.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                 qwen3moe.expert_used_count u32              = 8
llama_model_loader: - kv  15:              qwen3moe.attention.key_length u32              = 128
llama_model_loader: - kv  16:            qwen3moe.attention.value_length u32              = 128
llama_model_loader: - kv  17:                      qwen3moe.expert_count u32              = 128
llama_model_loader: - kv  18:        qwen3moe.expert_feed_forward_length u32              = 768
llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  22:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  23:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  24:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  25:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  27:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - kv  30:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  241 tensors
llama_model_loader: - type  f16:   48 tensors
llama_model_loader: - type q4_K:  265 tensors
llama_model_loader: - type q6_K:   25 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 17.34 GiB (4.88 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3moe
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 30.53 B
print_info: general.name     = Qwen3 30B A3B
print_info: n_ff_exp         = 0
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-05-01T14:09:32.782+02:00 level=DEBUG source=server.go:335 msg="adding gpu library" path=C:\Users\Testeur1\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12
time=2025-05-01T14:09:32.782+02:00 level=DEBUG source=server.go:343 msg="adding gpu dependency paths" paths=[C:\Users\Testeur1\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12]
time=2025-05-01T14:09:32.783+02:00 level=INFO source=server.go:405 msg="starting llama server" cmd="C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Ollama_Models\\blobs\\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac --ctx-size 2048 --batch-size 512 --n-gpu-layers 44 --verbose --threads 14 --no-mmap --parallel 1 --tensor-split 27,17 --port 49715"
time=2025-05-01T14:09:32.783+02:00 level=DEBUG source=server.go:423 msg=subprocess environment="[PATH=C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v12;C:\\Program Files (x86)\\VMware\\VMware Workstation\\bin\\;C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\iCLS\\;C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\iCLS\\;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\Git\\cmd;C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\DAL;C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\DAL;C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\IPT;C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\IPT;C:\\Program Files\\dotnet\\;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Eclipse Adoptium\\jdk-21.0.7.6-hotspot\\bin;C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Python\\Launcher\\;C:\\Users\\Testeur1\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama;C:\\Users\\Testeur1\\AppData\\Local\\GitHubDesktop\\bin;C:\\Users\\Testeur1\\.dotnet\\tools;C:\\Python\\miniforge3;C:\\Python\\miniforge3\\Scripts;;C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v12;C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\lib\\ollama CUDA_VISIBLE_DEVICES=GPU-962a842b-b382-6457-65a1-3cffec62ba6f,GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385]"
time=2025-05-01T14:09:32.799+02:00 level=INFO source=sched.go:451 msg="loaded runners" count=1
time=2025-05-01T14:09:32.799+02:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
time=2025-05-01T14:09:32.801+02:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
time=2025-05-01T14:09:32.864+02:00 level=INFO source=runner.go:853 msg="starting go runner"
time=2025-05-01T14:09:33.007+02:00 level=DEBUG source=ggml.go:99 msg="ggml backend load all from path" path=C:\Users\Testeur1\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX A2000 12GB, compute capability 8.6, VMM: yes
  Device 1: Quadro M5000, compute capability 5.2, VMM: yes
load_backend: loaded CUDA backend from C:\Users\Testeur1\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12\ggml-cuda.dll
time=2025-05-01T14:09:35.355+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files (x86)\\VMware\\VMware Workstation\\bin"
time=2025-05-01T14:09:35.362+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\iCLS"
time=2025-05-01T14:09:35.362+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\iCLS"
time=2025-05-01T14:09:35.362+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Windows\system32
time=2025-05-01T14:09:35.362+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Windows
time=2025-05-01T14:09:35.362+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Windows\System32\Wbem
time=2025-05-01T14:09:35.362+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Windows\System32\WindowsPowerShell\v1.0
time=2025-05-01T14:09:35.362+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Windows\System32\OpenSSH
time=2025-05-01T14:09:35.362+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files\\Git\\cmd"
time=2025-05-01T14:09:35.362+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\DAL"
time=2025-05-01T14:09:35.362+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\DAL"
time=2025-05-01T14:09:35.362+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\IPT"
time=2025-05-01T14:09:35.362+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\IPT"
time=2025-05-01T14:09:35.362+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files\\dotnet"
time=2025-05-01T14:09:35.362+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common"
time=2025-05-01T14:09:35.362+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Eclipse Adoptium\\jdk-21.0.7.6-hotspot\\bin"
time=2025-05-01T14:09:35.362+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Users\Testeur1\AppData\Local\Programs\Python\Launcher
time=2025-05-01T14:09:35.362+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Users\Testeur1\AppData\Local\Microsoft\WindowsApps
time=2025-05-01T14:09:35.362+02:00 level=DEBUG source=ggml.go:99 msg="ggml backend load all from path" path=C:\Users\Testeur1\AppData\Local\Programs\Ollama
time=2025-05-01T14:09:35.381+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Users\Testeur1\AppData\Local\GitHubDesktop\bin
time=2025-05-01T14:09:35.381+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Users\Testeur1\.dotnet\tools
time=2025-05-01T14:09:35.381+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Python\miniforge3
time=2025-05-01T14:09:35.381+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Python\miniforge3\Scripts
time=2025-05-01T14:09:35.384+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Users\Testeur1
time=2025-05-01T14:09:35.384+02:00 level=DEBUG source=ggml.go:99 msg="ggml backend load all from path" path=C:\Users\Testeur1\AppData\Local\Programs\Ollama\lib\ollama
load_backend: loaded CPU backend from C:\Users\Testeur1\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll
time=2025-05-01T14:09:35.589+02:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-05-01T14:09:35.591+02:00 level=INFO source=runner.go:913 msg="Server listening on 127.0.0.1:49715"
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA RTX A2000 12GB) - 11248 MiB free
llama_model_load_from_file_impl: using device CUDA1 (Quadro M5000) - 7296 MiB free
time=2025-05-01T14:09:35.831+02:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 31 key-value pairs and 579 tensors from C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 30B A3B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3
llama_model_loader: - kv   4:                         general.size_label str              = 30B-A3B
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                       qwen3moe.block_count u32              = 48
llama_model_loader: - kv   7:                    qwen3moe.context_length u32              = 40960
llama_model_loader: - kv   8:                  qwen3moe.embedding_length u32              = 2048
llama_model_loader: - kv   9:               qwen3moe.feed_forward_length u32              = 6144
llama_model_loader: - kv  10:              qwen3moe.attention.head_count u32              = 32
llama_model_loader: - kv  11:           qwen3moe.attention.head_count_kv u32              = 4
llama_model_loader: - kv  12:                    qwen3moe.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                 qwen3moe.expert_used_count u32              = 8
llama_model_loader: - kv  15:              qwen3moe.attention.key_length u32              = 128
llama_model_loader: - kv  16:            qwen3moe.attention.value_length u32              = 128
llama_model_loader: - kv  17:                      qwen3moe.expert_count u32              = 128
llama_model_loader: - kv  18:        qwen3moe.expert_feed_forward_length u32              = 768
llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  22:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  23:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  24:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  25:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  27:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - kv  30:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  241 tensors
llama_model_loader: - type  f16:   48 tensors
llama_model_loader: - type q4_K:  265 tensors
llama_model_loader: - type q6_K:   25 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 17.34 GiB (4.88 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3moe
print_info: vocab_only       = 0
print_info: n_ctx_train      = 40960
print_info: n_embd           = 2048
print_info: n_layer          = 48
print_info: n_head           = 32
print_info: n_head_kv        = 4
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 6144
print_info: n_expert         = 128
print_info: n_expert_used    = 8
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 40960
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = ?B
print_info: model params     = 30.53 B
print_info: general.name     = Qwen3 30B A3B
print_info: n_ff_exp         = 768
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: layer   0 assigned to device CPU, is_swa = 0
load_tensors: layer   1 assigned to device CPU, is_swa = 0
load_tensors: layer   2 assigned to device CPU, is_swa = 0
load_tensors: layer   3 assigned to device CPU, is_swa = 0
load_tensors: layer   4 assigned to device CUDA0, is_swa = 0
load_tensors: layer   5 assigned to device CUDA0, is_swa = 0
load_tensors: layer   6 assigned to device CUDA0, is_swa = 0
load_tensors: layer   7 assigned to device CUDA0, is_swa = 0
load_tensors: layer   8 assigned to device CUDA0, is_swa = 0
load_tensors: layer   9 assigned to device CUDA0, is_swa = 0
load_tensors: layer  10 assigned to device CUDA0, is_swa = 0
load_tensors: layer  11 assigned to device CUDA0, is_swa = 0
load_tensors: layer  12 assigned to device CUDA0, is_swa = 0
load_tensors: layer  13 assigned to device CUDA0, is_swa = 0
load_tensors: layer  14 assigned to device CUDA0, is_swa = 0
load_tensors: layer  15 assigned to device CUDA0, is_swa = 0
load_tensors: layer  16 assigned to device CUDA0, is_swa = 0
load_tensors: layer  17 assigned to device CUDA0, is_swa = 0
load_tensors: layer  18 assigned to device CUDA0, is_swa = 0
load_tensors: layer  19 assigned to device CUDA0, is_swa = 0
load_tensors: layer  20 assigned to device CUDA0, is_swa = 0
load_tensors: layer  21 assigned to device CUDA0, is_swa = 0
load_tensors: layer  22 assigned to device CUDA0, is_swa = 0
load_tensors: layer  23 assigned to device CUDA0, is_swa = 0
load_tensors: layer  24 assigned to device CUDA0, is_swa = 0
load_tensors: layer  25 assigned to device CUDA0, is_swa = 0
load_tensors: layer  26 assigned to device CUDA0, is_swa = 0
load_tensors: layer  27 assigned to device CUDA0, is_swa = 0
load_tensors: layer  28 assigned to device CUDA0, is_swa = 0
load_tensors: layer  29 assigned to device CUDA0, is_swa = 0
load_tensors: layer  30 assigned to device CUDA0, is_swa = 0
load_tensors: layer  31 assigned to device CUDA1, is_swa = 0
load_tensors: layer  32 assigned to device CUDA1, is_swa = 0
load_tensors: layer  33 assigned to device CUDA1, is_swa = 0
load_tensors: layer  34 assigned to device CUDA1, is_swa = 0
load_tensors: layer  35 assigned to device CUDA1, is_swa = 0
load_tensors: layer  36 assigned to device CUDA1, is_swa = 0
load_tensors: layer  37 assigned to device CUDA1, is_swa = 0
load_tensors: layer  38 assigned to device CUDA1, is_swa = 0
load_tensors: layer  39 assigned to device CUDA1, is_swa = 0
load_tensors: layer  40 assigned to device CUDA1, is_swa = 0
load_tensors: layer  41 assigned to device CUDA1, is_swa = 0
load_tensors: layer  42 assigned to device CUDA1, is_swa = 0
load_tensors: layer  43 assigned to device CUDA1, is_swa = 0
load_tensors: layer  44 assigned to device CUDA1, is_swa = 0
load_tensors: layer  45 assigned to device CUDA1, is_swa = 0
load_tensors: layer  46 assigned to device CUDA1, is_swa = 0
load_tensors: layer  47 assigned to device CUDA1, is_swa = 0
load_tensors: layer  48 assigned to device CPU, is_swa = 0
load_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
load_tensors: offloading 44 repeating layers to GPU
load_tensors: offloaded 44/49 layers to GPU
load_tensors:    CUDA_Host model buffer size =  1787.75 MiB
load_tensors:        CUDA0 model buffer size =  9582.64 MiB
load_tensors:        CUDA1 model buffer size =  6216.85 MiB
load_tensors:          CPU model buffer size =   166.92 MiB
load_all_data: buffer type CUDA_Host is not the default buffer type for device CUDA0 for async uploads
time=2025-05-01T14:09:37.086+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.00"
time=2025-05-01T14:09:37.337+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.02"
time=2025-05-01T14:09:37.587+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.04"
time=2025-05-01T14:09:37.838+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.06"
time=2025-05-01T14:09:38.089+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.08"
time=2025-05-01T14:09:38.340+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.09"
load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0
time=2025-05-01T14:09:38.590+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.12"
time=2025-05-01T14:09:38.841+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.13"
time=2025-05-01T14:09:39.092+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.15"
time=2025-05-01T14:09:39.342+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.17"
time=2025-05-01T14:09:39.592+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.19"
time=2025-05-01T14:09:39.843+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.21"
time=2025-05-01T14:09:40.095+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.23"
time=2025-05-01T14:09:40.346+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.25"
time=2025-05-01T14:09:40.596+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.27"
time=2025-05-01T14:09:40.846+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.29"
time=2025-05-01T14:09:41.097+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.31"
time=2025-05-01T14:09:41.348+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.34"
time=2025-05-01T14:09:41.600+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.36"
time=2025-05-01T14:09:41.851+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.38"
time=2025-05-01T14:09:42.102+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.40"
time=2025-05-01T14:09:42.354+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.41"
time=2025-05-01T14:09:42.605+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.44"
time=2025-05-01T14:09:42.856+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.46"
time=2025-05-01T14:09:43.107+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.48"
time=2025-05-01T14:09:43.358+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.50"
time=2025-05-01T14:09:43.608+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.53"
time=2025-05-01T14:09:43.859+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.55"
time=2025-05-01T14:09:44.110+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.57"
time=2025-05-01T14:09:44.361+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.59"
time=2025-05-01T14:09:44.612+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.62"
time=2025-05-01T14:09:44.863+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.63"
load_all_data: using async uploads for device CUDA1, buffer type CUDA1, backend CUDA1
time=2025-05-01T14:09:45.114+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.64"
time=2025-05-01T14:14:45.292+02:00 level=ERROR source=sched.go:457 msg="error loading llama server" error="timed out waiting for llama runner to start - progress 0.64 - "
time=2025-05-01T14:14:45.292+02:00 level=DEBUG source=sched.go:460 msg="triggering expiration for failed load" model=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
time=2025-05-01T14:14:45.292+02:00 level=DEBUG source=sched.go:362 msg="runner expired event received" modelPath=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
time=2025-05-01T14:14:45.292+02:00 level=DEBUG source=sched.go:377 msg="got lock to unload" modelPath=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
[GIN] 2025/05/01 - 14:14:45 | 500 |         5m13s |       127.0.0.1 | POST     "/api/generate"
time=2025-05-01T14:14:45.292+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.5 GiB" before.free_swap="127.9 GiB" now.total="127.9 GiB" now.free="118.0 GiB" now.free_swap="109.6 GiB"
time=2025-05-01T14:14:45.339+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="862.0 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="1.5 GiB" now.used="9.6 GiB"
time=2025-05-01T14:14:45.339+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="578.1 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="964.8 MiB" now.used="6.5 GiB"
releasing nvml library
time=2025-05-01T14:14:45.371+02:00 level=DEBUG source=server.go:1001 msg="stopping llama server"
time=2025-05-01T14:14:45.380+02:00 level=DEBUG source=server.go:1007 msg="waiting for llama server to exit"
time=2025-05-01T14:14:45.592+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="118.0 GiB" before.free_swap="109.6 GiB" now.total="127.9 GiB" now.free="118.1 GiB" now.free_swap="119.6 GiB"
time=2025-05-01T14:14:45.798+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="862.0 MiB" before.total="12.0 GiB" before.free="1.5 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="172.0 MiB"
time=2025-05-01T14:14:45.799+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="578.1 MiB" before.total="8.0 GiB" before.free="964.8 MiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="364.0 MiB"
releasing nvml library
time=2025-05-01T14:14:45.802+02:00 level=DEBUG source=sched.go:661 msg="gpu VRAM free memory converged after 0.51 seconds" model=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
time=2025-05-01T14:14:45.880+02:00 level=DEBUG source=server.go:1011 msg="llama server stopped"
time=2025-05-01T14:14:45.881+02:00 level=DEBUG source=sched.go:382 msg="runner released" modelPath=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
time=2025-05-01T14:14:45.881+02:00 level=DEBUG source=sched.go:386 msg="sending an unloaded event" modelPath=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
time=2025-05-01T14:14:45.881+02:00 level=DEBUG source=sched.go:310 msg="ignoring unload event with no pending requests"

OS

Windows 11 Pro 24H2

GPU

RTX A2000 12 GB
Quadro M5000 8 GB

CPU

Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz

Ollama version

0.6.6

llama_model_loader: - kv 0: general.architecture str = qwen3moe llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Qwen3 30B A3B llama_model_loader: - kv 3: general.basename str = Qwen3 llama_model_loader: - kv 4: general.size_label str = 30B-A3B llama_model_loader: - kv 5: general.license str = apache-2.0 llama_model_loader: - kv 6: qwen3moe.block_count u32 = 48 llama_model_loader: - kv 7: qwen3moe.context_length u32 = 40960 llama_model_loader: - kv 8: qwen3moe.embedding_length u32 = 2048 llama_model_loader: - kv 9: qwen3moe.feed_forward_length u32 = 6144 llama_model_loader: - kv 10: qwen3moe.attention.head_count u32 = 32 llama_model_loader: - kv 11: qwen3moe.attention.head_count_kv u32 = 4 llama_model_loader: - kv 12: qwen3moe.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 13: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 14: qwen3moe.expert_used_count u32 = 8 llama_model_loader: - kv 15: qwen3moe.attention.key_length u32 = 128 llama_model_loader: - kv 16: qwen3moe.attention.value_length u32 = 128 llama_model_loader: - kv 17: qwen3moe.expert_count u32 = 128 llama_model_loader: - kv 18: qwen3moe.expert_feed_forward_length u32 = 768 llama_model_loader: - kv 19: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 20: tokenizer.ggml.pre str = qwen2 llama_model_loader: - kv 21: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 22: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 23: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... llama_model_loader: - kv 24: tokenizer.ggml.eos_token_id u32 = 151645 llama_model_loader: - kv 25: tokenizer.ggml.padding_token_id u32 = 151643 llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 151643 llama_model_loader: - kv 27: tokenizer.ggml.add_bos_token bool = false llama_model_loader: - kv 28: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... 
llama_model_loader: - kv 29: general.quantization_version u32 = 2 llama_model_loader: - kv 30: general.file_type u32 = 15 llama_model_loader: - type f32: 241 tensors llama_model_loader: - type f16: 48 tensors llama_model_loader: - type q4_K: 265 tensors llama_model_loader: - type q6_K: 25 tensors print_info: file format = GGUF V3 (latest) print_info: file type = Q4_K - Medium print_info: file size = 17.34 GiB (4.88 BPW) init_tokenizer: initializing tokenizer for type 2 load: control token: 151659 '<|fim_prefix|>' is not marked as EOG load: control token: 151656 '<|video_pad|>' is not marked as EOG load: control token: 151655 '<|image_pad|>' is not marked as EOG load: control token: 151653 '<|vision_end|>' is not marked as EOG load: control token: 151652 '<|vision_start|>' is not marked as EOG load: control token: 151651 '<|quad_end|>' is not marked as EOG load: control token: 151649 '<|box_end|>' is not marked as EOG load: control token: 151648 '<|box_start|>' is not marked as EOG load: control token: 151646 '<|object_ref_start|>' is not marked as EOG load: control token: 151644 '<|im_start|>' is not marked as EOG load: control token: 151661 '<|fim_suffix|>' is not marked as EOG load: control token: 151647 '<|object_ref_end|>' is not marked as EOG load: control token: 151660 '<|fim_middle|>' is not marked as EOG load: control token: 151654 '<|vision_pad|>' is not marked as EOG load: control token: 151650 '<|quad_start|>' is not marked as EOG load: special tokens cache size = 26 load: token to piece cache size = 0.9311 MB print_info: arch = qwen3moe print_info: vocab_only = 1 print_info: model type = ?B print_info: model params = 30.53 B print_info: general.name = Qwen3 30B A3B print_info: n_ff_exp = 0 print_info: vocab type = BPE print_info: n_vocab = 151936 print_info: n_merges = 151387 print_info: BOS token = 151643 '<|endoftext|>' print_info: EOS token = 151645 '<|im_end|>' print_info: EOT token = 151645 '<|im_end|>' print_info: PAD token = 151643 '<|endoftext|>' print_info: LF token = 198 'Ċ' print_info: FIM PRE token = 151659 '<|fim_prefix|>' print_info: FIM SUF token = 151661 '<|fim_suffix|>' print_info: FIM MID token = 151660 '<|fim_middle|>' print_info: FIM PAD token = 151662 '<|fim_pad|>' print_info: FIM REP token = 151663 '<|repo_name|>' print_info: FIM SEP token = 151664 '<|file_sep|>' print_info: EOG token = 151643 '<|endoftext|>' print_info: EOG token = 151645 '<|im_end|>' print_info: EOG token = 151662 '<|fim_pad|>' print_info: EOG token = 151663 '<|repo_name|>' print_info: EOG token = 151664 '<|file_sep|>' print_info: max token length = 256 llama_model_load: vocab only - skipping tensors time=2025-05-01T14:09:32.782+02:00 level=DEBUG source=server.go:335 msg="adding gpu library" path=C:\Users\Testeur1\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12 time=2025-05-01T14:09:32.782+02:00 level=DEBUG source=server.go:343 msg="adding gpu dependency paths" paths=[C:\Users\Testeur1\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12] time=2025-05-01T14:09:32.783+02:00 level=INFO source=server.go:405 msg="starting llama server" cmd="C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Ollama_Models\\blobs\\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac --ctx-size 2048 --batch-size 512 --n-gpu-layers 44 --verbose --threads 14 --no-mmap --parallel 1 --tensor-split 27,17 --port 49715" time=2025-05-01T14:09:32.783+02:00 level=DEBUG source=server.go:423 msg=subprocess 
environment="[PATH=C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v12;C:\\Program Files (x86)\\VMware\\VMware Workstation\\bin\\;C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\iCLS\\;C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\iCLS\\;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\Git\\cmd;C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\DAL;C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\DAL;C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\IPT;C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\IPT;C:\\Program Files\\dotnet\\;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Eclipse Adoptium\\jdk-21.0.7.6-hotspot\\bin;C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Python\\Launcher\\;C:\\Users\\Testeur1\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama;C:\\Users\\Testeur1\\AppData\\Local\\GitHubDesktop\\bin;C:\\Users\\Testeur1\\.dotnet\\tools;C:\\Python\\miniforge3;C:\\Python\\miniforge3\\Scripts;;C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v12;C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\lib\\ollama CUDA_VISIBLE_DEVICES=GPU-962a842b-b382-6457-65a1-3cffec62ba6f,GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385]" time=2025-05-01T14:09:32.799+02:00 level=INFO source=sched.go:451 msg="loaded runners" count=1 time=2025-05-01T14:09:32.799+02:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding" time=2025-05-01T14:09:32.801+02:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error" time=2025-05-01T14:09:32.864+02:00 level=INFO source=runner.go:853 msg="starting go runner" time=2025-05-01T14:09:33.007+02:00 level=DEBUG source=ggml.go:99 msg="ggml backend load all from path" path=C:\Users\Testeur1\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 2 CUDA devices: Device 0: NVIDIA RTX A2000 12GB, compute capability 8.6, VMM: yes Device 1: Quadro M5000, compute capability 5.2, VMM: yes load_backend: loaded CUDA backend from C:\Users\Testeur1\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12\ggml-cuda.dll time=2025-05-01T14:09:35.355+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files (x86)\\VMware\\VMware Workstation\\bin" time=2025-05-01T14:09:35.362+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\iCLS" time=2025-05-01T14:09:35.362+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\iCLS" time=2025-05-01T14:09:35.362+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Windows\system32 time=2025-05-01T14:09:35.362+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Windows time=2025-05-01T14:09:35.362+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Windows\System32\Wbem time=2025-05-01T14:09:35.362+02:00 level=DEBUG source=ggml.go:93 
msg="skipping path which is not part of ollama" path=C:\Windows\System32\WindowsPowerShell\v1.0 time=2025-05-01T14:09:35.362+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Windows\System32\OpenSSH time=2025-05-01T14:09:35.362+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files\\Git\\cmd" time=2025-05-01T14:09:35.362+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\DAL" time=2025-05-01T14:09:35.362+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\DAL" time=2025-05-01T14:09:35.362+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\IPT" time=2025-05-01T14:09:35.362+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\IPT" time=2025-05-01T14:09:35.362+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files\\dotnet" time=2025-05-01T14:09:35.362+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common" time=2025-05-01T14:09:35.362+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Eclipse Adoptium\\jdk-21.0.7.6-hotspot\\bin" time=2025-05-01T14:09:35.362+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Users\Testeur1\AppData\Local\Programs\Python\Launcher time=2025-05-01T14:09:35.362+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Users\Testeur1\AppData\Local\Microsoft\WindowsApps time=2025-05-01T14:09:35.362+02:00 level=DEBUG source=ggml.go:99 msg="ggml backend load all from path" path=C:\Users\Testeur1\AppData\Local\Programs\Ollama time=2025-05-01T14:09:35.381+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Users\Testeur1\AppData\Local\GitHubDesktop\bin time=2025-05-01T14:09:35.381+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Users\Testeur1\.dotnet\tools time=2025-05-01T14:09:35.381+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Python\miniforge3 time=2025-05-01T14:09:35.381+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Python\miniforge3\Scripts time=2025-05-01T14:09:35.384+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Users\Testeur1 time=2025-05-01T14:09:35.384+02:00 level=DEBUG source=ggml.go:99 msg="ggml backend load all from path" path=C:\Users\Testeur1\AppData\Local\Programs\Ollama\lib\ollama load_backend: loaded CPU backend from C:\Users\Testeur1\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll time=2025-05-01T14:09:35.589+02:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 
CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang) time=2025-05-01T14:09:35.591+02:00 level=INFO source=runner.go:913 msg="Server listening on 127.0.0.1:49715" llama_model_load_from_file_impl: using device CUDA0 (NVIDIA RTX A2000 12GB) - 11248 MiB free llama_model_load_from_file_impl: using device CUDA1 (Quadro M5000) - 7296 MiB free time=2025-05-01T14:09:35.831+02:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model" llama_model_loader: loaded meta data with 31 key-value pairs and 579 tensors from C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = qwen3moe llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Qwen3 30B A3B llama_model_loader: - kv 3: general.basename str = Qwen3 llama_model_loader: - kv 4: general.size_label str = 30B-A3B llama_model_loader: - kv 5: general.license str = apache-2.0 llama_model_loader: - kv 6: qwen3moe.block_count u32 = 48 llama_model_loader: - kv 7: qwen3moe.context_length u32 = 40960 llama_model_loader: - kv 8: qwen3moe.embedding_length u32 = 2048 llama_model_loader: - kv 9: qwen3moe.feed_forward_length u32 = 6144 llama_model_loader: - kv 10: qwen3moe.attention.head_count u32 = 32 llama_model_loader: - kv 11: qwen3moe.attention.head_count_kv u32 = 4 llama_model_loader: - kv 12: qwen3moe.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 13: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 14: qwen3moe.expert_used_count u32 = 8 llama_model_loader: - kv 15: qwen3moe.attention.key_length u32 = 128 llama_model_loader: - kv 16: qwen3moe.attention.value_length u32 = 128 llama_model_loader: - kv 17: qwen3moe.expert_count u32 = 128 llama_model_loader: - kv 18: qwen3moe.expert_feed_forward_length u32 = 768 llama_model_loader: - kv 19: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 20: tokenizer.ggml.pre str = qwen2 llama_model_loader: - kv 21: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 22: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 23: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... llama_model_loader: - kv 24: tokenizer.ggml.eos_token_id u32 = 151645 llama_model_loader: - kv 25: tokenizer.ggml.padding_token_id u32 = 151643 llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 151643 llama_model_loader: - kv 27: tokenizer.ggml.add_bos_token bool = false llama_model_loader: - kv 28: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... 
llama_model_loader: - kv 29: general.quantization_version u32 = 2 llama_model_loader: - kv 30: general.file_type u32 = 15 llama_model_loader: - type f32: 241 tensors llama_model_loader: - type f16: 48 tensors llama_model_loader: - type q4_K: 265 tensors llama_model_loader: - type q6_K: 25 tensors print_info: file format = GGUF V3 (latest) print_info: file type = Q4_K - Medium print_info: file size = 17.34 GiB (4.88 BPW) init_tokenizer: initializing tokenizer for type 2 load: control token: 151659 '<|fim_prefix|>' is not marked as EOG load: control token: 151656 '<|video_pad|>' is not marked as EOG load: control token: 151655 '<|image_pad|>' is not marked as EOG load: control token: 151653 '<|vision_end|>' is not marked as EOG load: control token: 151652 '<|vision_start|>' is not marked as EOG load: control token: 151651 '<|quad_end|>' is not marked as EOG load: control token: 151649 '<|box_end|>' is not marked as EOG load: control token: 151648 '<|box_start|>' is not marked as EOG load: control token: 151646 '<|object_ref_start|>' is not marked as EOG load: control token: 151644 '<|im_start|>' is not marked as EOG load: control token: 151661 '<|fim_suffix|>' is not marked as EOG load: control token: 151647 '<|object_ref_end|>' is not marked as EOG load: control token: 151660 '<|fim_middle|>' is not marked as EOG load: control token: 151654 '<|vision_pad|>' is not marked as EOG load: control token: 151650 '<|quad_start|>' is not marked as EOG load: special tokens cache size = 26 load: token to piece cache size = 0.9311 MB print_info: arch = qwen3moe print_info: vocab_only = 0 print_info: n_ctx_train = 40960 print_info: n_embd = 2048 print_info: n_layer = 48 print_info: n_head = 32 print_info: n_head_kv = 4 print_info: n_rot = 128 print_info: n_swa = 0 print_info: n_swa_pattern = 1 print_info: n_embd_head_k = 128 print_info: n_embd_head_v = 128 print_info: n_gqa = 8 print_info: n_embd_k_gqa = 512 print_info: n_embd_v_gqa = 512 print_info: f_norm_eps = 0.0e+00 print_info: f_norm_rms_eps = 1.0e-06 print_info: f_clamp_kqv = 0.0e+00 print_info: f_max_alibi_bias = 0.0e+00 print_info: f_logit_scale = 0.0e+00 print_info: f_attn_scale = 0.0e+00 print_info: n_ff = 6144 print_info: n_expert = 128 print_info: n_expert_used = 8 print_info: causal attn = 1 print_info: pooling type = 0 print_info: rope type = 2 print_info: rope scaling = linear print_info: freq_base_train = 1000000.0 print_info: freq_scale_train = 1 print_info: n_ctx_orig_yarn = 40960 print_info: rope_finetuned = unknown print_info: ssm_d_conv = 0 print_info: ssm_d_inner = 0 print_info: ssm_d_state = 0 print_info: ssm_dt_rank = 0 print_info: ssm_dt_b_c_rms = 0 print_info: model type = ?B print_info: model params = 30.53 B print_info: general.name = Qwen3 30B A3B print_info: n_ff_exp = 768 print_info: vocab type = BPE print_info: n_vocab = 151936 print_info: n_merges = 151387 print_info: BOS token = 151643 '<|endoftext|>' print_info: EOS token = 151645 '<|im_end|>' print_info: EOT token = 151645 '<|im_end|>' print_info: PAD token = 151643 '<|endoftext|>' print_info: LF token = 198 'Ċ' print_info: FIM PRE token = 151659 '<|fim_prefix|>' print_info: FIM SUF token = 151661 '<|fim_suffix|>' print_info: FIM MID token = 151660 '<|fim_middle|>' print_info: FIM PAD token = 151662 '<|fim_pad|>' print_info: FIM REP token = 151663 '<|repo_name|>' print_info: FIM SEP token = 151664 '<|file_sep|>' print_info: EOG token = 151643 '<|endoftext|>' print_info: EOG token = 151645 '<|im_end|>' print_info: EOG token = 151662 '<|fim_pad|>' print_info: EOG token 
= 151663 '<|repo_name|>' print_info: EOG token = 151664 '<|file_sep|>' print_info: max token length = 256 load_tensors: loading model tensors, this can take a while... (mmap = false) load_tensors: layer 0 assigned to device CPU, is_swa = 0 load_tensors: layer 1 assigned to device CPU, is_swa = 0 load_tensors: layer 2 assigned to device CPU, is_swa = 0 load_tensors: layer 3 assigned to device CPU, is_swa = 0 load_tensors: layer 4 assigned to device CUDA0, is_swa = 0 load_tensors: layer 5 assigned to device CUDA0, is_swa = 0 load_tensors: layer 6 assigned to device CUDA0, is_swa = 0 load_tensors: layer 7 assigned to device CUDA0, is_swa = 0 load_tensors: layer 8 assigned to device CUDA0, is_swa = 0 load_tensors: layer 9 assigned to device CUDA0, is_swa = 0 load_tensors: layer 10 assigned to device CUDA0, is_swa = 0 load_tensors: layer 11 assigned to device CUDA0, is_swa = 0 load_tensors: layer 12 assigned to device CUDA0, is_swa = 0 load_tensors: layer 13 assigned to device CUDA0, is_swa = 0 load_tensors: layer 14 assigned to device CUDA0, is_swa = 0 load_tensors: layer 15 assigned to device CUDA0, is_swa = 0 load_tensors: layer 16 assigned to device CUDA0, is_swa = 0 load_tensors: layer 17 assigned to device CUDA0, is_swa = 0 load_tensors: layer 18 assigned to device CUDA0, is_swa = 0 load_tensors: layer 19 assigned to device CUDA0, is_swa = 0 load_tensors: layer 20 assigned to device CUDA0, is_swa = 0 load_tensors: layer 21 assigned to device CUDA0, is_swa = 0 load_tensors: layer 22 assigned to device CUDA0, is_swa = 0 load_tensors: layer 23 assigned to device CUDA0, is_swa = 0 load_tensors: layer 24 assigned to device CUDA0, is_swa = 0 load_tensors: layer 25 assigned to device CUDA0, is_swa = 0 load_tensors: layer 26 assigned to device CUDA0, is_swa = 0 load_tensors: layer 27 assigned to device CUDA0, is_swa = 0 load_tensors: layer 28 assigned to device CUDA0, is_swa = 0 load_tensors: layer 29 assigned to device CUDA0, is_swa = 0 load_tensors: layer 30 assigned to device CUDA0, is_swa = 0 load_tensors: layer 31 assigned to device CUDA1, is_swa = 0 load_tensors: layer 32 assigned to device CUDA1, is_swa = 0 load_tensors: layer 33 assigned to device CUDA1, is_swa = 0 load_tensors: layer 34 assigned to device CUDA1, is_swa = 0 load_tensors: layer 35 assigned to device CUDA1, is_swa = 0 load_tensors: layer 36 assigned to device CUDA1, is_swa = 0 load_tensors: layer 37 assigned to device CUDA1, is_swa = 0 load_tensors: layer 38 assigned to device CUDA1, is_swa = 0 load_tensors: layer 39 assigned to device CUDA1, is_swa = 0 load_tensors: layer 40 assigned to device CUDA1, is_swa = 0 load_tensors: layer 41 assigned to device CUDA1, is_swa = 0 load_tensors: layer 42 assigned to device CUDA1, is_swa = 0 load_tensors: layer 43 assigned to device CUDA1, is_swa = 0 load_tensors: layer 44 assigned to device CUDA1, is_swa = 0 load_tensors: layer 45 assigned to device CUDA1, is_swa = 0 load_tensors: layer 46 assigned to device CUDA1, is_swa = 0 load_tensors: layer 47 assigned to device CUDA1, is_swa = 0 load_tensors: layer 48 assigned to device CPU, is_swa = 0 load_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead load_tensors: offloading 44 repeating layers to GPU load_tensors: offloaded 44/49 layers to GPU load_tensors: CUDA_Host model buffer size = 1787.75 MiB load_tensors: CUDA0 model buffer size = 9582.64 MiB load_tensors: CUDA1 model buffer size = 6216.85 MiB load_tensors: CPU model buffer size = 166.92 MiB 
load_all_data: buffer type CUDA_Host is not the default buffer type for device CUDA0 for async uploads time=2025-05-01T14:09:37.086+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.00" time=2025-05-01T14:09:37.337+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.02" time=2025-05-01T14:09:37.587+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.04" time=2025-05-01T14:09:37.838+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.06" time=2025-05-01T14:09:38.089+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.08" time=2025-05-01T14:09:38.340+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.09" load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0 time=2025-05-01T14:09:38.590+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.12" time=2025-05-01T14:09:38.841+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.13" time=2025-05-01T14:09:39.092+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.15" time=2025-05-01T14:09:39.342+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.17" time=2025-05-01T14:09:39.592+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.19" time=2025-05-01T14:09:39.843+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.21" time=2025-05-01T14:09:40.095+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.23" time=2025-05-01T14:09:40.346+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.25" time=2025-05-01T14:09:40.596+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.27" time=2025-05-01T14:09:40.846+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.29" time=2025-05-01T14:09:41.097+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.31" time=2025-05-01T14:09:41.348+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.34" time=2025-05-01T14:09:41.600+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.36" time=2025-05-01T14:09:41.851+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.38" time=2025-05-01T14:09:42.102+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.40" time=2025-05-01T14:09:42.354+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.41" time=2025-05-01T14:09:42.605+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.44" time=2025-05-01T14:09:42.856+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.46" time=2025-05-01T14:09:43.107+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.48" time=2025-05-01T14:09:43.358+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.50" time=2025-05-01T14:09:43.608+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.53" time=2025-05-01T14:09:43.859+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.55" time=2025-05-01T14:09:44.110+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.57" time=2025-05-01T14:09:44.361+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.59" time=2025-05-01T14:09:44.612+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.62" time=2025-05-01T14:09:44.863+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.63" load_all_data: using async uploads for device CUDA1, buffer type CUDA1, backend CUDA1 time=2025-05-01T14:09:45.114+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.64" 
time=2025-05-01T14:14:45.292+02:00 level=ERROR source=sched.go:457 msg="error loading llama server" error="timed out waiting for llama runner to start - progress 0.64 - " time=2025-05-01T14:14:45.292+02:00 level=DEBUG source=sched.go:460 msg="triggering expiration for failed load" model=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac time=2025-05-01T14:14:45.292+02:00 level=DEBUG source=sched.go:362 msg="runner expired event received" modelPath=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac time=2025-05-01T14:14:45.292+02:00 level=DEBUG source=sched.go:377 msg="got lock to unload" modelPath=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac [GIN] 2025/05/01 - 14:14:45 | 500 | 5m13s | 127.0.0.1 | POST "/api/generate" time=2025-05-01T14:14:45.292+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.5 GiB" before.free_swap="127.9 GiB" now.total="127.9 GiB" now.free="118.0 GiB" now.free_swap="109.6 GiB" time=2025-05-01T14:14:45.339+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="862.0 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="1.5 GiB" now.used="9.6 GiB" time=2025-05-01T14:14:45.339+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="578.1 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="964.8 MiB" now.used="6.5 GiB" releasing nvml library time=2025-05-01T14:14:45.371+02:00 level=DEBUG source=server.go:1001 msg="stopping llama server" time=2025-05-01T14:14:45.380+02:00 level=DEBUG source=server.go:1007 msg="waiting for llama server to exit" time=2025-05-01T14:14:45.592+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="118.0 GiB" before.free_swap="109.6 GiB" now.total="127.9 GiB" now.free="118.1 GiB" now.free_swap="119.6 GiB" time=2025-05-01T14:14:45.798+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="862.0 MiB" before.total="12.0 GiB" before.free="1.5 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="172.0 MiB" time=2025-05-01T14:14:45.799+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="578.1 MiB" before.total="8.0 GiB" before.free="964.8 MiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="364.0 MiB" releasing nvml library time=2025-05-01T14:14:45.802+02:00 level=DEBUG source=sched.go:661 msg="gpu VRAM free memory converged after 0.51 seconds" model=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac time=2025-05-01T14:14:45.880+02:00 level=DEBUG source=server.go:1011 msg="llama server stopped" time=2025-05-01T14:14:45.881+02:00 level=DEBUG source=sched.go:382 msg="runner released" modelPath=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac time=2025-05-01T14:14:45.881+02:00 level=DEBUG source=sched.go:386 msg="sending an unloaded event" modelPath=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac time=2025-05-01T14:14:45.881+02:00 level=DEBUG 
source=sched.go:310 msg="ignoring unload event with no pending requests"
```

### OS

Windows 11 Pro 24H2

### GPU

RTX A2000 12GB
M5000 8GB

### CPU

Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz

### Ollama version

0.6.6
GiteaMirror added the bug label 2026-04-29 03:08:46 -05:00

@rick-github commented on GitHub (May 1, 2025):

```
time=2025-05-01T14:09:45.114+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.64"
time=2025-05-01T14:14:45.292+02:00 level=ERROR source=sched.go:457 msg="error loading llama server" error="timed out waiting for llama runner to start - progress 0.64 - "
```

Something caused the loader to slow down or stop, and the load timeout (`OLLAMA_LOAD_TIMEOUT`, default 5 minutes) expired and aborted the load. This sometimes happens for large models on network storage, but it looks like the model is local and the file is only 18G, so it's not clear what the problem is. What happens if you try to manually read the whole file:

```
copy C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac temporay-file
```
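A possible workaround sketch (editorial, not part of the original comment): since the load is aborted when `OLLAMA_LOAD_TIMEOUT` expires, the timeout can be raised for a test run. The `10m` value below is an arbitrary example, and the variable has to be set in the shell the server is started from:

```
# Allow model loads to take up to 10 minutes instead of the default 5 (example value).
$env:OLLAMA_LOAD_TIMEOUT = "10m"
# Restart the server from this shell so it inherits the variable.
& "ollama app.exe"
```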

@neoxeo commented on GitHub (May 1, 2025):

Thanks @rick-github for your answer.

I confirm that they are local models.

`copy C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac temporay-file`

works fine without error:

```
PS C:\> copy C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac temporay-file

PS C:\> ls

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
d-----        26/04/2025     11:30                Ollama_Models
d-----        26/04/2025     10:11                PerfLogs
d-r---        26/04/2025     10:14                Program Files
d-r---        26/04/2025     10:15                Program Files (x86)
d-r---        26/04/2025     10:19                Users
d-----        26/04/2025     13:35                Windows
d-----        26/04/2025     10:28                WindowsApps
-a----        01/05/2025     10:56    18622549504 temporay-file

PS C:\>
```

@rick-github commented on GitHub (May 1, 2025):

Did the copy take more than 5 minutes?


@neoxeo commented on GitHub (May 1, 2025):

I launched a new copy and recorded the start and end times:

Start time: 05/01/2025 15:58:37
End time: 05/01/2025 15:58:49

It took 12 seconds.

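For reference, an editorial sketch (not from the original comment): PowerShell's `Measure-Command` can time the read directly, so the duration doesn't have to be computed from timestamps by hand:

```
# Time a full sequential read of the 18 GB blob into a throwaway copy.
Measure-Command {
    Copy-Item C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac temporay-file
} | Select-Object TotalSeconds
```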

@rick-github commented on GitHub (May 1, 2025):

It might be an issue with your M5000 device. The load stalled at 0.64, and 64% of the model's 49 layers is about layer 31, which is right around where the layer assignment switches from the A2000 to the M5000:

```
load_tensors: layer  30 assigned to device CUDA0, is_swa = 0
load_tensors: layer  31 assigned to device CUDA1, is_swa = 0
```

qwen3:14b is a 9G model, which will fit entirely on the A2000, but the other models are bigger and will be spread across both devices. You can experiment with [`CUDA_VISIBLE_DEVICES`](https://github.com/ollama/ollama/blob/main/docs/gpu.md#gpu-selection) to see if using only one GPU or the other results in a successful model load.

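For instance, a sketch of the suggested experiment (editorial; the UUIDs come from the log output above, and the model choice is arbitrary):

```
# Test 1: expose only the A2000 to the server, then try one of the failing models.
$env:CUDA_VISIBLE_DEVICES = "GPU-962a842b-b382-6457-65a1-3cffec62ba6f"
& "ollama app.exe"
# In another shell:
ollama run qwen3:30b-a3b --verbose

# Test 2: stop the server and repeat with only the M5000 visible.
$env:CUDA_VISIBLE_DEVICES = "GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385"
```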

@neoxeo commented on GitHub (May 1, 2025):

I tried loading a model onto the M5000 GPU only:

```
PS C:\> nvidia-smi -L
GPU 0: Quadro M5000 (UUID: GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385)
GPU 1: NVIDIA RTX A2000 12GB (UUID: GPU-962a842b-b382-6457-65a1-3cffec62ba6f)

PS C:\> $env:OLLAMA_DEBUG="1"; $env:CUDA_VISIBLE_DEVICES="1" ; & "ollama app.exe"

PS C:\> ollama list
NAME                                 ID              SIZE      MODIFIED
qwen3:14b                            7d7da67570e2    9.3 GB    2 hours ago
qwen3:30b-a3b                        2ee832bc15b5    18 GB     5 hours ago
MHKetbi/llm4decompile-22b-v2:q6_K    3565eae14497    18 GB     5 days ago
gemma3:27b-it-qat                    29eb0b9aeda3    18 GB     11 days ago
mistral-small3.1:latest              b9aaf0c2586a    15 GB     2 weeks ago
qwq:latest                           009cb3f08d74    19 GB     2 weeks ago
deepcoder:latest                     12bdda054d23    9.0 GB    2 weeks ago
granite3-dense:8b                    199456d876ee    4.9 GB    5 months ago
mistral-nemo:latest                  994f3b8b7801    7.1 GB    5 months ago
qwen2.5-coder:14b                    3028237cc8c5    9.0 GB    5 months ago
llama3.2-vision:latest               38107a0cd119    7.9 GB    5 months ago

PS C:\> ollama run mistral-nemo:latest --verbose
>>> Send a message (/? for help)
```

Model loaded on the M5000:

![Image](https://github.com/user-attachments/assets/d4d31a94-5820-43c5-8321-66401fc406ec)

Logs:

```
load_all_data: buffer type CUDA_Host is not the default buffer type for device CUDA0 for async uploads
time=2025-05-01T16:16:18.409+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.05"
load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0
time=2025-05-01T16:16:18.661+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.18"
time=2025-05-01T16:16:18.911+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.24"
time=2025-05-01T16:16:19.163+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.32"
time=2025-05-01T16:16:19.414+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.39"
time=2025-05-01T16:16:19.665+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.46"
time=2025-05-01T16:16:19.917+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.53"
time=2025-05-01T16:16:20.168+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.60"
time=2025-05-01T16:16:20.419+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.66"
time=2025-05-01T16:16:20.670+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.73"
time=2025-05-01T16:16:20.921+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.81"
time=2025-05-01T16:16:21.172+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.87"
time=2025-05-01T16:16:21.423+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.94"
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 2048
llama_context: n_ctx_per_seq = 2048
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (2048) < n_ctx_train (1024000) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context:        CPU  output buffer size =     0.52 MiB
llama_context: n_ctx = 2048
llama_context: n_ctx = 2048 (padded)
init: kv_size = 2048, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 40, can_shift = 1
init: layer   0: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU
init: layer   1: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU
init: layer   2: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer   3: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer   4: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer   5: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer   6: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer   7: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer   8: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer   9: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  10: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  11: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  12: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  13: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  14: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  15: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  16: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  17: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  18: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  19: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  20: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  21: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  22: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  23: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  24: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  25: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  26: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  27: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  28: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  29: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  30: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  31: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  32: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  33: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  34: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  35: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  36: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  37: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  38: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  39: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init:      CUDA0 KV buffer size =   304.00 MiB
init:        CPU KV buffer size =    16.00 MiB
llama_context: KV self size  =  320.00 MiB, K (f16):  160.00 MiB, V (f16):  160.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 65536
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context: reserving graph for n_tokens = 1, n_seqs = 1
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context:      CUDA0 compute buffer size =   791.00 MiB
llama_context:  CUDA_Host compute buffer size =    14.01 MiB
llama_context: graph nodes  = 1366
llama_context: graph splits = 26 (with bs=512), 3 (with bs=1)
time=2025-05-01T16:16:21.674+02:00 level=INFO source=server.go:619 msg="llama runner started in 5.79 seconds"
time=2025-05-01T16:16:21.674+02:00 level=DEBUG source=sched.go:464 msg="finished setting up runner" model=C:\Ollama_Models\blobs\sha256-b559938ab7a0392fc9ea9675b82280f2a15669ec3e0e0fc491c9cb0a7681cf94
[GIN] 2025/05/01 - 16:16:21 | 200 |     6.582344s |       127.0.0.1 | POST     "/api/generate"
time=2025-05-01T16:16:21.675+02:00 level=DEBUG source=sched.go:468 msg="context for request finished"
time=2025-05-01T16:16:21.675+02:00 level=DEBUG source=sched.go:341 msg="runner with non-zero duration has gone idle, adding timer" modelPath=C:\Ollama_Models\blobs\sha256-b559938ab7a0392fc9ea9675b82280f2a15669ec3e0e0fc491c9cb0a7681cf94 duration=5m0s
time=2025-05-01T16:16:21.675+02:00 level=DEBUG source=sched.go:359 msg="after processing request finished event" modelPath=C:\Ollama_Models\blobs\sha256-b559938ab7a0392fc9ea9675b82280f2a15669ec3e0e0fc491c9cb0a7681cf94 refCount=0
```

The M5000 seems to work fine.

The problem appears only when both GPUs are used together.


@neoxeo commented on GitHub (May 1, 2025):

If I try to load the same model on the A2000, it works fine too:
![Image](https://github.com/user-attachments/assets/108e9498-a210-4379-8f22-e815f689e5cb)

```
load_all_data: buffer type CUDA_Host is not the default buffer type for device CUDA0 for async uploads
load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0
time=2025-05-01T16:25:21.324+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.05"
time=2025-05-01T16:25:21.575+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.16"
time=2025-05-01T16:25:21.826+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.23"
time=2025-05-01T16:25:22.078+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.30"
time=2025-05-01T16:25:22.329+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.37"
time=2025-05-01T16:25:22.580+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.44"
time=2025-05-01T16:25:22.831+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.50"
time=2025-05-01T16:25:23.083+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.58"
time=2025-05-01T16:25:23.334+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.65"
time=2025-05-01T16:25:23.586+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.71"
time=2025-05-01T16:25:23.837+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.78"
time=2025-05-01T16:25:24.088+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.85"
time=2025-05-01T16:25:24.340+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.92"
time=2025-05-01T16:25:24.591+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.99"
llama_context: constructing llama_context
llama_context: n_seq_max     = 4
llama_context: n_ctx         = 8192
llama_context: n_ctx_per_seq = 2048
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (2048) < n_ctx_train (1024000) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context:  CUDA_Host  output buffer size =     2.08 MiB
llama_context: n_ctx = 8192
llama_context: n_ctx = 8192 (padded)
init: kv_size = 8192, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 40, can_shift = 1
init: layer   0: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer   1: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer   2: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer   3: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer   4: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer   5: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer   6: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer   7: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer   8: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer   9: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  10: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  11: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  12: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  13: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  14: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  15: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  16: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  17: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  18: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  19: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  20: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  21: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  22: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  23: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  24: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  25: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  26: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  27: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  28: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  29: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  30: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  31: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  32: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  33: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  34: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  35: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  36: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  37: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  38: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  39: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init:      CUDA0 KV buffer size =  1280.00 MiB
llama_context: KV self size  = 1280.00 MiB, K (f16):  640.00 MiB, V (f16):  640.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 65536
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context: reserving graph for n_tokens = 1, n_seqs = 1
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context:      CUDA0 compute buffer size =   564.00 MiB
llama_context:  CUDA_Host compute buffer size =    26.01 MiB
llama_context: graph nodes  = 1366
llama_context: graph splits = 2
time=2025-05-01T16:25:24.843+02:00 level=INFO source=server.go:619 msg="llama runner started in 5.28 seconds"
time=2025-05-01T16:25:24.843+02:00 level=DEBUG source=sched.go:464 msg="finished setting up runner" model=C:\Ollama_Models\blobs\sha256-b559938ab7a0392fc9ea9675b82280f2a15669ec3e0e0fc491c9cb0a7681cf94
[GIN] 2025/05/01 - 16:25:24 | 200 |    6.0961547s |       127.0.0.1 | POST     "/api/generate"
time=2025-05-01T16:25:24.843+02:00 level=DEBUG source=sched.go:468 msg="context for request finished"
time=2025-05-01T16:25:24.843+02:00 level=DEBUG source=sched.go:341 msg="runner with non-zero duration has gone idle, adding timer" modelPath=C:\Ollama_Models\blobs\sha256-b559938ab7a0392fc9ea9675b82280f2a15669ec3e0e0fc491c9cb0a7681cf94 duration=5m0s
time=2025-05-01T16:25:24.843+02:00 level=DEBUG source=sched.go:359 msg="after processing request finished event" modelPath=C:\Ollama_Models\blobs\sha256-b559938ab7a0392fc9ea9675b82280f2a15669ec3e0e0fc491c9cb0a7681cf94 refCount=0
```
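For anyone reproducing this single-GPU test, a minimal sketch of the pinning used above (the device index here is an assumption; CUDA's default ordering need not match `nvidia-smi -L`, so verify the mapping on your own machine first):

```
# Sketch: expose a single CUDA device to Ollama, then restart the server.
# The index "0" is hypothetical -- check which index maps to your target
# GPU before relying on it.
$env:OLLAMA_DEBUG = "1"           # keep debug logging on, as elsewhere in this thread
$env:CUDA_VISIBLE_DEVICES = "0"   # hypothetical index of the target GPU
& "ollama app.exe"
```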

@rick-github commented on GitHub (May 1, 2025):

What happens if you change the order of the GPUs? `CUDA_VISIBLE_DEVICES=1,0`
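A hedged side note: numeric indices can be ambiguous, since CUDA's default device order (fastest first) need not match the PCI order that `nvidia-smi -L` prints. `CUDA_VISIBLE_DEVICES` also accepts the GPU UUIDs from `nvidia-smi -L`, and the runner log further down shows Ollama passing exactly that form to its subprocess, so a UUID-based variant of the same experiment might look like:

```
# Sketch: reorder GPUs by UUID rather than index (UUIDs copied from the
# `nvidia-smi -L` output earlier in this thread), then restart the server.
$env:CUDA_VISIBLE_DEVICES = "GPU-962a842b-b382-6457-65a1-3cffec62ba6f,GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385"
& "ollama app.exe"
```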


@neoxeo commented on GitHub (May 1, 2025):

PS C:\> $env:OLLAMA_DEBUG="1"; $env:CUDA_VISIBLE_DEVICES="1,0" ; & "ollama app.exe"
PS C:\> ollama run qwen3:30b-a3b --verbose
=> Error
![Image](https://github.com/user-attachments/assets/95742424-6beb-46a9-9d01-96858e743e78)

```
2025/05/01 16:29:01 routes.go:1232: INFO server config env="map[CUDA_VISIBLE_DEVICES:1,0 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\\Ollama_Models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-05-01T16:29:01.474+02:00 level=INFO source=images.go:458 msg="total blobs: 52"
time=2025-05-01T16:29:01.496+02:00 level=INFO source=images.go:465 msg="total unused blobs removed: 0"
time=2025-05-01T16:29:01.508+02:00 level=INFO source=routes.go:1299 msg="Listening on 127.0.0.1:11434 (version 0.6.6)"
time=2025-05-01T16:29:01.508+02:00 level=DEBUG source=sched.go:107 msg="starting llm scheduler"
time=2025-05-01T16:29:01.508+02:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-05-01T16:29:01.514+02:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-05-01T16:29:01.514+02:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=14 efficiency=0 threads=28
time=2025-05-01T16:29:01.514+02:00 level=DEBUG source=gpu.go:98 msg="searching for GPU discovery libraries for NVIDIA"
time=2025-05-01T16:29:01.514+02:00 level=DEBUG source=gpu.go:501 msg="Searching for GPU library" name=nvml.dll
time=2025-05-01T16:29:01.514+02:00 level=DEBUG source=gpu.go:525 msg="gpu library search" globs="[C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\nvml.dll C:\\Program Files (x86)\\VMware\\VMware Workstation\\bin\\nvml.dll C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\iCLS\\nvml.dll C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\iCLS\\nvml.dll C:\\Windows\\system32\\nvml.dll C:\\Windows\\nvml.dll C:\\Windows\\System32\\Wbem\\nvml.dll C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\nvml.dll C:\\Windows\\System32\\OpenSSH\\nvml.dll C:\\Program Files\\Git\\cmd\\nvml.dll C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\DAL\\nvml.dll C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\DAL\\nvml.dll C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\IPT\\nvml.dll C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\IPT\\nvml.dll C:\\Program Files\\dotnet\\nvml.dll C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvml.dll C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Eclipse Adoptium\\jdk-21.0.7.6-hotspot\\bin\\nvml.dll C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Python\\Launcher\\nvml.dll C:\\Users\\Testeur1\\AppData\\Local\\Microsoft\\WindowsApps\\nvml.dll C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\nvml.dll C:\\Users\\Testeur1\\AppData\\Local\\GitHubDesktop\\bin\\nvml.dll C:\\Users\\Testeur1\\.dotnet\\tools\\nvml.dll C:\\Python\\miniforge3\\nvml.dll C:\\Python\\miniforge3\\Scripts\\nvml.dll C:\\nvml.dll c:\\Windows\\System32\\nvml.dll]"
time=2025-05-01T16:29:01.521+02:00 level=DEBUG source=gpu.go:529 msg="skipping PhysX cuda library path" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvml.dll"
time=2025-05-01T16:29:01.526+02:00 level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths="[C:\\Windows\\system32\\nvml.dll c:\\Windows\\System32\\nvml.dll]"
time=2025-05-01T16:29:01.577+02:00 level=DEBUG source=gpu.go:111 msg="nvidia-ml loaded" library=C:\Windows\system32\nvml.dll
time=2025-05-01T16:29:01.577+02:00 level=DEBUG source=gpu.go:501 msg="Searching for GPU library" name=nvcuda.dll
time=2025-05-01T16:29:01.579+02:00 level=DEBUG source=gpu.go:525 msg="gpu library search" globs="[C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\nvcuda.dll C:\\Program Files (x86)\\VMware\\VMware Workstation\\bin\\nvcuda.dll C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\iCLS\\nvcuda.dll C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\iCLS\\nvcuda.dll C:\\Windows\\system32\\nvcuda.dll C:\\Windows\\nvcuda.dll C:\\Windows\\System32\\Wbem\\nvcuda.dll C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\nvcuda.dll C:\\Windows\\System32\\OpenSSH\\nvcuda.dll C:\\Program Files\\Git\\cmd\\nvcuda.dll C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\DAL\\nvcuda.dll C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\DAL\\nvcuda.dll C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\IPT\\nvcuda.dll C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\IPT\\nvcuda.dll C:\\Program Files\\dotnet\\nvcuda.dll C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvcuda.dll C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Eclipse Adoptium\\jdk-21.0.7.6-hotspot\\bin\\nvcuda.dll C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Python\\Launcher\\nvcuda.dll C:\\Users\\Testeur1\\AppData\\Local\\Microsoft\\WindowsApps\\nvcuda.dll C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\nvcuda.dll C:\\Users\\Testeur1\\AppData\\Local\\GitHubDesktop\\bin\\nvcuda.dll C:\\Users\\Testeur1\\.dotnet\\tools\\nvcuda.dll C:\\Python\\miniforge3\\nvcuda.dll C:\\Python\\miniforge3\\Scripts\\nvcuda.dll C:\\nvcuda.dll c:\\windows\\system*\\nvcuda.dll]"
time=2025-05-01T16:29:01.587+02:00 level=DEBUG source=gpu.go:529 msg="skipping PhysX cuda library path" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvcuda.dll"
time=2025-05-01T16:29:01.594+02:00 level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths=[C:\Windows\system32\nvcuda.dll]
initializing C:\Windows\system32\nvcuda.dll
dlsym: cuInit - 00007FFACF2E1F80
dlsym: cuDriverGetVersion - 00007FFACF2E2020
dlsym: cuDeviceGetCount - 00007FFACF2E2816
dlsym: cuDeviceGet - 00007FFACF2E2810
dlsym: cuDeviceGetAttribute - 00007FFACF2E2170
dlsym: cuDeviceGetUuid - 00007FFACF2E2822
dlsym: cuDeviceGetName - 00007FFACF2E281C
dlsym: cuCtxCreate_v3 - 00007FFACF2E2894
dlsym: cuMemGetInfo_v2 - 00007FFACF2E2996
dlsym: cuCtxDestroy - 00007FFACF2E28A6
calling cuInit
calling cuDriverGetVersion
raw version 0x2f3a
CUDA driver version: 12.9
calling cuDeviceGetCount
device count 2
time=2025-05-01T16:29:01.647+02:00 level=DEBUG source=gpu.go:125 msg="detected GPUs" count=2 library=C:\Windows\system32\nvcuda.dll
[GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385] CUDA totalMem 8191 mb
[GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385] CUDA freeMem 7296 mb
[GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385] Compute Capability 5.2
[GPU-962a842b-b382-6457-65a1-3cffec62ba6f] CUDA totalMem 12281 mb
[GPU-962a842b-b382-6457-65a1-3cffec62ba6f] CUDA freeMem 11248 mb
[GPU-962a842b-b382-6457-65a1-3cffec62ba6f] Compute Capability 8.6
time=2025-05-01T16:29:02.011+02:00 level=INFO source=gpu.go:319 msg="detected OS VRAM overhead" id=GPU-962a842b-b382-6457-65a1-3cffec62ba6f library=cuda compute=8.6 driver=12.9 name="NVIDIA RTX A2000 12GB" overhead="862.0 MiB"
time=2025-05-01T16:29:02.021+02:00 level=DEBUG source=amd_windows.go:34 msg="unable to load amdhip64_6.dll, please make sure to upgrade to the latest amd driver: Le module spécifié est introuvable."
releasing cuda driver library
releasing nvml library
time=2025-05-01T16:29:02.023+02:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 library=cuda variant=v12 compute=5.2 driver=12.9 name="Quadro M5000" total="8.0 GiB" available="7.1 GiB"
time=2025-05-01T16:29:02.023+02:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-962a842b-b382-6457-65a1-3cffec62ba6f library=cuda variant=v12 compute=8.6 driver=12.9 name="NVIDIA RTX A2000 12GB" total="12.0 GiB" available="11.0 GiB"
[GIN] 2025/05/01 - 16:29:18 | 200 |            0s |       127.0.0.1 | HEAD     "/"
time=2025-05-01T16:29:18.701+02:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-01T16:29:18.761+02:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
[GIN] 2025/05/01 - 16:29:18 | 200 |    139.6526ms |       127.0.0.1 | POST     "/api/show"
time=2025-05-01T16:29:19.016+02:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-01T16:29:19.027+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="118.8 GiB" before.free_swap="125.1 GiB" now.total="127.9 GiB" now.free="118.7 GiB" now.free_swap="125.0 GiB"
time=2025-05-01T16:29:19.054+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="0 B" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="908.4 MiB"
time=2025-05-01T16:29:19.070+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="862.0 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="172.0 MiB"
releasing nvml library
time=2025-05-01T16:29:19.072+02:00 level=DEBUG source=sched.go:183 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=6 gpu_count=2
time=2025-05-01T16:29:19.103+02:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-01T16:29:19.164+02:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-01T16:29:19.176+02:00 level=DEBUG source=sched.go:226 msg="loading first model" model=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
time=2025-05-01T16:29:19.176+02:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[11.0 GiB]"
time=2025-05-01T16:29:19.176+02:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3moe.vision.block_count default=0
time=2025-05-01T16:29:19.178+02:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[7.1 GiB]"
time=2025-05-01T16:29:19.178+02:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3moe.vision.block_count default=0
time=2025-05-01T16:29:19.180+02:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[11.0 GiB]"
time=2025-05-01T16:29:19.180+02:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3moe.vision.block_count default=0
time=2025-05-01T16:29:19.181+02:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[7.1 GiB]"
time=2025-05-01T16:29:19.181+02:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3moe.vision.block_count default=0
time=2025-05-01T16:29:19.183+02:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=2 available="[11.0 GiB 7.1 GiB]"
time=2025-05-01T16:29:19.183+02:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3moe.vision.block_count default=0
time=2025-05-01T16:29:19.184+02:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=2 available="[11.0 GiB 7.1 GiB]"
time=2025-05-01T16:29:19.185+02:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3moe.vision.block_count default=0
time=2025-05-01T16:29:19.187+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="118.7 GiB" before.free_swap="125.0 GiB" now.total="127.9 GiB" now.free="118.7 GiB" now.free_swap="125.0 GiB"
time=2025-05-01T16:29:19.211+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="0 B" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="908.4 MiB"
time=2025-05-01T16:29:19.227+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="862.0 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="172.0 MiB"
releasing nvml library
time=2025-05-01T16:29:19.229+02:00 level=INFO source=server.go:105 msg="system memory" total="127.9 GiB" free="118.7 GiB" free_swap="125.0 GiB"
time=2025-05-01T16:29:19.229+02:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=2 available="[7.1 GiB 11.0 GiB]"
time=2025-05-01T16:29:19.229+02:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3moe.vision.block_count default=0
time=2025-05-01T16:29:19.231+02:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=49 layers.offload=43 layers.split=16,27 memory.available="[7.1 GiB 11.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="19.5 GiB" memory.required.partial="17.4 GiB" memory.required.kv="192.0 MiB" memory.required.allocations="[6.7 GiB 10.6 GiB]" memory.weights.total="17.2 GiB" memory.weights.repeating="16.9 GiB" memory.weights.nonrepeating="243.4 MiB" memory.graph.full="256.0 MiB" memory.graph.partial="256.0 MiB"
time=2025-05-01T16:29:19.232+02:00 level=DEBUG source=server.go:262 msg="compatible gpu libraries" compatible="[cuda_v12 cuda_v11]"
llama_model_loader: loaded meta data with 31 key-value pairs and 579 tensors from C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 30B A3B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3
llama_model_loader: - kv   4:                         general.size_label str              = 30B-A3B
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                       qwen3moe.block_count u32              = 48
llama_model_loader: - kv   7:                    qwen3moe.context_length u32              = 40960
llama_model_loader: - kv   8:                  qwen3moe.embedding_length u32              = 2048
llama_model_loader: - kv   9:               qwen3moe.feed_forward_length u32              = 6144
llama_model_loader: - kv  10:              qwen3moe.attention.head_count u32              = 32
llama_model_loader: - kv  11:           qwen3moe.attention.head_count_kv u32              = 4
llama_model_loader: - kv  12:                    qwen3moe.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                 qwen3moe.expert_used_count u32              = 8
llama_model_loader: - kv  15:              qwen3moe.attention.key_length u32              = 128
llama_model_loader: - kv  16:            qwen3moe.attention.value_length u32              = 128
llama_model_loader: - kv  17:                      qwen3moe.expert_count u32              = 128
llama_model_loader: - kv  18:        qwen3moe.expert_feed_forward_length u32              = 768
llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  22:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  23:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  24:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  25:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  27:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - kv  30:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  241 tensors
llama_model_loader: - type  f16:   48 tensors
llama_model_loader: - type q4_K:  265 tensors
llama_model_loader: - type q6_K:   25 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 17.34 GiB (4.88 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3moe
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 30.53 B
print_info: general.name     = Qwen3 30B A3B
print_info: n_ff_exp         = 0
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-05-01T16:29:19.697+02:00 level=DEBUG source=server.go:335 msg="adding gpu library" path=C:\Users\Testeur1\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12
time=2025-05-01T16:29:19.697+02:00 level=DEBUG source=server.go:343 msg="adding gpu dependency paths" paths=[C:\Users\Testeur1\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12]
time=2025-05-01T16:29:19.697+02:00 level=INFO source=server.go:405 msg="starting llama server" cmd="C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Ollama_Models\\blobs\\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac --ctx-size 2048 --batch-size 512 --n-gpu-layers 43 --verbose --threads 14 --no-mmap --parallel 1 --tensor-split 16,27 --port 50009"
time=2025-05-01T16:29:19.697+02:00 level=DEBUG source=server.go:423 msg=subprocess environment="[CUDA_VISIBLE_DEVICES=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385,GPU-962a842b-b382-6457-65a1-3cffec62ba6f PATH=C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v12;C:\\Program Files (x86)\\VMware\\VMware Workstation\\bin\\;C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\iCLS\\;C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\iCLS\\;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\Git\\cmd;C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\DAL;C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\DAL;C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\IPT;C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\IPT;C:\\Program Files\\dotnet\\;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Eclipse Adoptium\\jdk-21.0.7.6-hotspot\\bin;C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Python\\Launcher\\;C:\\Users\\Testeur1\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama;C:\\Users\\Testeur1\\AppData\\Local\\GitHubDesktop\\bin;C:\\Users\\Testeur1\\.dotnet\\tools;C:\\Python\\miniforge3;C:\\Python\\miniforge3\\Scripts;;C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v12;C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\lib\\ollama]"
time=2025-05-01T16:29:19.716+02:00 level=INFO source=sched.go:451 msg="loaded runners" count=1
time=2025-05-01T16:29:19.716+02:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
time=2025-05-01T16:29:19.718+02:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
time=2025-05-01T16:29:19.795+02:00 level=INFO source=runner.go:853 msg="starting go runner"
time=2025-05-01T16:29:19.951+02:00 level=DEBUG source=ggml.go:99 msg="ggml backend load all from path" path=C:\Users\Testeur1\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: Quadro M5000, compute capability 5.2, VMM: yes
  Device 1: NVIDIA RTX A2000 12GB, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from C:\Users\Testeur1\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12\ggml-cuda.dll
time=2025-05-01T16:29:20.423+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files (x86)\\VMware\\VMware Workstation\\bin"
time=2025-05-01T16:29:20.430+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\iCLS"
time=2025-05-01T16:29:20.430+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\iCLS"
time=2025-05-01T16:29:20.430+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Windows\system32
time=2025-05-01T16:29:20.430+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Windows
time=2025-05-01T16:29:20.430+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Windows\System32\Wbem
time=2025-05-01T16:29:20.430+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Windows\System32\WindowsPowerShell\v1.0
time=2025-05-01T16:29:20.430+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Windows\System32\OpenSSH
time=2025-05-01T16:29:20.430+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files\\Git\\cmd"
time=2025-05-01T16:29:20.430+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\DAL"
time=2025-05-01T16:29:20.430+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\DAL"
time=2025-05-01T16:29:20.430+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\IPT"
time=2025-05-01T16:29:20.430+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\IPT"
time=2025-05-01T16:29:20.430+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files\\dotnet"
time=2025-05-01T16:29:20.430+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common"
time=2025-05-01T16:29:20.430+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Eclipse Adoptium\\jdk-21.0.7.6-hotspot\\bin"
time=2025-05-01T16:29:20.430+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Users\Testeur1\AppData\Local\Programs\Python\Launcher
time=2025-05-01T16:29:20.430+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Users\Testeur1\AppData\Local\Microsoft\WindowsApps
time=2025-05-01T16:29:20.430+02:00 level=DEBUG source=ggml.go:99 msg="ggml backend load all from path" path=C:\Users\Testeur1\AppData\Local\Programs\Ollama
time=2025-05-01T16:29:20.440+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Users\Testeur1\AppData\Local\GitHubDesktop\bin
time=2025-05-01T16:29:20.440+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Users\Testeur1\.dotnet\tools
time=2025-05-01T16:29:20.440+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Python\miniforge3
time=2025-05-01T16:29:20.440+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Python\miniforge3\Scripts
time=2025-05-01T16:29:20.440+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\
time=2025-05-01T16:29:20.440+02:00 level=DEBUG source=ggml.go:99 msg="ggml backend load all from path" path=C:\Users\Testeur1\AppData\Local\Programs\Ollama\lib\ollama
load_backend: loaded CPU backend from C:\Users\Testeur1\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll
time=2025-05-01T16:29:20.454+02:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-05-01T16:29:20.473+02:00 level=INFO source=runner.go:913 msg="Server listening on 127.0.0.1:50009"
time=2025-05-01T16:29:20.476+02:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
llama_model_load_from_file_impl: using device CUDA0 (Quadro M5000) - 7296 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA RTX A2000 12GB) - 11248 MiB free
llama_model_loader: loaded meta data with 31 key-value pairs and 579 tensors from C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 30B A3B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3
llama_model_loader: - kv   4:                         general.size_label str              = 30B-A3B
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                       qwen3moe.block_count u32              = 48
llama_model_loader: - kv   7:                    qwen3moe.context_length u32              = 40960
llama_model_loader: - kv   8:                  qwen3moe.embedding_length u32              = 2048
llama_model_loader: - kv   9:               qwen3moe.feed_forward_length u32              = 6144
llama_model_loader: - kv  10:              qwen3moe.attention.head_count u32              = 32
llama_model_loader: - kv  11:           qwen3moe.attention.head_count_kv u32              = 4
llama_model_loader: - kv  12:                    qwen3moe.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                 qwen3moe.expert_used_count u32              = 8
llama_model_loader: - kv  15:              qwen3moe.attention.key_length u32              = 128
llama_model_loader: - kv  16:            qwen3moe.attention.value_length u32              = 128
llama_model_loader: - kv  17:                      qwen3moe.expert_count u32              = 128
llama_model_loader: - kv  18:        qwen3moe.expert_feed_forward_length u32              = 768
llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  22:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  23:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  24:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  25:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  27:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - kv  30:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  241 tensors
llama_model_loader: - type  f16:   48 tensors
llama_model_loader: - type q4_K:  265 tensors
llama_model_loader: - type q6_K:   25 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 17.34 GiB (4.88 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3moe
print_info: vocab_only       = 0
print_info: n_ctx_train      = 40960
print_info: n_embd           = 2048
print_info: n_layer          = 48
print_info: n_head           = 32
print_info: n_head_kv        = 4
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 6144
print_info: n_expert         = 128
print_info: n_expert_used    = 8
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 40960
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = ?B
print_info: model params     = 30.53 B
print_info: general.name     = Qwen3 30B A3B
print_info: n_ff_exp         = 768
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: layer   0 assigned to device CPU, is_swa = 0
load_tensors: layer   1 assigned to device CPU, is_swa = 0
load_tensors: layer   2 assigned to device CPU, is_swa = 0
load_tensors: layer   3 assigned to device CPU, is_swa = 0
load_tensors: layer   4 assigned to device CPU, is_swa = 0
load_tensors: layer   5 assigned to device CUDA0, is_swa = 0
load_tensors: layer   6 assigned to device CUDA0, is_swa = 0
load_tensors: layer   7 assigned to device CUDA0, is_swa = 0
load_tensors: layer   8 assigned to device CUDA0, is_swa = 0
load_tensors: layer   9 assigned to device CUDA0, is_swa = 0
load_tensors: layer  10 assigned to device CUDA0, is_swa = 0
load_tensors: layer  11 assigned to device CUDA0, is_swa = 0
load_tensors: layer  12 assigned to device CUDA0, is_swa = 0
load_tensors: layer  13 assigned to device CUDA0, is_swa = 0
load_tensors: layer  14 assigned to device CUDA0, is_swa = 0
load_tensors: layer  15 assigned to device CUDA0, is_swa = 0
load_tensors: layer  16 assigned to device CUDA0, is_swa = 0
load_tensors: layer  17 assigned to device CUDA0, is_swa = 0
load_tensors: layer  18 assigned to device CUDA0, is_swa = 0
load_tensors: layer  19 assigned to device CUDA0, is_swa = 0
load_tensors: layer  20 assigned to device CUDA0, is_swa = 0
load_tensors: layer  21 assigned to device CUDA1, is_swa = 0
load_tensors: layer  22 assigned to device CUDA1, is_swa = 0
load_tensors: layer  23 assigned to device CUDA1, is_swa = 0
load_tensors: layer  24 assigned to device CUDA1, is_swa = 0
load_tensors: layer  25 assigned to device CUDA1, is_swa = 0
load_tensors: layer  26 assigned to device CUDA1, is_swa = 0
load_tensors: layer  27 assigned to device CUDA1, is_swa = 0
load_tensors: layer  28 assigned to device CUDA1, is_swa = 0
load_tensors: layer  29 assigned to device CUDA1, is_swa = 0
load_tensors: layer  30 assigned to device CUDA1, is_swa = 0
load_tensors: layer  31 assigned to device CUDA1, is_swa = 0
load_tensors: layer  32 assigned to device CUDA1, is_swa = 0
load_tensors: layer  33 assigned to device CUDA1, is_swa = 0
load_tensors: layer  34 assigned to device CUDA1, is_swa = 0
load_tensors: layer  35 assigned to device CUDA1, is_swa = 0
load_tensors: layer  36 assigned to device CUDA1, is_swa = 0
load_tensors: layer  37 assigned to device CUDA1, is_swa = 0
load_tensors: layer  38 assigned to device CUDA1, is_swa = 0
load_tensors: layer  39 assigned to device CUDA1, is_swa = 0
load_tensors: layer  40 assigned to device CUDA1, is_swa = 0
load_tensors: layer  41 assigned to device CUDA1, is_swa = 0
load_tensors: layer  42 assigned to device CUDA1, is_swa = 0
load_tensors: layer  43 assigned to device CUDA1, is_swa = 0
load_tensors: layer  44 assigned to device CUDA1, is_swa = 0
load_tensors: layer  45 assigned to device CUDA1, is_swa = 0
load_tensors: layer  46 assigned to device CUDA1, is_swa = 0
load_tensors: layer  47 assigned to device CUDA1, is_swa = 0
load_tensors: layer  48 assigned to device CPU, is_swa = 0
load_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
load_tensors: offloading 43 repeating layers to GPU
load_tensors: offloaded 43/49 layers to GPU
load_tensors:    CUDA_Host model buffer size =  2173.83 MiB
load_tensors:        CUDA0 model buffer size =  5682.27 MiB
load_tensors:        CUDA1 model buffer size =  9731.14 MiB
load_tensors:          CPU model buffer size =   166.92 MiB
load_all_data: buffer type CUDA_Host is not the default buffer type for device CUDA0 for async uploads
time=2025-05-01T16:29:21.983+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.01"
time=2025-05-01T16:29:22.234+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.04"
time=2025-05-01T16:29:22.485+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.06"
time=2025-05-01T16:29:22.737+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.10"
load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0
time=2025-05-01T16:29:22.988+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.12"
time=2025-05-01T16:34:23.156+02:00 level=ERROR source=sched.go:457 msg="error loading llama server" error="timed out waiting for llama runner to start - progress 0.12 - "
time=2025-05-01T16:34:23.156+02:00 level=DEBUG source=sched.go:460 msg="triggering expiration for failed load" model=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
time=2025-05-01T16:34:23.156+02:00 level=DEBUG source=sched.go:362 msg="runner expired event received" modelPath=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
time=2025-05-01T16:34:23.157+02:00 level=DEBUG source=sched.go:377 msg="got lock to unload" modelPath=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
[GIN] 2025/05/01 - 16:34:23 | 500 |          5m4s |       127.0.0.1 | POST     "/api/generate"
time=2025-05-01T16:34:23.157+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="118.7 GiB" before.free_swap="125.0 GiB" now.total="127.9 GiB" now.free="116.4 GiB" now.free_swap="107.3 GiB"
time=2025-05-01T16:34:23.176+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="0 B" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="1.5 GiB" now.used="6.5 GiB"
time=2025-05-01T16:34:23.191+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="862.0 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="1.4 GiB" now.used="9.8 GiB"
releasing nvml library
time=2025-05-01T16:34:23.223+02:00 level=DEBUG source=server.go:1001 msg="stopping llama server"
time=2025-05-01T16:34:23.232+02:00 level=DEBUG source=server.go:1007 msg="waiting for llama server to exit"
time=2025-05-01T16:34:23.444+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="116.4 GiB" before.free_swap="107.3 GiB" now.total="127.9 GiB" now.free="116.4 GiB" now.free_swap="117.3 GiB"
time=2025-05-01T16:34:23.617+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="0 B" before.total="8.0 GiB" before.free="1.5 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="934.5 MiB"
time=2025-05-01T16:34:23.633+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="862.0 MiB" before.total="12.0 GiB" before.free="1.4 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="172.0 MiB"
releasing nvml library
time=2025-05-01T16:34:23.635+02:00 level=DEBUG source=sched.go:661 msg="gpu VRAM free memory converged after 0.48 seconds" model=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
time=2025-05-01T16:34:23.762+02:00 level=DEBUG source=server.go:1011 msg="llama server stopped"
time=2025-05-01T16:34:23.762+02:00 level=DEBUG source=sched.go:382 msg="runner released" modelPath=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
time=2025-05-01T16:34:23.762+02:00 level=DEBUG source=sched.go:386 msg="sending an unloaded event" modelPath=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
time=2025-05-01T16:34:23.762+02:00 level=DEBUG source=sched.go:310 msg="ignoring unload event with no pending requests"
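
Side note: the load sits at the same progress fraction for five minutes before the scheduler gives up, which matches `OLLAMA_LOAD_TIMEOUT:5m0s` in the server config dump above. To distinguish a slow load from a hung one, the timeout can be raised before restarting; a minimal sketch for the same PowerShell session, where `15m` is just an arbitrary test value, not a recommendation:

```
# Sketch: raise the runner load timeout, then restart the server and
# retry the model. OLLAMA_LOAD_TIMEOUT is taken from the config dump
# above; the 15m value is an assumption chosen for testing.
$env:OLLAMA_DEBUG="1"
$env:OLLAMA_LOAD_TIMEOUT="15m"
& "ollama app.exe"
```

If the progress counter keeps climbing past the old five-minute mark, the load was merely slow; if it freezes at the same fraction, the upload itself is stuck.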

I'll now try CUDA_VISIBLE_DEVICES="0,1".
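
For reference, which physical card the indices `0` and `1` resolve to can be cross-checked with `nvidia-smi` (assuming it is on PATH, as it normally is with the NVIDIA driver). Note that `nvidia-smi` lists devices in PCI bus order, which may differ from CUDA's default enumeration, so matching by UUID is safer than by index:

```
# Sketch: print each GPU's index, name and UUID so the ordering set by
# CUDA_VISIBLE_DEVICES can be matched against the GPU UUIDs in the logs
# (GPU-8c1fc6fe-... = Quadro M5000, GPU-962a842b-... = RTX A2000 12GB).
nvidia-smi --query-gpu=index,name,uuid --format=csv
```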

Author
Owner

@neoxeo commented on GitHub (May 1, 2025):

PS C:\> $env:OLLAMA_DEBUG="1"; $env:CUDA_VISIBLE_DEVICES="0,1" ; & "ollama app.exe"
PS C:\> ollama run qwen3:30b-a3b --verbose
=> Error
[Image: error screenshot]

Logs:

2025/05/01 16:40:02 routes.go:1232: INFO server config env="map[CUDA_VISIBLE_DEVICES:0,1 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\\Ollama_Models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-05-01T16:40:02.855+02:00 level=INFO source=images.go:458 msg="total blobs: 52"
time=2025-05-01T16:40:02.874+02:00 level=INFO source=images.go:465 msg="total unused blobs removed: 0"
time=2025-05-01T16:40:02.893+02:00 level=INFO source=routes.go:1299 msg="Listening on 127.0.0.1:11434 (version 0.6.6)"
time=2025-05-01T16:40:02.893+02:00 level=DEBUG source=sched.go:107 msg="starting llm scheduler"
time=2025-05-01T16:40:02.894+02:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-05-01T16:40:02.894+02:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-05-01T16:40:02.894+02:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=14 efficiency=0 threads=28
time=2025-05-01T16:40:02.894+02:00 level=DEBUG source=gpu.go:98 msg="searching for GPU discovery libraries for NVIDIA"
time=2025-05-01T16:40:02.894+02:00 level=DEBUG source=gpu.go:501 msg="Searching for GPU library" name=nvml.dll
time=2025-05-01T16:40:02.894+02:00 level=DEBUG source=gpu.go:525 msg="gpu library search" globs="[C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\nvml.dll C:\\Program Files (x86)\\VMware\\VMware Workstation\\bin\\nvml.dll C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\iCLS\\nvml.dll C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\iCLS\\nvml.dll C:\\Windows\\system32\\nvml.dll C:\\Windows\\nvml.dll C:\\Windows\\System32\\Wbem\\nvml.dll C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\nvml.dll C:\\Windows\\System32\\OpenSSH\\nvml.dll C:\\Program Files\\Git\\cmd\\nvml.dll C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\DAL\\nvml.dll C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\DAL\\nvml.dll C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\IPT\\nvml.dll C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\IPT\\nvml.dll C:\\Program Files\\dotnet\\nvml.dll C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvml.dll C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Eclipse Adoptium\\jdk-21.0.7.6-hotspot\\bin\\nvml.dll C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Python\\Launcher\\nvml.dll C:\\Users\\Testeur1\\AppData\\Local\\Microsoft\\WindowsApps\\nvml.dll C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\nvml.dll C:\\Users\\Testeur1\\AppData\\Local\\GitHubDesktop\\bin\\nvml.dll C:\\Users\\Testeur1\\.dotnet\\tools\\nvml.dll C:\\Python\\miniforge3\\nvml.dll C:\\Python\\miniforge3\\Scripts\\nvml.dll C:\\nvml.dll c:\\Windows\\System32\\nvml.dll]"
time=2025-05-01T16:40:02.901+02:00 level=DEBUG source=gpu.go:529 msg="skipping PhysX cuda library path" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvml.dll"
time=2025-05-01T16:40:02.906+02:00 level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths="[C:\\Windows\\system32\\nvml.dll c:\\Windows\\System32\\nvml.dll]"
time=2025-05-01T16:40:02.965+02:00 level=DEBUG source=gpu.go:111 msg="nvidia-ml loaded" library=C:\Windows\system32\nvml.dll
time=2025-05-01T16:40:02.965+02:00 level=DEBUG source=gpu.go:501 msg="Searching for GPU library" name=nvcuda.dll
time=2025-05-01T16:40:02.972+02:00 level=DEBUG source=gpu.go:525 msg="gpu library search" globs="[C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\nvcuda.dll C:\\Program Files (x86)\\VMware\\VMware Workstation\\bin\\nvcuda.dll C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\iCLS\\nvcuda.dll C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\iCLS\\nvcuda.dll C:\\Windows\\system32\\nvcuda.dll C:\\Windows\\nvcuda.dll C:\\Windows\\System32\\Wbem\\nvcuda.dll C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\nvcuda.dll C:\\Windows\\System32\\OpenSSH\\nvcuda.dll C:\\Program Files\\Git\\cmd\\nvcuda.dll C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\DAL\\nvcuda.dll C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\DAL\\nvcuda.dll C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\IPT\\nvcuda.dll C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\IPT\\nvcuda.dll C:\\Program Files\\dotnet\\nvcuda.dll C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvcuda.dll C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Eclipse Adoptium\\jdk-21.0.7.6-hotspot\\bin\\nvcuda.dll C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Python\\Launcher\\nvcuda.dll C:\\Users\\Testeur1\\AppData\\Local\\Microsoft\\WindowsApps\\nvcuda.dll C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\nvcuda.dll C:\\Users\\Testeur1\\AppData\\Local\\GitHubDesktop\\bin\\nvcuda.dll C:\\Users\\Testeur1\\.dotnet\\tools\\nvcuda.dll C:\\Python\\miniforge3\\nvcuda.dll C:\\Python\\miniforge3\\Scripts\\nvcuda.dll C:\\nvcuda.dll c:\\windows\\system*\\nvcuda.dll]"
time=2025-05-01T16:40:02.979+02:00 level=DEBUG source=gpu.go:529 msg="skipping PhysX cuda library path" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvcuda.dll"
time=2025-05-01T16:40:02.984+02:00 level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths=[C:\Windows\system32\nvcuda.dll]
initializing C:\Windows\system32\nvcuda.dll
dlsym: cuInit - 00007FFACF351F80
dlsym: cuDriverGetVersion - 00007FFACF352020
dlsym: cuDeviceGetCount - 00007FFACF352816
dlsym: cuDeviceGet - 00007FFACF352810
dlsym: cuDeviceGetAttribute - 00007FFACF352170
dlsym: cuDeviceGetUuid - 00007FFACF352822
dlsym: cuDeviceGetName - 00007FFACF35281C
dlsym: cuCtxCreate_v3 - 00007FFACF352894
dlsym: cuMemGetInfo_v2 - 00007FFACF352996
dlsym: cuCtxDestroy - 00007FFACF3528A6
calling cuInit
calling cuDriverGetVersion
raw version 0x2f3a
CUDA driver version: 12.9
calling cuDeviceGetCount
device count 2
time=2025-05-01T16:40:03.028+02:00 level=DEBUG source=gpu.go:125 msg="detected GPUs" count=2 library=C:\Windows\system32\nvcuda.dll
[GPU-962a842b-b382-6457-65a1-3cffec62ba6f] CUDA totalMem 12281 mb
[GPU-962a842b-b382-6457-65a1-3cffec62ba6f] CUDA freeMem 11248 mb
[GPU-962a842b-b382-6457-65a1-3cffec62ba6f] Compute Capability 8.6
time=2025-05-01T16:40:03.251+02:00 level=INFO source=gpu.go:319 msg="detected OS VRAM overhead" id=GPU-962a842b-b382-6457-65a1-3cffec62ba6f library=cuda compute=8.6 driver=12.9 name="NVIDIA RTX A2000 12GB" overhead="862.0 MiB"
[GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385] CUDA totalMem 8191 mb
[GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385] CUDA freeMem 7296 mb
[GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385] Compute Capability 5.2
time=2025-05-01T16:40:03.415+02:00 level=DEBUG source=amd_windows.go:34 msg="unable to load amdhip64_6.dll, please make sure to upgrade to the latest amd driver: Le module spécifié est introuvable."
releasing cuda driver library
releasing nvml library
time=2025-05-01T16:40:03.419+02:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-962a842b-b382-6457-65a1-3cffec62ba6f library=cuda variant=v12 compute=8.6 driver=12.9 name="NVIDIA RTX A2000 12GB" total="12.0 GiB" available="11.0 GiB"
time=2025-05-01T16:40:03.419+02:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 library=cuda variant=v12 compute=5.2 driver=12.9 name="Quadro M5000" total="8.0 GiB" available="7.1 GiB"
[GIN] 2025/05/01 - 16:40:09 | 200 |            0s |       127.0.0.1 | HEAD     "/"
time=2025-05-01T16:40:09.501+02:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-01T16:40:09.543+02:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
[GIN] 2025/05/01 - 16:40:09 | 200 |    119.4624ms |       127.0.0.1 | POST     "/api/show"
time=2025-05-01T16:40:09.779+02:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-01T16:40:09.792+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="118.2 GiB" before.free_swap="124.4 GiB" now.total="127.9 GiB" now.free="118.2 GiB" now.free_swap="124.4 GiB"
time=2025-05-01T16:40:09.823+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="862.0 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="172.0 MiB"
time=2025-05-01T16:40:09.824+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="0 B" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.0 GiB" now.used="986.6 MiB"
releasing nvml library
time=2025-05-01T16:40:09.825+02:00 level=DEBUG source=sched.go:183 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=6 gpu_count=2
time=2025-05-01T16:40:09.858+02:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-01T16:40:09.930+02:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-01T16:40:09.946+02:00 level=DEBUG source=sched.go:226 msg="loading first model" model=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
time=2025-05-01T16:40:09.946+02:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[11.0 GiB]"
time=2025-05-01T16:40:09.946+02:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3moe.vision.block_count default=0
time=2025-05-01T16:40:09.947+02:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[7.0 GiB]"
time=2025-05-01T16:40:09.947+02:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3moe.vision.block_count default=0
time=2025-05-01T16:40:09.948+02:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[11.0 GiB]"
time=2025-05-01T16:40:09.948+02:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3moe.vision.block_count default=0
time=2025-05-01T16:40:09.950+02:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[7.0 GiB]"
time=2025-05-01T16:40:09.950+02:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3moe.vision.block_count default=0
time=2025-05-01T16:40:09.950+02:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=2 available="[11.0 GiB 7.0 GiB]"
time=2025-05-01T16:40:09.950+02:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3moe.vision.block_count default=0
time=2025-05-01T16:40:09.952+02:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=2 available="[11.0 GiB 7.0 GiB]"
time=2025-05-01T16:40:09.952+02:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3moe.vision.block_count default=0
time=2025-05-01T16:40:09.953+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="118.2 GiB" before.free_swap="124.4 GiB" now.total="127.9 GiB" now.free="118.2 GiB" now.free_swap="124.4 GiB"
time=2025-05-01T16:40:09.995+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="862.0 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="172.0 MiB"
time=2025-05-01T16:40:09.995+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="0 B" before.total="8.0 GiB" before.free="7.0 GiB" now.total="8.0 GiB" now.free="7.0 GiB" now.used="986.6 MiB"
releasing nvml library
time=2025-05-01T16:40:09.997+02:00 level=INFO source=server.go:105 msg="system memory" total="127.9 GiB" free="118.2 GiB" free_swap="124.4 GiB"
time=2025-05-01T16:40:09.997+02:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=2 available="[11.0 GiB 7.0 GiB]"
time=2025-05-01T16:40:09.997+02:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3moe.vision.block_count default=0
time=2025-05-01T16:40:10.003+02:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=49 layers.offload=43 layers.split=27,16 memory.available="[11.0 GiB 7.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="19.5 GiB" memory.required.partial="17.4 GiB" memory.required.kv="192.0 MiB" memory.required.allocations="[10.6 GiB 6.7 GiB]" memory.weights.total="17.2 GiB" memory.weights.repeating="16.9 GiB" memory.weights.nonrepeating="243.4 MiB" memory.graph.full="256.0 MiB" memory.graph.partial="256.0 MiB"
time=2025-05-01T16:40:10.005+02:00 level=DEBUG source=server.go:262 msg="compatible gpu libraries" compatible="[cuda_v12 cuda_v11]"
llama_model_loader: loaded meta data with 31 key-value pairs and 579 tensors from C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 30B A3B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3
llama_model_loader: - kv   4:                         general.size_label str              = 30B-A3B
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                       qwen3moe.block_count u32              = 48
llama_model_loader: - kv   7:                    qwen3moe.context_length u32              = 40960
llama_model_loader: - kv   8:                  qwen3moe.embedding_length u32              = 2048
llama_model_loader: - kv   9:               qwen3moe.feed_forward_length u32              = 6144
llama_model_loader: - kv  10:              qwen3moe.attention.head_count u32              = 32
llama_model_loader: - kv  11:           qwen3moe.attention.head_count_kv u32              = 4
llama_model_loader: - kv  12:                    qwen3moe.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                 qwen3moe.expert_used_count u32              = 8
llama_model_loader: - kv  15:              qwen3moe.attention.key_length u32              = 128
llama_model_loader: - kv  16:            qwen3moe.attention.value_length u32              = 128
llama_model_loader: - kv  17:                      qwen3moe.expert_count u32              = 128
llama_model_loader: - kv  18:        qwen3moe.expert_feed_forward_length u32              = 768
llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  22:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  23:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  24:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  25:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  27:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - kv  30:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  241 tensors
llama_model_loader: - type  f16:   48 tensors
llama_model_loader: - type q4_K:  265 tensors
llama_model_loader: - type q6_K:   25 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 17.34 GiB (4.88 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3moe
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 30.53 B
print_info: general.name     = Qwen3 30B A3B
print_info: n_ff_exp         = 0
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-05-01T16:40:10.389+02:00 level=DEBUG source=server.go:335 msg="adding gpu library" path=C:\Users\Testeur1\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12
time=2025-05-01T16:40:10.389+02:00 level=DEBUG source=server.go:343 msg="adding gpu dependency paths" paths=[C:\Users\Testeur1\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12]
time=2025-05-01T16:40:10.389+02:00 level=INFO source=server.go:405 msg="starting llama server" cmd="C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Ollama_Models\\blobs\\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac --ctx-size 2048 --batch-size 512 --n-gpu-layers 43 --verbose --threads 14 --no-mmap --parallel 1 --tensor-split 27,16 --port 50067"
time=2025-05-01T16:40:10.389+02:00 level=DEBUG source=server.go:423 msg=subprocess environment="[CUDA_VISIBLE_DEVICES=GPU-962a842b-b382-6457-65a1-3cffec62ba6f,GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 PATH=C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v12;C:\\Program Files (x86)\\VMware\\VMware Workstation\\bin\\;C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\iCLS\\;C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\iCLS\\;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\Git\\cmd;C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\DAL;C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\DAL;C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\IPT;C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\IPT;C:\\Program Files\\dotnet\\;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Eclipse Adoptium\\jdk-21.0.7.6-hotspot\\bin;C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Python\\Launcher\\;C:\\Users\\Testeur1\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama;C:\\Users\\Testeur1\\AppData\\Local\\GitHubDesktop\\bin;C:\\Users\\Testeur1\\.dotnet\\tools;C:\\Python\\miniforge3;C:\\Python\\miniforge3\\Scripts;;C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v12;C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\lib\\ollama]"
time=2025-05-01T16:40:10.402+02:00 level=INFO source=sched.go:451 msg="loaded runners" count=1
time=2025-05-01T16:40:10.402+02:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
time=2025-05-01T16:40:10.406+02:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
time=2025-05-01T16:40:10.477+02:00 level=INFO source=runner.go:853 msg="starting go runner"
time=2025-05-01T16:40:10.627+02:00 level=DEBUG source=ggml.go:99 msg="ggml backend load all from path" path=C:\Users\Testeur1\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX A2000 12GB, compute capability 8.6, VMM: yes
  Device 1: Quadro M5000, compute capability 5.2, VMM: yes
load_backend: loaded CUDA backend from C:\Users\Testeur1\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12\ggml-cuda.dll
time=2025-05-01T16:40:11.194+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files (x86)\\VMware\\VMware Workstation\\bin"
time=2025-05-01T16:40:11.195+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\iCLS"
time=2025-05-01T16:40:11.195+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\iCLS"
time=2025-05-01T16:40:11.195+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Windows\system32
time=2025-05-01T16:40:11.195+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Windows
time=2025-05-01T16:40:11.195+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Windows\System32\Wbem
time=2025-05-01T16:40:11.195+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Windows\System32\WindowsPowerShell\v1.0
time=2025-05-01T16:40:11.195+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Windows\System32\OpenSSH
time=2025-05-01T16:40:11.195+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files\\Git\\cmd"
time=2025-05-01T16:40:11.195+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\DAL"
time=2025-05-01T16:40:11.195+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\DAL"
time=2025-05-01T16:40:11.195+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\IPT"
time=2025-05-01T16:40:11.195+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\IPT"
time=2025-05-01T16:40:11.195+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files\\dotnet"
time=2025-05-01T16:40:11.195+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common"
time=2025-05-01T16:40:11.195+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Eclipse Adoptium\\jdk-21.0.7.6-hotspot\\bin"
time=2025-05-01T16:40:11.195+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Users\Testeur1\AppData\Local\Programs\Python\Launcher
time=2025-05-01T16:40:11.195+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Users\Testeur1\AppData\Local\Microsoft\WindowsApps
time=2025-05-01T16:40:11.195+02:00 level=DEBUG source=ggml.go:99 msg="ggml backend load all from path" path=C:\Users\Testeur1\AppData\Local\Programs\Ollama
time=2025-05-01T16:40:11.205+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Users\Testeur1\AppData\Local\GitHubDesktop\bin
time=2025-05-01T16:40:11.205+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Users\Testeur1\.dotnet\tools
time=2025-05-01T16:40:11.205+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Python\miniforge3
time=2025-05-01T16:40:11.205+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Python\miniforge3\Scripts
time=2025-05-01T16:40:11.205+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\
time=2025-05-01T16:40:11.205+02:00 level=DEBUG source=ggml.go:99 msg="ggml backend load all from path" path=C:\Users\Testeur1\AppData\Local\Programs\Ollama\lib\ollama
load_backend: loaded CPU backend from C:\Users\Testeur1\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll
time=2025-05-01T16:40:11.239+02:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-05-01T16:40:11.241+02:00 level=INFO source=runner.go:913 msg="Server listening on 127.0.0.1:50067"
time=2025-05-01T16:40:11.416+02:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA RTX A2000 12GB) - 11248 MiB free
llama_model_load_from_file_impl: using device CUDA1 (Quadro M5000) - 7296 MiB free
llama_model_loader: loaded meta data with 31 key-value pairs and 579 tensors from C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 30B A3B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3
llama_model_loader: - kv   4:                         general.size_label str              = 30B-A3B
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                       qwen3moe.block_count u32              = 48
llama_model_loader: - kv   7:                    qwen3moe.context_length u32              = 40960
llama_model_loader: - kv   8:                  qwen3moe.embedding_length u32              = 2048
llama_model_loader: - kv   9:               qwen3moe.feed_forward_length u32              = 6144
llama_model_loader: - kv  10:              qwen3moe.attention.head_count u32              = 32
llama_model_loader: - kv  11:           qwen3moe.attention.head_count_kv u32              = 4
llama_model_loader: - kv  12:                    qwen3moe.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                 qwen3moe.expert_used_count u32              = 8
llama_model_loader: - kv  15:              qwen3moe.attention.key_length u32              = 128
llama_model_loader: - kv  16:            qwen3moe.attention.value_length u32              = 128
llama_model_loader: - kv  17:                      qwen3moe.expert_count u32              = 128
llama_model_loader: - kv  18:        qwen3moe.expert_feed_forward_length u32              = 768
llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  22:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  23:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  24:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  25:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  27:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - kv  30:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  241 tensors
llama_model_loader: - type  f16:   48 tensors
llama_model_loader: - type q4_K:  265 tensors
llama_model_loader: - type q6_K:   25 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 17.34 GiB (4.88 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3moe
print_info: vocab_only       = 0
print_info: n_ctx_train      = 40960
print_info: n_embd           = 2048
print_info: n_layer          = 48
print_info: n_head           = 32
print_info: n_head_kv        = 4
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 6144
print_info: n_expert         = 128
print_info: n_expert_used    = 8
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 40960
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = ?B
print_info: model params     = 30.53 B
print_info: general.name     = Qwen3 30B A3B
print_info: n_ff_exp         = 768
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: layer   0 assigned to device CPU, is_swa = 0
load_tensors: layer   1 assigned to device CPU, is_swa = 0
load_tensors: layer   2 assigned to device CPU, is_swa = 0
load_tensors: layer   3 assigned to device CPU, is_swa = 0
load_tensors: layer   4 assigned to device CPU, is_swa = 0
load_tensors: layer   5 assigned to device CUDA0, is_swa = 0
load_tensors: layer   6 assigned to device CUDA0, is_swa = 0
load_tensors: layer   7 assigned to device CUDA0, is_swa = 0
load_tensors: layer   8 assigned to device CUDA0, is_swa = 0
load_tensors: layer   9 assigned to device CUDA0, is_swa = 0
load_tensors: layer  10 assigned to device CUDA0, is_swa = 0
load_tensors: layer  11 assigned to device CUDA0, is_swa = 0
load_tensors: layer  12 assigned to device CUDA0, is_swa = 0
load_tensors: layer  13 assigned to device CUDA0, is_swa = 0
load_tensors: layer  14 assigned to device CUDA0, is_swa = 0
load_tensors: layer  15 assigned to device CUDA0, is_swa = 0
load_tensors: layer  16 assigned to device CUDA0, is_swa = 0
load_tensors: layer  17 assigned to device CUDA0, is_swa = 0
load_tensors: layer  18 assigned to device CUDA0, is_swa = 0
load_tensors: layer  19 assigned to device CUDA0, is_swa = 0
load_tensors: layer  20 assigned to device CUDA0, is_swa = 0
load_tensors: layer  21 assigned to device CUDA0, is_swa = 0
load_tensors: layer  22 assigned to device CUDA0, is_swa = 0
load_tensors: layer  23 assigned to device CUDA0, is_swa = 0
load_tensors: layer  24 assigned to device CUDA0, is_swa = 0
load_tensors: layer  25 assigned to device CUDA0, is_swa = 0
load_tensors: layer  26 assigned to device CUDA0, is_swa = 0
load_tensors: layer  27 assigned to device CUDA0, is_swa = 0
load_tensors: layer  28 assigned to device CUDA0, is_swa = 0
load_tensors: layer  29 assigned to device CUDA0, is_swa = 0
load_tensors: layer  30 assigned to device CUDA0, is_swa = 0
load_tensors: layer  31 assigned to device CUDA0, is_swa = 0
load_tensors: layer  32 assigned to device CUDA1, is_swa = 0
load_tensors: layer  33 assigned to device CUDA1, is_swa = 0
load_tensors: layer  34 assigned to device CUDA1, is_swa = 0
load_tensors: layer  35 assigned to device CUDA1, is_swa = 0
load_tensors: layer  36 assigned to device CUDA1, is_swa = 0
load_tensors: layer  37 assigned to device CUDA1, is_swa = 0
load_tensors: layer  38 assigned to device CUDA1, is_swa = 0
load_tensors: layer  39 assigned to device CUDA1, is_swa = 0
load_tensors: layer  40 assigned to device CUDA1, is_swa = 0
load_tensors: layer  41 assigned to device CUDA1, is_swa = 0
load_tensors: layer  42 assigned to device CUDA1, is_swa = 0
load_tensors: layer  43 assigned to device CUDA1, is_swa = 0
load_tensors: layer  44 assigned to device CUDA1, is_swa = 0
load_tensors: layer  45 assigned to device CUDA1, is_swa = 0
load_tensors: layer  46 assigned to device CUDA1, is_swa = 0
load_tensors: layer  47 assigned to device CUDA1, is_swa = 0
load_tensors: layer  48 assigned to device CPU, is_swa = 0
load_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
load_tensors: offloading 43 repeating layers to GPU
load_tensors: offloaded 43/49 layers to GPU
load_tensors:    CUDA_Host model buffer size =  2173.83 MiB
load_tensors:        CUDA0 model buffer size =  9533.14 MiB
load_tensors:        CUDA1 model buffer size =  5880.27 MiB
load_tensors:          CPU model buffer size =   166.92 MiB
load_all_data: buffer type CUDA_Host is not the default buffer type for device CUDA0 for async uploads
time=2025-05-01T16:40:12.673+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.01"
time=2025-05-01T16:40:12.925+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.04"
time=2025-05-01T16:40:13.176+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.06"
time=2025-05-01T16:40:13.428+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.09"
load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0
time=2025-05-01T16:40:13.679+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.12"
time=2025-05-01T16:40:13.930+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.15"
time=2025-05-01T16:40:14.182+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.18"
time=2025-05-01T16:40:14.433+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.20"
time=2025-05-01T16:40:14.684+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.22"
time=2025-05-01T16:40:14.935+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.25"
time=2025-05-01T16:40:15.187+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.28"
time=2025-05-01T16:40:15.438+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.30"
time=2025-05-01T16:40:15.690+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.32"
time=2025-05-01T16:40:15.940+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.35"
time=2025-05-01T16:40:16.192+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.37"
time=2025-05-01T16:40:16.443+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.40"
time=2025-05-01T16:40:16.694+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.42"
time=2025-05-01T16:40:16.946+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.45"
time=2025-05-01T16:40:17.197+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.47"
time=2025-05-01T16:40:17.448+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.50"
time=2025-05-01T16:40:17.699+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.52"
time=2025-05-01T16:40:17.951+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.55"
time=2025-05-01T16:40:18.202+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.57"
time=2025-05-01T16:40:18.453+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.59"
time=2025-05-01T16:40:18.704+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.62"
time=2025-05-01T16:40:18.955+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.65"
load_all_data: using async uploads for device CUDA1, buffer type CUDA1, backend CUDA1
time=2025-05-01T16:40:19.207+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.66"
time=2025-05-01T16:45:19.402+02:00 level=ERROR source=sched.go:457 msg="error loading llama server" error="timed out waiting for llama runner to start - progress 0.66 - "
time=2025-05-01T16:45:19.402+02:00 level=DEBUG source=sched.go:460 msg="triggering expiration for failed load" model=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
time=2025-05-01T16:45:19.403+02:00 level=DEBUG source=sched.go:362 msg="runner expired event received" modelPath=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
time=2025-05-01T16:45:19.403+02:00 level=DEBUG source=sched.go:377 msg="got lock to unload" modelPath=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
time=2025-05-01T16:45:19.403+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="118.2 GiB" before.free_swap="124.4 GiB" now.total="127.9 GiB" now.free="115.5 GiB" now.free_swap="106.2 GiB"
[GIN] 2025/05/01 - 16:45:19 | 500 |          5m9s |       127.0.0.1 | POST     "/api/generate"
time=2025-05-01T16:45:19.446+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="862.0 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="1.6 GiB" now.used="9.6 GiB"
time=2025-05-01T16:45:19.446+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="0 B" before.total="8.0 GiB" before.free="7.0 GiB" now.total="8.0 GiB" now.free="1.2 GiB" now.used="6.8 GiB"
releasing nvml library
time=2025-05-01T16:45:19.477+02:00 level=DEBUG source=server.go:1001 msg="stopping llama server"
time=2025-05-01T16:45:19.485+02:00 level=DEBUG source=server.go:1007 msg="waiting for llama server to exit"
time=2025-05-01T16:45:19.698+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="115.5 GiB" before.free_swap="106.2 GiB" now.total="127.9 GiB" now.free="115.6 GiB" now.free_swap="116.5 GiB"
time=2025-05-01T16:45:19.873+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="862.0 MiB" before.total="12.0 GiB" before.free="1.6 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="172.0 MiB"
time=2025-05-01T16:45:19.873+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="0 B" before.total="8.0 GiB" before.free="1.2 GiB" now.total="8.0 GiB" now.free="7.0 GiB" now.used="1021.9 MiB"
releasing nvml library
time=2025-05-01T16:45:19.875+02:00 level=DEBUG source=sched.go:661 msg="gpu VRAM free memory converged after 0.47 seconds" model=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
time=2025-05-01T16:45:19.990+02:00 level=DEBUG source=server.go:1011 msg="llama server stopped"
time=2025-05-01T16:45:19.990+02:00 level=DEBUG source=sched.go:382 msg="runner released" modelPath=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
time=2025-05-01T16:45:19.990+02:00 level=DEBUG source=sched.go:386 msg="sending an unloaded event" modelPath=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
time=2025-05-01T16:45:19.990+02:00 level=DEBUG source=sched.go:310 msg="ignoring unload event with no pending requests"
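
In this run the load advances steadily to progress 0.66, and the last progress update lands immediately after `load_all_data` switches to async uploads for device CUDA1 (the Quadro M5000); the scheduler then aborts exactly when the configured `OLLAMA_LOAD_TIMEOUT` of 5m0s elapses (16:40:19 → 16:45:19). If those CUDA1 uploads are merely slow rather than wedged, a longer load timeout might let the runner finish initializing. A minimal retry sketch, assuming the same PowerShell workflow used elsewhere in this issue (the 30m value is an arbitrary illustration, not a recommendation):

```
# Raise the runner load timeout (default here: 5m0s) before starting the
# server, then retry the failing model. OLLAMA_LOAD_TIMEOUT accepts Go
# duration strings such as "30m".
PS C:\> $env:OLLAMA_DEBUG="1"
PS C:\> $env:OLLAMA_LOAD_TIMEOUT="30m"
PS C:\> & "ollama app.exe"
PS C:\> ollama run qwen3:30b-a3b --verbose
```

If the load still dies at the same progress value with a much larger timeout, the upload to CUDA1 is likely hung rather than slow.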
<!-- gh-comment-id:2844982592 -->
@neoxeo commented on GitHub (May 1, 2025):

PS C:\> $env:OLLAMA_DEBUG="1"; $env:CUDA_VISIBLE_DEVICES="0,1" ; & "ollama app.exe"
PS C:\> ollama run qwen3:30b-a3b --verbose

=> Error

![Image](https://github.com/user-attachments/assets/e7225045-0f17-45d1-8723-42811dd1765f)

Logs:

```
2025/05/01 16:40:02 routes.go:1232: INFO server config env="map[CUDA_VISIBLE_DEVICES:0,1 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\\Ollama_Models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-05-01T16:40:02.855+02:00 level=INFO source=images.go:458 msg="total blobs: 52"
time=2025-05-01T16:40:02.874+02:00 level=INFO source=images.go:465 msg="total unused blobs removed: 0"
time=2025-05-01T16:40:02.893+02:00 level=INFO source=routes.go:1299 msg="Listening on 127.0.0.1:11434 (version 0.6.6)"
time=2025-05-01T16:40:02.893+02:00 level=DEBUG source=sched.go:107 msg="starting llm scheduler"
time=2025-05-01T16:40:02.894+02:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-05-01T16:40:02.894+02:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-05-01T16:40:02.894+02:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=14 efficiency=0 threads=28
time=2025-05-01T16:40:02.894+02:00 level=DEBUG source=gpu.go:98 msg="searching for GPU discovery libraries for NVIDIA"
time=2025-05-01T16:40:02.894+02:00 level=DEBUG source=gpu.go:501 msg="Searching for GPU library" name=nvml.dll
time=2025-05-01T16:40:02.894+02:00 level=DEBUG source=gpu.go:525 msg="gpu library search" globs="[C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\nvml.dll C:\\Program Files (x86)\\VMware\\VMware Workstation\\bin\\nvml.dll C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\iCLS\\nvml.dll C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\iCLS\\nvml.dll C:\\Windows\\system32\\nvml.dll C:\\Windows\\nvml.dll C:\\Windows\\System32\\Wbem\\nvml.dll C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\nvml.dll C:\\Windows\\System32\\OpenSSH\\nvml.dll C:\\Program Files\\Git\\cmd\\nvml.dll C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\DAL\\nvml.dll C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\DAL\\nvml.dll C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\IPT\\nvml.dll C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\IPT\\nvml.dll C:\\Program Files\\dotnet\\nvml.dll C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvml.dll C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Eclipse Adoptium\\jdk-21.0.7.6-hotspot\\bin\\nvml.dll C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Python\\Launcher\\nvml.dll C:\\Users\\Testeur1\\AppData\\Local\\Microsoft\\WindowsApps\\nvml.dll C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\nvml.dll C:\\Users\\Testeur1\\AppData\\Local\\GitHubDesktop\\bin\\nvml.dll C:\\Users\\Testeur1\\.dotnet\\tools\\nvml.dll C:\\Python\\miniforge3\\nvml.dll C:\\Python\\miniforge3\\Scripts\\nvml.dll C:\\nvml.dll c:\\Windows\\System32\\nvml.dll]"
time=2025-05-01T16:40:02.901+02:00 level=DEBUG source=gpu.go:529 msg="skipping PhysX cuda library path" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvml.dll"
time=2025-05-01T16:40:02.906+02:00 level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths="[C:\\Windows\\system32\\nvml.dll c:\\Windows\\System32\\nvml.dll]"
time=2025-05-01T16:40:02.965+02:00 level=DEBUG source=gpu.go:111 msg="nvidia-ml loaded" library=C:\Windows\system32\nvml.dll
time=2025-05-01T16:40:02.965+02:00 level=DEBUG source=gpu.go:501 msg="Searching for GPU library" name=nvcuda.dll
time=2025-05-01T16:40:02.972+02:00 level=DEBUG source=gpu.go:525 msg="gpu library search" globs="[C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\nvcuda.dll C:\\Program Files (x86)\\VMware\\VMware Workstation\\bin\\nvcuda.dll C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\iCLS\\nvcuda.dll C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\iCLS\\nvcuda.dll C:\\Windows\\system32\\nvcuda.dll C:\\Windows\\nvcuda.dll C:\\Windows\\System32\\Wbem\\nvcuda.dll C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\nvcuda.dll C:\\Windows\\System32\\OpenSSH\\nvcuda.dll C:\\Program Files\\Git\\cmd\\nvcuda.dll C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\DAL\\nvcuda.dll C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\DAL\\nvcuda.dll C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\IPT\\nvcuda.dll C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\IPT\\nvcuda.dll C:\\Program Files\\dotnet\\nvcuda.dll C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvcuda.dll C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Eclipse Adoptium\\jdk-21.0.7.6-hotspot\\bin\\nvcuda.dll C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Python\\Launcher\\nvcuda.dll C:\\Users\\Testeur1\\AppData\\Local\\Microsoft\\WindowsApps\\nvcuda.dll C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\nvcuda.dll C:\\Users\\Testeur1\\AppData\\Local\\GitHubDesktop\\bin\\nvcuda.dll C:\\Users\\Testeur1\\.dotnet\\tools\\nvcuda.dll C:\\Python\\miniforge3\\nvcuda.dll C:\\Python\\miniforge3\\Scripts\\nvcuda.dll C:\\nvcuda.dll c:\\windows\\system*\\nvcuda.dll]"
time=2025-05-01T16:40:02.979+02:00 level=DEBUG source=gpu.go:529 msg="skipping PhysX cuda library path" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvcuda.dll"
time=2025-05-01T16:40:02.984+02:00 level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths=[C:\Windows\system32\nvcuda.dll]
initializing C:\Windows\system32\nvcuda.dll
dlsym: cuInit - 00007FFACF351F80
dlsym: cuDriverGetVersion - 00007FFACF352020
dlsym: cuDeviceGetCount - 00007FFACF352816
dlsym: cuDeviceGet - 00007FFACF352810
dlsym: cuDeviceGetAttribute - 00007FFACF352170
dlsym: cuDeviceGetUuid - 00007FFACF352822
dlsym: cuDeviceGetName - 00007FFACF35281C
dlsym: cuCtxCreate_v3 - 00007FFACF352894
dlsym: cuMemGetInfo_v2 - 00007FFACF352996
dlsym: cuCtxDestroy - 00007FFACF3528A6
calling cuInit
calling cuDriverGetVersion
raw version 0x2f3a
CUDA driver version: 12.9
calling cuDeviceGetCount
device count 2
time=2025-05-01T16:40:03.028+02:00 level=DEBUG source=gpu.go:125 msg="detected GPUs" count=2 library=C:\Windows\system32\nvcuda.dll
[GPU-962a842b-b382-6457-65a1-3cffec62ba6f] CUDA totalMem 12281 mb
[GPU-962a842b-b382-6457-65a1-3cffec62ba6f] CUDA freeMem 11248 mb
[GPU-962a842b-b382-6457-65a1-3cffec62ba6f] Compute Capability 8.6
time=2025-05-01T16:40:03.251+02:00 level=INFO source=gpu.go:319 msg="detected OS VRAM overhead" id=GPU-962a842b-b382-6457-65a1-3cffec62ba6f library=cuda compute=8.6 driver=12.9 name="NVIDIA RTX A2000 12GB" overhead="862.0 MiB"
[GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385] CUDA totalMem 8191 mb
[GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385] CUDA freeMem 7296 mb
[GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385] Compute Capability 5.2
time=2025-05-01T16:40:03.415+02:00 level=DEBUG source=amd_windows.go:34 msg="unable to load amdhip64_6.dll, please make sure to upgrade to the latest amd driver: Le module spécifié est introuvable."
releasing cuda driver library
releasing nvml library
time=2025-05-01T16:40:03.419+02:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-962a842b-b382-6457-65a1-3cffec62ba6f library=cuda variant=v12 compute=8.6 driver=12.9 name="NVIDIA RTX A2000 12GB" total="12.0 GiB" available="11.0 GiB"
time=2025-05-01T16:40:03.419+02:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 library=cuda variant=v12 compute=5.2 driver=12.9 name="Quadro M5000" total="8.0 GiB" available="7.1 GiB"
[GIN] 2025/05/01 - 16:40:09 | 200 | 0s | 127.0.0.1 | HEAD "/"
time=2025-05-01T16:40:09.501+02:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-01T16:40:09.543+02:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
[GIN] 2025/05/01 - 16:40:09 | 200 | 119.4624ms | 127.0.0.1 | POST "/api/show"
time=2025-05-01T16:40:09.779+02:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-01T16:40:09.792+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="118.2 GiB" before.free_swap="124.4 GiB" now.total="127.9 GiB" now.free="118.2 GiB" now.free_swap="124.4 GiB"
time=2025-05-01T16:40:09.823+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="862.0 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="172.0 MiB"
time=2025-05-01T16:40:09.824+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="0 B" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.0 GiB" now.used="986.6 MiB"
releasing nvml library
time=2025-05-01T16:40:09.825+02:00 level=DEBUG source=sched.go:183 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=6 gpu_count=2
time=2025-05-01T16:40:09.858+02:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-01T16:40:09.930+02:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-01T16:40:09.946+02:00 level=DEBUG source=sched.go:226 msg="loading first model" model=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
time=2025-05-01T16:40:09.946+02:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[11.0 GiB]"
time=2025-05-01T16:40:09.946+02:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3moe.vision.block_count default=0
```
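
The retry with `CUDA_VISIBLE_DEVICES="0,1"` reproduces the same timeout; the rest of this run's log is identical to the transcript above, again stalling once async uploads to CUDA1 begin. A further way to narrow this down, offered only as a diagnostic sketch, would be to hide the Quadro M5000 and let the RTX A2000 carry the load alone: with a single 12 GB card fewer of the model's 49 layers will fit on GPU, but a clean load would point at the CUDA1 (Quadro M5000) upload path as the part that hangs:

```
# Diagnostic sketch: expose only device 0 (the RTX A2000) and retry.
# Expect fewer layers offloaded; the point is whether loading completes.
PS C:\> $env:OLLAMA_DEBUG="1"
PS C:\> $env:CUDA_VISIBLE_DEVICES="0"
PS C:\> & "ollama app.exe"
PS C:\> ollama run qwen3:30b-a3b --verbose
```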
= 151663 '<|repo_name|>' print_info: EOG token = 151664 '<|file_sep|>' print_info: max token length = 256 load_tensors: loading model tensors, this can take a while... (mmap = false) load_tensors: layer 0 assigned to device CPU, is_swa = 0 load_tensors: layer 1 assigned to device CPU, is_swa = 0 load_tensors: layer 2 assigned to device CPU, is_swa = 0 load_tensors: layer 3 assigned to device CPU, is_swa = 0 load_tensors: layer 4 assigned to device CPU, is_swa = 0 load_tensors: layer 5 assigned to device CUDA0, is_swa = 0 load_tensors: layer 6 assigned to device CUDA0, is_swa = 0 load_tensors: layer 7 assigned to device CUDA0, is_swa = 0 load_tensors: layer 8 assigned to device CUDA0, is_swa = 0 load_tensors: layer 9 assigned to device CUDA0, is_swa = 0 load_tensors: layer 10 assigned to device CUDA0, is_swa = 0 load_tensors: layer 11 assigned to device CUDA0, is_swa = 0 load_tensors: layer 12 assigned to device CUDA0, is_swa = 0 load_tensors: layer 13 assigned to device CUDA0, is_swa = 0 load_tensors: layer 14 assigned to device CUDA0, is_swa = 0 load_tensors: layer 15 assigned to device CUDA0, is_swa = 0 load_tensors: layer 16 assigned to device CUDA0, is_swa = 0 load_tensors: layer 17 assigned to device CUDA0, is_swa = 0 load_tensors: layer 18 assigned to device CUDA0, is_swa = 0 load_tensors: layer 19 assigned to device CUDA0, is_swa = 0 load_tensors: layer 20 assigned to device CUDA0, is_swa = 0 load_tensors: layer 21 assigned to device CUDA0, is_swa = 0 load_tensors: layer 22 assigned to device CUDA0, is_swa = 0 load_tensors: layer 23 assigned to device CUDA0, is_swa = 0 load_tensors: layer 24 assigned to device CUDA0, is_swa = 0 load_tensors: layer 25 assigned to device CUDA0, is_swa = 0 load_tensors: layer 26 assigned to device CUDA0, is_swa = 0 load_tensors: layer 27 assigned to device CUDA0, is_swa = 0 load_tensors: layer 28 assigned to device CUDA0, is_swa = 0 load_tensors: layer 29 assigned to device CUDA0, is_swa = 0 load_tensors: layer 30 assigned to device CUDA0, is_swa = 0 load_tensors: layer 31 assigned to device CUDA0, is_swa = 0 load_tensors: layer 32 assigned to device CUDA1, is_swa = 0 load_tensors: layer 33 assigned to device CUDA1, is_swa = 0 load_tensors: layer 34 assigned to device CUDA1, is_swa = 0 load_tensors: layer 35 assigned to device CUDA1, is_swa = 0 load_tensors: layer 36 assigned to device CUDA1, is_swa = 0 load_tensors: layer 37 assigned to device CUDA1, is_swa = 0 load_tensors: layer 38 assigned to device CUDA1, is_swa = 0 load_tensors: layer 39 assigned to device CUDA1, is_swa = 0 load_tensors: layer 40 assigned to device CUDA1, is_swa = 0 load_tensors: layer 41 assigned to device CUDA1, is_swa = 0 load_tensors: layer 42 assigned to device CUDA1, is_swa = 0 load_tensors: layer 43 assigned to device CUDA1, is_swa = 0 load_tensors: layer 44 assigned to device CUDA1, is_swa = 0 load_tensors: layer 45 assigned to device CUDA1, is_swa = 0 load_tensors: layer 46 assigned to device CUDA1, is_swa = 0 load_tensors: layer 47 assigned to device CUDA1, is_swa = 0 load_tensors: layer 48 assigned to device CPU, is_swa = 0 load_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead load_tensors: offloading 43 repeating layers to GPU load_tensors: offloaded 43/49 layers to GPU load_tensors: CUDA_Host model buffer size = 2173.83 MiB load_tensors: CUDA0 model buffer size = 9533.14 MiB load_tensors: CUDA1 model buffer size = 5880.27 MiB load_tensors: CPU model buffer size = 166.92 MiB load_all_data: 
buffer type CUDA_Host is not the default buffer type for device CUDA0 for async uploads time=2025-05-01T16:40:12.673+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.01" time=2025-05-01T16:40:12.925+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.04" time=2025-05-01T16:40:13.176+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.06" time=2025-05-01T16:40:13.428+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.09" load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0 time=2025-05-01T16:40:13.679+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.12" time=2025-05-01T16:40:13.930+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.15" time=2025-05-01T16:40:14.182+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.18" time=2025-05-01T16:40:14.433+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.20" time=2025-05-01T16:40:14.684+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.22" time=2025-05-01T16:40:14.935+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.25" time=2025-05-01T16:40:15.187+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.28" time=2025-05-01T16:40:15.438+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.30" time=2025-05-01T16:40:15.690+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.32" time=2025-05-01T16:40:15.940+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.35" time=2025-05-01T16:40:16.192+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.37" time=2025-05-01T16:40:16.443+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.40" time=2025-05-01T16:40:16.694+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.42" time=2025-05-01T16:40:16.946+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.45" time=2025-05-01T16:40:17.197+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.47" time=2025-05-01T16:40:17.448+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.50" time=2025-05-01T16:40:17.699+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.52" time=2025-05-01T16:40:17.951+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.55" time=2025-05-01T16:40:18.202+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.57" time=2025-05-01T16:40:18.453+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.59" time=2025-05-01T16:40:18.704+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.62" time=2025-05-01T16:40:18.955+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.65" load_all_data: using async uploads for device CUDA1, buffer type CUDA1, backend CUDA1 time=2025-05-01T16:40:19.207+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.66" time=2025-05-01T16:45:19.402+02:00 level=ERROR source=sched.go:457 msg="error loading llama server" error="timed out waiting for llama runner to start - progress 0.66 - " time=2025-05-01T16:45:19.402+02:00 level=DEBUG source=sched.go:460 msg="triggering expiration for failed load" model=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac time=2025-05-01T16:45:19.403+02:00 level=DEBUG source=sched.go:362 msg="runner expired event received" modelPath=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac 
time=2025-05-01T16:45:19.403+02:00 level=DEBUG source=sched.go:377 msg="got lock to unload" modelPath=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac time=2025-05-01T16:45:19.403+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="118.2 GiB" before.free_swap="124.4 GiB" now.total="127.9 GiB" now.free="115.5 GiB" now.free_swap="106.2 GiB" [GIN] 2025/05/01 - 16:45:19 | 500 | 5m9s | 127.0.0.1 | POST "/api/generate" time=2025-05-01T16:45:19.446+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="862.0 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="1.6 GiB" now.used="9.6 GiB" time=2025-05-01T16:45:19.446+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="0 B" before.total="8.0 GiB" before.free="7.0 GiB" now.total="8.0 GiB" now.free="1.2 GiB" now.used="6.8 GiB" releasing nvml library time=2025-05-01T16:45:19.477+02:00 level=DEBUG source=server.go:1001 msg="stopping llama server" time=2025-05-01T16:45:19.485+02:00 level=DEBUG source=server.go:1007 msg="waiting for llama server to exit" time=2025-05-01T16:45:19.698+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="115.5 GiB" before.free_swap="106.2 GiB" now.total="127.9 GiB" now.free="115.6 GiB" now.free_swap="116.5 GiB" time=2025-05-01T16:45:19.873+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="862.0 MiB" before.total="12.0 GiB" before.free="1.6 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="172.0 MiB" time=2025-05-01T16:45:19.873+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="0 B" before.total="8.0 GiB" before.free="1.2 GiB" now.total="8.0 GiB" now.free="7.0 GiB" now.used="1021.9 MiB" releasing nvml library time=2025-05-01T16:45:19.875+02:00 level=DEBUG source=sched.go:661 msg="gpu VRAM free memory converged after 0.47 seconds" model=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac time=2025-05-01T16:45:19.990+02:00 level=DEBUG source=server.go:1011 msg="llama server stopped" time=2025-05-01T16:45:19.990+02:00 level=DEBUG source=sched.go:382 msg="runner released" modelPath=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac time=2025-05-01T16:45:19.990+02:00 level=DEBUG source=sched.go:386 msg="sending an unloaded event" modelPath=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac time=2025-05-01T16:45:19.990+02:00 level=DEBUG source=sched.go:310 msg="ignoring unload event with no pending requests" ```

@rick-github commented on GitHub (May 1, 2025):

So the same stall shows up in both orderings. M5000 then A2000:

load_tensors: layer   4 assigned to device CPU, is_swa = 0
load_tensors: layer   5 assigned to device CUDA0, is_swa = 0

time=2025-05-01T16:34:23.156+02:00 level=ERROR source=sched.go:457 msg="error loading llama server"
 error="timed out waiting for llama runner to start - progress 0.12 - "

and A2000 then M5000:

load_tensors: layer  31 assigned to device CUDA0, is_swa = 0
load_tensors: layer  32 assigned to device CUDA1, is_swa = 0

time=2025-05-01T16:45:19.402+02:00 level=ERROR source=sched.go:457 msg="error loading llama server"
 error="timed out waiting for llama runner to start - progress 0.66 - "

the load stalled right at the point where loading moved from the previous device (CPU or the A2000) to the M5000: with 49 layers, progress 0.12 corresponds to layer 49 × 0.12 ≈ 5.9, and progress 0.66 to layer 49 × 0.66 ≈ 32, matching the layer boundaries quoted above (a quick sketch of this arithmetic follows this comment). Yet when it was just the M5000 on its own, it worked fine, loading in less than 6 seconds:

time=2025-05-01T16:16:21.674+02:00 level=INFO source=server.go:619 msg="llama runner started in 5.79 seconds"

I don't know what would cause this. I don't think you could work around it by setting OLLAMA_LOAD_TIMEOUT high since the load seems to come to a dead stop. Maybe there's some sort of PCI contention? Is changing the slots the cards are plugged in to an option?
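
A minimal sketch of the layer arithmetic above (not Ollama code), assuming the reported "model load progress" fraction grows roughly linearly with the 49 layers reported by `load_tensors` (48 repeating layers plus the output layer); the 0.12 / 0.66 progress values and the 49-layer count are taken from the logs in this thread:

```go
// Sketch only: estimate which layer the loader had reached when it
// stalled, by scaling the last reported progress fraction to the
// total layer count from load_tensors.
package main

import "fmt"

// stallLayer maps a load-progress fraction to an approximate layer index.
func stallLayer(progress float64, totalLayers int) float64 {
	return progress * float64(totalLayers)
}

func main() {
	// M5000 as CUDA0: stall at progress 0.12 -> layer ~5.9,
	// right where layer 5 onward is assigned to the M5000.
	fmt.Printf("stall near layer %.1f\n", stallLayer(0.12, 49))

	// A2000 as CUDA0, M5000 as CUDA1: stall at progress 0.66 -> layer ~32.3,
	// right where layer 32 onward is assigned to the M5000.
	fmt.Printf("stall near layer %.1f\n", stallLayer(0.66, 49))
}
```

Under that linearity assumption, both runs stop at the first layer assigned to the M5000, which is what points to that card (or its slot/link) rather than to the model or the timeout.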


@neoxeo commented on GitHub (May 1, 2025):

Thanks again for the time you're putting in to help solve my problem.

Is changing the slots the cards are plugged in to an option? Yes, I can try to swap the two cards.

I'll test this now and report the results.

If the problem is still present, the only remaining option will be to replace the M5000 with a 3060 12 GB and hope that solves it ...


@neoxeo commented on GitHub (May 1, 2025):

I have swapped the two cards, and now I get new errors when I run this:
PS C:\> $env:OLLAMA_DEBUG="1"; $env:CUDA_VISIBLE_DEVICES="0,1" ; & "ollama app.exe"
PS C:\> ollama run qwen3:30b-a3b --verbose

Log:

2025/05/01 17:11:22 routes.go:1232: INFO server config env="map[CUDA_VISIBLE_DEVICES:0,1 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\\Ollama_Models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-05-01T17:11:22.567+02:00 level=INFO source=images.go:458 msg="total blobs: 52"
time=2025-05-01T17:11:22.585+02:00 level=INFO source=images.go:465 msg="total unused blobs removed: 0"
time=2025-05-01T17:11:22.615+02:00 level=INFO source=routes.go:1299 msg="Listening on 127.0.0.1:11434 (version 0.6.6)"
time=2025-05-01T17:11:22.616+02:00 level=DEBUG source=sched.go:107 msg="starting llm scheduler"
time=2025-05-01T17:11:22.616+02:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-05-01T17:11:22.617+02:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-05-01T17:11:22.618+02:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=14 efficiency=0 threads=28
time=2025-05-01T17:11:22.618+02:00 level=DEBUG source=gpu.go:98 msg="searching for GPU discovery libraries for NVIDIA"
time=2025-05-01T17:11:22.618+02:00 level=DEBUG source=gpu.go:501 msg="Searching for GPU library" name=nvml.dll
time=2025-05-01T17:11:22.618+02:00 level=DEBUG source=gpu.go:525 msg="gpu library search" globs="[C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\nvml.dll C:\\Program Files (x86)\\VMware\\VMware Workstation\\bin\\nvml.dll C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\iCLS\\nvml.dll C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\iCLS\\nvml.dll C:\\Windows\\system32\\nvml.dll C:\\Windows\\nvml.dll C:\\Windows\\System32\\Wbem\\nvml.dll C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\nvml.dll C:\\Windows\\System32\\OpenSSH\\nvml.dll C:\\Program Files\\Git\\cmd\\nvml.dll C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\DAL\\nvml.dll C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\DAL\\nvml.dll C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\IPT\\nvml.dll C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\IPT\\nvml.dll C:\\Program Files\\dotnet\\nvml.dll C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvml.dll C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Eclipse Adoptium\\jdk-21.0.7.6-hotspot\\bin\\nvml.dll C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Python\\Launcher\\nvml.dll C:\\Users\\Testeur1\\AppData\\Local\\Microsoft\\WindowsApps\\nvml.dll C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\nvml.dll C:\\Users\\Testeur1\\AppData\\Local\\GitHubDesktop\\bin\\nvml.dll C:\\Users\\Testeur1\\.dotnet\\tools\\nvml.dll C:\\Python\\miniforge3\\nvml.dll C:\\Python\\miniforge3\\Scripts\\nvml.dll C:\\Users\\Testeur1\\nvml.dll c:\\Windows\\System32\\nvml.dll]"
time=2025-05-01T17:11:22.626+02:00 level=DEBUG source=gpu.go:529 msg="skipping PhysX cuda library path" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvml.dll"
time=2025-05-01T17:11:22.629+02:00 level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths="[C:\\Windows\\system32\\nvml.dll c:\\Windows\\System32\\nvml.dll]"
time=2025-05-01T17:11:22.650+02:00 level=DEBUG source=gpu.go:111 msg="nvidia-ml loaded" library=C:\Windows\system32\nvml.dll
time=2025-05-01T17:11:22.650+02:00 level=DEBUG source=gpu.go:501 msg="Searching for GPU library" name=nvcuda.dll
time=2025-05-01T17:11:22.661+02:00 level=DEBUG source=gpu.go:525 msg="gpu library search" globs="[C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\nvcuda.dll C:\\Program Files (x86)\\VMware\\VMware Workstation\\bin\\nvcuda.dll C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\iCLS\\nvcuda.dll C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\iCLS\\nvcuda.dll C:\\Windows\\system32\\nvcuda.dll C:\\Windows\\nvcuda.dll C:\\Windows\\System32\\Wbem\\nvcuda.dll C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\nvcuda.dll C:\\Windows\\System32\\OpenSSH\\nvcuda.dll C:\\Program Files\\Git\\cmd\\nvcuda.dll C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\DAL\\nvcuda.dll C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\DAL\\nvcuda.dll C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\IPT\\nvcuda.dll C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\IPT\\nvcuda.dll C:\\Program Files\\dotnet\\nvcuda.dll C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvcuda.dll C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Eclipse Adoptium\\jdk-21.0.7.6-hotspot\\bin\\nvcuda.dll C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Python\\Launcher\\nvcuda.dll C:\\Users\\Testeur1\\AppData\\Local\\Microsoft\\WindowsApps\\nvcuda.dll C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\nvcuda.dll C:\\Users\\Testeur1\\AppData\\Local\\GitHubDesktop\\bin\\nvcuda.dll C:\\Users\\Testeur1\\.dotnet\\tools\\nvcuda.dll C:\\Python\\miniforge3\\nvcuda.dll C:\\Python\\miniforge3\\Scripts\\nvcuda.dll C:\\Users\\Testeur1\\nvcuda.dll c:\\windows\\system*\\nvcuda.dll]"
time=2025-05-01T17:11:22.668+02:00 level=DEBUG source=gpu.go:529 msg="skipping PhysX cuda library path" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvcuda.dll"
time=2025-05-01T17:11:22.673+02:00 level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths=[C:\Windows\system32\nvcuda.dll]
initializing C:\Windows\system32\nvcuda.dll
dlsym: cuInit - 00007FFCFC396590
dlsym: cuDriverGetVersion - 00007FFCFC396640
dlsym: cuDeviceGetCount - 00007FFCFC3A4416
dlsym: cuDeviceGet - 00007FFCFC3A4410
dlsym: cuDeviceGetAttribute - 00007FFCFC3967B0
dlsym: cuDeviceGetUuid - 00007FFCFC3A4422
dlsym: cuDeviceGetName - 00007FFCFC3A441C
dlsym: cuCtxCreate_v3 - 00007FFCFC3A4494
dlsym: cuMemGetInfo_v2 - 00007FFCFC3A4578
dlsym: cuCtxDestroy - 00007FFCFC3A44A0
calling cuInit
calling cuDriverGetVersion
raw version 0x2b20
CUDA driver version: 11.4
calling cuDeviceGetCount
device count 2
time=2025-05-01T17:11:22.887+02:00 level=DEBUG source=gpu.go:125 msg="detected GPUs" count=2 library=C:\Windows\system32\nvcuda.dll
[GPU-962a842b-b382-6457-65a1-3cffec62ba6f] CUDA totalMem 12282 mb
[GPU-962a842b-b382-6457-65a1-3cffec62ba6f] CUDA freeMem 11282 mb
[GPU-962a842b-b382-6457-65a1-3cffec62ba6f] Compute Capability 8.6
time=2025-05-01T17:11:23.091+02:00 level=INFO source=gpu.go:319 msg="detected OS VRAM overhead" id=GPU-962a842b-b382-6457-65a1-3cffec62ba6f library=cuda compute=8.6 driver=11.4 name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB"
[GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385] CUDA totalMem 8192 mb
[GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385] CUDA freeMem 7311 mb
[GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385] Compute Capability 5.2
time=2025-05-01T17:11:23.233+02:00 level=INFO source=gpu.go:319 msg="detected OS VRAM overhead" id=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 library=cuda compute=5.2 driver=11.4 name="Quadro M5000" overhead="693.7 MiB"
time=2025-05-01T17:11:23.251+02:00 level=DEBUG source=amd_windows.go:34 msg="unable to load amdhip64_6.dll, please make sure to upgrade to the latest amd driver: Le module spécifié est introuvable."
releasing cuda driver library
releasing nvml library
time=2025-05-01T17:11:23.252+02:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-962a842b-b382-6457-65a1-3cffec62ba6f library=cuda variant=v11 compute=8.6 driver=11.4 name="NVIDIA RTX A2000 12GB" total="12.0 GiB" available="11.0 GiB"
time=2025-05-01T17:11:23.252+02:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 library=cuda variant=v11 compute=5.2 driver=11.4 name="Quadro M5000" total="8.0 GiB" available="7.1 GiB"
[GIN] 2025/05/01 - 17:11:25 | 200 |       610.5µs |       127.0.0.1 | HEAD     "/"
time=2025-05-01T17:11:25.579+02:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-01T17:11:25.616+02:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
[GIN] 2025/05/01 - 17:11:25 | 200 |    105.0104ms |       127.0.0.1 | POST     "/api/show"
time=2025-05-01T17:11:25.836+02:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-01T17:11:25.847+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.8 GiB" before.free_swap="128.2 GiB" now.total="127.9 GiB" now.free="120.8 GiB" now.free_swap="128.1 GiB"
time=2025-05-01T17:11:25.866+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:25.884+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="186.9 MiB"
releasing nvml library
time=2025-05-01T17:11:25.889+02:00 level=DEBUG source=sched.go:183 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=6 gpu_count=2
time=2025-05-01T17:11:25.917+02:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-01T17:11:25.947+02:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-01T17:11:25.959+02:00 level=DEBUG source=sched.go:226 msg="loading first model" model=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
time=2025-05-01T17:11:25.961+02:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[11.0 GiB]"
time=2025-05-01T17:11:25.961+02:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3moe.vision.block_count default=0
time=2025-05-01T17:11:25.962+02:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[7.1 GiB]"
time=2025-05-01T17:11:25.962+02:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3moe.vision.block_count default=0
time=2025-05-01T17:11:25.964+02:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[11.0 GiB]"
time=2025-05-01T17:11:25.964+02:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3moe.vision.block_count default=0
time=2025-05-01T17:11:25.966+02:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[7.1 GiB]"
time=2025-05-01T17:11:25.966+02:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3moe.vision.block_count default=0
time=2025-05-01T17:11:25.967+02:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=2 available="[11.0 GiB 7.1 GiB]"
time=2025-05-01T17:11:25.967+02:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3moe.vision.block_count default=0
time=2025-05-01T17:11:25.968+02:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=2 available="[11.0 GiB 7.1 GiB]"
time=2025-05-01T17:11:25.968+02:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3moe.vision.block_count default=0
time=2025-05-01T17:11:25.969+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.8 GiB" before.free_swap="128.1 GiB" now.total="127.9 GiB" now.free="120.8 GiB" now.free_swap="128.1 GiB"
time=2025-05-01T17:11:25.991+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:26.007+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="186.9 MiB"
releasing nvml library
time=2025-05-01T17:11:26.016+02:00 level=INFO source=server.go:105 msg="system memory" total="127.9 GiB" free="120.8 GiB" free_swap="128.1 GiB"
time=2025-05-01T17:11:26.016+02:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=2 available="[11.0 GiB 7.1 GiB]"
time=2025-05-01T17:11:26.016+02:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3moe.vision.block_count default=0
time=2025-05-01T17:11:26.016+02:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=49 layers.offload=44 layers.split=27,17 memory.available="[11.0 GiB 7.1 GiB]" memory.gpu_overhead="0 B" memory.required.full="19.5 GiB" memory.required.partial="17.8 GiB" memory.required.kv="192.0 MiB" memory.required.allocations="[10.7 GiB 7.1 GiB]" memory.weights.total="17.2 GiB" memory.weights.repeating="16.9 GiB" memory.weights.nonrepeating="243.4 MiB" memory.graph.full="256.0 MiB" memory.graph.partial="256.0 MiB"
time=2025-05-01T17:11:26.017+02:00 level=DEBUG source=server.go:262 msg="compatible gpu libraries" compatible="[cuda_v11 cuda_v12]"
llama_model_loader: loaded meta data with 31 key-value pairs and 579 tensors from C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 30B A3B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3
llama_model_loader: - kv   4:                         general.size_label str              = 30B-A3B
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                       qwen3moe.block_count u32              = 48
llama_model_loader: - kv   7:                    qwen3moe.context_length u32              = 40960
llama_model_loader: - kv   8:                  qwen3moe.embedding_length u32              = 2048
llama_model_loader: - kv   9:               qwen3moe.feed_forward_length u32              = 6144
llama_model_loader: - kv  10:              qwen3moe.attention.head_count u32              = 32
llama_model_loader: - kv  11:           qwen3moe.attention.head_count_kv u32              = 4
llama_model_loader: - kv  12:                    qwen3moe.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                 qwen3moe.expert_used_count u32              = 8
llama_model_loader: - kv  15:              qwen3moe.attention.key_length u32              = 128
llama_model_loader: - kv  16:            qwen3moe.attention.value_length u32              = 128
llama_model_loader: - kv  17:                      qwen3moe.expert_count u32              = 128
llama_model_loader: - kv  18:        qwen3moe.expert_feed_forward_length u32              = 768
llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  22:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  23:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  24:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  25:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  27:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - kv  30:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  241 tensors
llama_model_loader: - type  f16:   48 tensors
llama_model_loader: - type q4_K:  265 tensors
llama_model_loader: - type q6_K:   25 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 17.34 GiB (4.88 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3moe
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 30.53 B
print_info: general.name     = Qwen3 30B A3B
print_info: n_ff_exp         = 0
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-05-01T17:11:26.375+02:00 level=DEBUG source=server.go:335 msg="adding gpu library" path=C:\Users\Testeur1\AppData\Local\Programs\Ollama\lib\ollama\cuda_v11
time=2025-05-01T17:11:26.375+02:00 level=DEBUG source=server.go:343 msg="adding gpu dependency paths" paths=[C:\Users\Testeur1\AppData\Local\Programs\Ollama\lib\ollama\cuda_v11]
time=2025-05-01T17:11:26.375+02:00 level=INFO source=server.go:405 msg="starting llama server" cmd="C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Ollama_Models\\blobs\\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac --ctx-size 2048 --batch-size 512 --n-gpu-layers 44 --verbose --threads 14 --no-mmap --parallel 1 --tensor-split 27,17 --port 49680"
time=2025-05-01T17:11:26.376+02:00 level=DEBUG source=server.go:423 msg=subprocess environment="[CUDA_VISIBLE_DEVICES=GPU-962a842b-b382-6457-65a1-3cffec62ba6f,GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 PATH=C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v11;C:\\Program Files (x86)\\VMware\\VMware Workstation\\bin\\;C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\iCLS\\;C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\iCLS\\;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\Git\\cmd;C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\DAL;C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\DAL;C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\IPT;C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\IPT;C:\\Program Files\\dotnet\\;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Eclipse Adoptium\\jdk-21.0.7.6-hotspot\\bin;C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Python\\Launcher\\;C:\\Users\\Testeur1\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama;C:\\Users\\Testeur1\\AppData\\Local\\GitHubDesktop\\bin;C:\\Users\\Testeur1\\.dotnet\\tools;C:\\Python\\miniforge3;C:\\Python\\miniforge3\\Scripts;;C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v11;C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\lib\\ollama]"
time=2025-05-01T17:11:26.388+02:00 level=INFO source=sched.go:451 msg="loaded runners" count=1
time=2025-05-01T17:11:26.388+02:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
time=2025-05-01T17:11:26.390+02:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
time=2025-05-01T17:11:26.440+02:00 level=INFO source=runner.go:853 msg="starting go runner"
time=2025-05-01T17:11:26.574+02:00 level=DEBUG source=ggml.go:99 msg="ggml backend load all from path" path=C:\Users\Testeur1\AppData\Local\Programs\Ollama\lib\ollama\cuda_v11
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX A2000 12GB, compute capability 8.6, VMM: yes
  Device 1: Quadro M5000, compute capability 5.2, VMM: yes
load_backend: loaded CUDA backend from C:\Users\Testeur1\AppData\Local\Programs\Ollama\lib\ollama\cuda_v11\ggml-cuda.dll
time=2025-05-01T17:11:27.467+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files (x86)\\VMware\\VMware Workstation\\bin"
time=2025-05-01T17:11:27.467+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\iCLS"
time=2025-05-01T17:11:27.467+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\iCLS"
time=2025-05-01T17:11:27.467+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Windows\system32
time=2025-05-01T17:11:27.467+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Windows
time=2025-05-01T17:11:27.467+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Windows\System32\Wbem
time=2025-05-01T17:11:27.467+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Windows\System32\WindowsPowerShell\v1.0
time=2025-05-01T17:11:27.467+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Windows\System32\OpenSSH
time=2025-05-01T17:11:27.467+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files\\Git\\cmd"
time=2025-05-01T17:11:27.467+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\DAL"
time=2025-05-01T17:11:27.467+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\DAL"
time=2025-05-01T17:11:27.467+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\IPT"
time=2025-05-01T17:11:27.467+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\IPT"
time=2025-05-01T17:11:27.467+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files\\dotnet"
time=2025-05-01T17:11:27.467+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common"
time=2025-05-01T17:11:27.467+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Eclipse Adoptium\\jdk-21.0.7.6-hotspot\\bin"
time=2025-05-01T17:11:27.467+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Users\Testeur1\AppData\Local\Programs\Python\Launcher
time=2025-05-01T17:11:27.467+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Users\Testeur1\AppData\Local\Microsoft\WindowsApps
time=2025-05-01T17:11:27.467+02:00 level=DEBUG source=ggml.go:99 msg="ggml backend load all from path" path=C:\Users\Testeur1\AppData\Local\Programs\Ollama
time=2025-05-01T17:11:27.478+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Users\Testeur1\AppData\Local\GitHubDesktop\bin
time=2025-05-01T17:11:27.478+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Users\Testeur1\.dotnet\tools
time=2025-05-01T17:11:27.478+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Python\miniforge3
time=2025-05-01T17:11:27.478+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Python\miniforge3\Scripts
time=2025-05-01T17:11:27.478+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Users\Testeur1
time=2025-05-01T17:11:27.478+02:00 level=DEBUG source=ggml.go:99 msg="ggml backend load all from path" path=C:\Users\Testeur1\AppData\Local\Programs\Ollama\lib\ollama
load_backend: loaded CPU backend from C:\Users\Testeur1\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll
time=2025-05-01T17:11:27.657+02:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-05-01T17:11:27.660+02:00 level=INFO source=runner.go:913 msg="Server listening on 127.0.0.1:49680"
time=2025-05-01T17:11:27.898+02:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA RTX A2000 12GB) - 11204 MiB free
llama_model_load_from_file_impl: using device CUDA1 (Quadro M5000) - 7279 MiB free
llama_model_loader: loaded meta data with 31 key-value pairs and 579 tensors from C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 30B A3B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3
llama_model_loader: - kv   4:                         general.size_label str              = 30B-A3B
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                       qwen3moe.block_count u32              = 48
llama_model_loader: - kv   7:                    qwen3moe.context_length u32              = 40960
llama_model_loader: - kv   8:                  qwen3moe.embedding_length u32              = 2048
llama_model_loader: - kv   9:               qwen3moe.feed_forward_length u32              = 6144
llama_model_loader: - kv  10:              qwen3moe.attention.head_count u32              = 32
llama_model_loader: - kv  11:           qwen3moe.attention.head_count_kv u32              = 4
llama_model_loader: - kv  12:                    qwen3moe.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                 qwen3moe.expert_used_count u32              = 8
llama_model_loader: - kv  15:              qwen3moe.attention.key_length u32              = 128
llama_model_loader: - kv  16:            qwen3moe.attention.value_length u32              = 128
llama_model_loader: - kv  17:                      qwen3moe.expert_count u32              = 128
llama_model_loader: - kv  18:        qwen3moe.expert_feed_forward_length u32              = 768
llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  22:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  23:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  24:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  25:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  27:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - kv  30:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  241 tensors
llama_model_loader: - type  f16:   48 tensors
llama_model_loader: - type q4_K:  265 tensors
llama_model_loader: - type q6_K:   25 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 17.34 GiB (4.88 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3moe
print_info: vocab_only       = 0
print_info: n_ctx_train      = 40960
print_info: n_embd           = 2048
print_info: n_layer          = 48
print_info: n_head           = 32
print_info: n_head_kv        = 4
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 6144
print_info: n_expert         = 128
print_info: n_expert_used    = 8
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 40960
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = ?B
print_info: model params     = 30.53 B
print_info: general.name     = Qwen3 30B A3B
print_info: n_ff_exp         = 768
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: layer   0 assigned to device CPU, is_swa = 0
load_tensors: layer   1 assigned to device CPU, is_swa = 0
load_tensors: layer   2 assigned to device CPU, is_swa = 0
load_tensors: layer   3 assigned to device CPU, is_swa = 0
load_tensors: layer   4 assigned to device CUDA0, is_swa = 0
load_tensors: layer   5 assigned to device CUDA0, is_swa = 0
load_tensors: layer   6 assigned to device CUDA0, is_swa = 0
load_tensors: layer   7 assigned to device CUDA0, is_swa = 0
load_tensors: layer   8 assigned to device CUDA0, is_swa = 0
load_tensors: layer   9 assigned to device CUDA0, is_swa = 0
load_tensors: layer  10 assigned to device CUDA0, is_swa = 0
load_tensors: layer  11 assigned to device CUDA0, is_swa = 0
load_tensors: layer  12 assigned to device CUDA0, is_swa = 0
load_tensors: layer  13 assigned to device CUDA0, is_swa = 0
load_tensors: layer  14 assigned to device CUDA0, is_swa = 0
load_tensors: layer  15 assigned to device CUDA0, is_swa = 0
load_tensors: layer  16 assigned to device CUDA0, is_swa = 0
load_tensors: layer  17 assigned to device CUDA0, is_swa = 0
load_tensors: layer  18 assigned to device CUDA0, is_swa = 0
load_tensors: layer  19 assigned to device CUDA0, is_swa = 0
load_tensors: layer  20 assigned to device CUDA0, is_swa = 0
load_tensors: layer  21 assigned to device CUDA0, is_swa = 0
load_tensors: layer  22 assigned to device CUDA0, is_swa = 0
load_tensors: layer  23 assigned to device CUDA0, is_swa = 0
load_tensors: layer  24 assigned to device CUDA0, is_swa = 0
load_tensors: layer  25 assigned to device CUDA0, is_swa = 0
load_tensors: layer  26 assigned to device CUDA0, is_swa = 0
load_tensors: layer  27 assigned to device CUDA0, is_swa = 0
load_tensors: layer  28 assigned to device CUDA0, is_swa = 0
load_tensors: layer  29 assigned to device CUDA0, is_swa = 0
load_tensors: layer  30 assigned to device CUDA0, is_swa = 0
load_tensors: layer  31 assigned to device CUDA1, is_swa = 0
load_tensors: layer  32 assigned to device CUDA1, is_swa = 0
load_tensors: layer  33 assigned to device CUDA1, is_swa = 0
load_tensors: layer  34 assigned to device CUDA1, is_swa = 0
load_tensors: layer  35 assigned to device CUDA1, is_swa = 0
load_tensors: layer  36 assigned to device CUDA1, is_swa = 0
load_tensors: layer  37 assigned to device CUDA1, is_swa = 0
load_tensors: layer  38 assigned to device CUDA1, is_swa = 0
load_tensors: layer  39 assigned to device CUDA1, is_swa = 0
load_tensors: layer  40 assigned to device CUDA1, is_swa = 0
load_tensors: layer  41 assigned to device CUDA1, is_swa = 0
load_tensors: layer  42 assigned to device CUDA1, is_swa = 0
load_tensors: layer  43 assigned to device CUDA1, is_swa = 0
load_tensors: layer  44 assigned to device CUDA1, is_swa = 0
load_tensors: layer  45 assigned to device CUDA1, is_swa = 0
load_tensors: layer  46 assigned to device CUDA1, is_swa = 0
load_tensors: layer  47 assigned to device CUDA1, is_swa = 0
load_tensors: layer  48 assigned to device CPU, is_swa = 0
load_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
load_tensors: offloading 44 repeating layers to GPU
load_tensors: offloaded 44/49 layers to GPU
load_tensors:          CPU model buffer size =   166.92 MiB
load_tensors:    CUDA_Host model buffer size =  1787.75 MiB
load_tensors:        CUDA0 model buffer size =  9582.64 MiB
load_tensors:        CUDA1 model buffer size =  6216.85 MiB
load_all_data: no device found for buffer type CPU for async uploads
load_all_data: buffer type CUDA_Host is not the default buffer type for device CUDA0 for async uploads
time=2025-05-01T17:11:30.902+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.02"
time=2025-05-01T17:11:31.153+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.04"
time=2025-05-01T17:11:31.403+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.06"
time=2025-05-01T17:11:31.653+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.07"
time=2025-05-01T17:11:31.904+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.10"
load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0
time=2025-05-01T17:11:32.154+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.12"
time=2025-05-01T17:11:32.404+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.14"
time=2025-05-01T17:11:32.654+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.15"
time=2025-05-01T17:11:32.905+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.18"
time=2025-05-01T17:11:33.155+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.20"
time=2025-05-01T17:11:33.406+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.22"
time=2025-05-01T17:11:33.656+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.24"
time=2025-05-01T17:11:33.907+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.26"
time=2025-05-01T17:11:34.157+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.28"
time=2025-05-01T17:11:34.407+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.30"
time=2025-05-01T17:11:34.657+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.32"
time=2025-05-01T17:11:34.908+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.34"
time=2025-05-01T17:11:35.158+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.36"
time=2025-05-01T17:11:35.409+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.38"
time=2025-05-01T17:11:35.659+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.41"
time=2025-05-01T17:11:35.910+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.42"
time=2025-05-01T17:11:36.161+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.45"
time=2025-05-01T17:11:36.411+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.46"
time=2025-05-01T17:11:36.662+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.48"
time=2025-05-01T17:11:36.912+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.51"
time=2025-05-01T17:11:37.162+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.52"
time=2025-05-01T17:11:37.413+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.54"
time=2025-05-01T17:11:37.663+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.57"
time=2025-05-01T17:11:37.913+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.58"
time=2025-05-01T17:11:38.164+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.61"
time=2025-05-01T17:11:38.415+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.62"
load_all_data: using async uploads for device CUDA1, buffer type CUDA1, backend CUDA1
CUDA error: operation not supported
  current device: 1, in function ggml_backend_cuda_event_record at C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\ggml-cuda.cu:2839
  cudaEventRecord((cudaEvent_t)event->context, cuda_ctx->stream())
C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\ggml-cuda.cu:75: CUDA error
time=2025-05-01T17:11:38.866+02:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server not responding"
time=2025-05-01T17:11:39.483+02:00 level=ERROR source=server.go:449 msg="llama runner terminated" error="exit status 0xc0000409"
time=2025-05-01T17:11:39.567+02:00 level=ERROR source=sched.go:457 msg="error loading llama server" error="llama runner process has terminated: CUDA error"
time=2025-05-01T17:11:39.567+02:00 level=DEBUG source=sched.go:460 msg="triggering expiration for failed load" model=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
time=2025-05-01T17:11:39.567+02:00 level=DEBUG source=sched.go:362 msg="runner expired event received" modelPath=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
time=2025-05-01T17:11:39.568+02:00 level=DEBUG source=sched.go:377 msg="got lock to unload" modelPath=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
[GIN] 2025/05/01 - 17:11:39 | 500 |   13.7899546s |       127.0.0.1 | POST     "/api/generate"
time=2025-05-01T17:11:39.568+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.8 GiB" before.free_swap="128.1 GiB" now.total="127.9 GiB" now.free="120.6 GiB" now.free_swap="127.9 GiB"
time=2025-05-01T17:11:39.589+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:39.609+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="219.4 MiB"
releasing nvml library
time=2025-05-01T17:11:39.636+02:00 level=DEBUG source=server.go:1001 msg="stopping llama server"
time=2025-05-01T17:11:39.636+02:00 level=DEBUG source=sched.go:382 msg="runner released" modelPath=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
time=2025-05-01T17:11:39.864+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.6 GiB" before.free_swap="127.9 GiB" now.total="127.9 GiB" now.free="120.6 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:39.880+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:39.895+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:40.113+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.6 GiB" before.free_swap="128.0 GiB" now.total="127.9 GiB" now.free="120.7 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:40.130+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:40.144+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:40.364+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.7 GiB" before.free_swap="128.0 GiB" now.total="127.9 GiB" now.free="120.7 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:40.382+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:40.398+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:40.613+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.7 GiB" before.free_swap="128.0 GiB" now.total="127.9 GiB" now.free="120.7 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:40.643+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:40.653+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:40.863+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.7 GiB" before.free_swap="128.0 GiB" now.total="127.9 GiB" now.free="120.7 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:40.889+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:40.908+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:41.113+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.7 GiB" before.free_swap="128.0 GiB" now.total="127.9 GiB" now.free="120.7 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:41.140+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:41.156+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:41.364+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.7 GiB" before.free_swap="128.0 GiB" now.total="127.9 GiB" now.free="120.7 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:41.380+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:41.395+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:41.613+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.7 GiB" before.free_swap="128.0 GiB" now.total="127.9 GiB" now.free="120.7 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:41.639+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:41.655+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:41.864+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.7 GiB" before.free_swap="128.0 GiB" now.total="127.9 GiB" now.free="120.7 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:41.888+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:41.908+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:42.113+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.7 GiB" before.free_swap="128.0 GiB" now.total="127.9 GiB" now.free="120.7 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:42.131+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:42.143+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:42.364+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.7 GiB" before.free_swap="128.0 GiB" now.total="127.9 GiB" now.free="120.7 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:42.381+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:42.397+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:42.613+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.7 GiB" before.free_swap="128.0 GiB" now.total="127.9 GiB" now.free="120.7 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:42.641+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:42.657+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:42.864+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.7 GiB" before.free_swap="128.0 GiB" now.total="127.9 GiB" now.free="120.7 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:42.889+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:42.905+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:43.113+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.7 GiB" before.free_swap="128.0 GiB" now.total="127.9 GiB" now.free="120.7 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:43.129+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:43.145+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:43.363+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.7 GiB" before.free_swap="128.0 GiB" now.total="127.9 GiB" now.free="120.7 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:43.389+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:43.405+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:43.613+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.7 GiB" before.free_swap="128.0 GiB" now.total="127.9 GiB" now.free="120.7 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:43.638+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:43.654+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:43.863+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.7 GiB" before.free_swap="128.0 GiB" now.total="127.9 GiB" now.free="120.7 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:43.891+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:43.907+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:44.113+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.7 GiB" before.free_swap="128.0 GiB" now.total="127.9 GiB" now.free="120.7 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:44.130+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:44.154+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:44.364+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.7 GiB" before.free_swap="128.0 GiB" now.total="127.9 GiB" now.free="120.7 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:44.384+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:44.403+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:44.613+02:00 level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.0455091 model=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
time=2025-05-01T17:11:44.613+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.7 GiB" before.free_swap="128.0 GiB" now.total="127.9 GiB" now.free="120.7 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:44.613+02:00 level=DEBUG source=sched.go:386 msg="sending an unloaded event" modelPath=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
time=2025-05-01T17:11:44.614+02:00 level=DEBUG source=sched.go:310 msg="ignoring unload event with no pending requests"
time=2025-05-01T17:11:44.638+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:44.654+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:44.864+02:00 level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.2955851 model=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
time=2025-05-01T17:11:44.864+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.7 GiB" before.free_swap="128.0 GiB" now.total="127.9 GiB" now.free="120.7 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:44.881+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:44.896+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:45.113+02:00 level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.5452212 model=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
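
The crash above originates in `ggml_backend_cuda_event_record` at ggml-cuda.cu:2839: `cudaEventRecord` returns "operation not supported" on device 1 (the compute 5.2 Quadro M5000) right as the async-upload path for CUDA1 starts, and the runner then dies with exit status 0xc0000409. A standalone probe along the lines of the sketch below can show whether plain event recording works on each visible device; it is illustrative only (not ollama code), and it assumes ggml creates its synchronization events with the `cudaEventDisableTiming` flag.

```
// probe_event_record.cu -- hypothetical standalone check, not part of ollama.
// Mirrors the failing call shown in the log above:
//   cudaEventRecord((cudaEvent_t)event->context, cuda_ctx->stream())
// Build with: nvcc probe_event_record.cu -o probe_event_record
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::fprintf(stderr, "no CUDA devices visible\n");
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaSetDevice(dev);
        cudaStream_t stream = nullptr;
        cudaEvent_t  event  = nullptr;
        cudaStreamCreate(&stream);
        // Assumption: ggml uses cudaEventDisableTiming for its sync events;
        // use the same flag here.
        cudaEventCreateWithFlags(&event, cudaEventDisableTiming);
        cudaError_t err = cudaEventRecord(event, stream);  // the call that fails on CUDA1
        cudaStreamSynchronize(stream);
        std::printf("device %d: cudaEventRecord -> %s\n", dev, cudaGetErrorString(err));
        cudaEventDestroy(event);
        cudaStreamDestroy(stream);
    }
    return 0;
}
```

If the probe fails only on the M5000, restricting Ollama to a single GPU (for example setting CUDA_VISIBLE_DEVICES to just the RTX A2000) would be one way to confirm that the multi-GPU event synchronization path is what kills the load; the card swap tried in the comment below points in the same direction.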

<!-- gh-comment-id:2845050166 --> @neoxeo commented on GitHub (May 1, 2025):

I have swapped the 2 cards and now I have new errors when I run this:

```
PS C:\> $env:OLLAMA_DEBUG="1"; $env:CUDA_VISIBLE_DEVICES="0,1" ; & "ollama app.exe"
PS C:\> ollama run qwen3:30b-a3b --verbose
```

Log:

```
2025/05/01 17:11:22 routes.go:1232: INFO server config env="map[CUDA_VISIBLE_DEVICES:0,1 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\\Ollama_Models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-05-01T17:11:22.567+02:00 level=INFO source=images.go:458 msg="total blobs: 52"
time=2025-05-01T17:11:22.585+02:00 level=INFO source=images.go:465 msg="total unused blobs removed: 0"
time=2025-05-01T17:11:22.615+02:00 level=INFO source=routes.go:1299 msg="Listening on 127.0.0.1:11434 (version 0.6.6)"
time=2025-05-01T17:11:22.616+02:00 level=DEBUG source=sched.go:107 msg="starting llm scheduler"
time=2025-05-01T17:11:22.616+02:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-05-01T17:11:22.617+02:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-05-01T17:11:22.618+02:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=14 efficiency=0 threads=28
time=2025-05-01T17:11:22.618+02:00 level=DEBUG source=gpu.go:98 msg="searching for GPU discovery libraries for NVIDIA"
time=2025-05-01T17:11:22.618+02:00 level=DEBUG source=gpu.go:501 msg="Searching for GPU library" name=nvml.dll
time=2025-05-01T17:11:22.618+02:00 level=DEBUG source=gpu.go:525 msg="gpu library search" globs="[C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\nvml.dll C:\\Program Files (x86)\\VMware\\VMware Workstation\\bin\\nvml.dll C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\iCLS\\nvml.dll C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\iCLS\\nvml.dll C:\\Windows\\system32\\nvml.dll C:\\Windows\\nvml.dll C:\\Windows\\System32\\Wbem\\nvml.dll C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\nvml.dll C:\\Windows\\System32\\OpenSSH\\nvml.dll C:\\Program Files\\Git\\cmd\\nvml.dll C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\DAL\\nvml.dll C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\DAL\\nvml.dll C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\IPT\\nvml.dll C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\IPT\\nvml.dll C:\\Program Files\\dotnet\\nvml.dll C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvml.dll C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Eclipse Adoptium\\jdk-21.0.7.6-hotspot\\bin\\nvml.dll C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Python\\Launcher\\nvml.dll C:\\Users\\Testeur1\\AppData\\Local\\Microsoft\\WindowsApps\\nvml.dll C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\nvml.dll C:\\Users\\Testeur1\\AppData\\Local\\GitHubDesktop\\bin\\nvml.dll C:\\Users\\Testeur1\\.dotnet\\tools\\nvml.dll C:\\Python\\miniforge3\\nvml.dll C:\\Python\\miniforge3\\Scripts\\nvml.dll C:\\Users\\Testeur1\\nvml.dll c:\\Windows\\System32\\nvml.dll]"
time=2025-05-01T17:11:22.626+02:00 level=DEBUG source=gpu.go:529 msg="skipping PhysX cuda library path" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvml.dll"
time=2025-05-01T17:11:22.629+02:00 level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths="[C:\\Windows\\system32\\nvml.dll c:\\Windows\\System32\\nvml.dll]"
time=2025-05-01T17:11:22.650+02:00 level=DEBUG source=gpu.go:111 msg="nvidia-ml loaded" library=C:\Windows\system32\nvml.dll
time=2025-05-01T17:11:22.650+02:00 level=DEBUG source=gpu.go:501 msg="Searching for GPU library" name=nvcuda.dll
time=2025-05-01T17:11:22.661+02:00 level=DEBUG source=gpu.go:525 msg="gpu library search" globs="[C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\nvcuda.dll C:\\Program Files (x86)\\VMware\\VMware Workstation\\bin\\nvcuda.dll C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\iCLS\\nvcuda.dll C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\iCLS\\nvcuda.dll C:\\Windows\\system32\\nvcuda.dll C:\\Windows\\nvcuda.dll C:\\Windows\\System32\\Wbem\\nvcuda.dll C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\nvcuda.dll C:\\Windows\\System32\\OpenSSH\\nvcuda.dll C:\\Program Files\\Git\\cmd\\nvcuda.dll C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\DAL\\nvcuda.dll C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\DAL\\nvcuda.dll C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\IPT\\nvcuda.dll C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\IPT\\nvcuda.dll C:\\Program Files\\dotnet\\nvcuda.dll C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvcuda.dll C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Eclipse Adoptium\\jdk-21.0.7.6-hotspot\\bin\\nvcuda.dll C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Python\\Launcher\\nvcuda.dll C:\\Users\\Testeur1\\AppData\\Local\\Microsoft\\WindowsApps\\nvcuda.dll C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\nvcuda.dll C:\\Users\\Testeur1\\AppData\\Local\\GitHubDesktop\\bin\\nvcuda.dll C:\\Users\\Testeur1\\.dotnet\\tools\\nvcuda.dll C:\\Python\\miniforge3\\nvcuda.dll C:\\Python\\miniforge3\\Scripts\\nvcuda.dll C:\\Users\\Testeur1\\nvcuda.dll c:\\windows\\system*\\nvcuda.dll]"
time=2025-05-01T17:11:22.668+02:00 level=DEBUG source=gpu.go:529 msg="skipping PhysX cuda library path" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvcuda.dll"
time=2025-05-01T17:11:22.673+02:00 level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths=[C:\Windows\system32\nvcuda.dll]
initializing C:\Windows\system32\nvcuda.dll
dlsym: cuInit - 00007FFCFC396590
dlsym: cuDriverGetVersion - 00007FFCFC396640
dlsym: cuDeviceGetCount - 00007FFCFC3A4416
dlsym: cuDeviceGet - 00007FFCFC3A4410
dlsym: cuDeviceGetAttribute - 00007FFCFC3967B0
dlsym: cuDeviceGetUuid - 00007FFCFC3A4422
dlsym: cuDeviceGetName - 00007FFCFC3A441C
dlsym: cuCtxCreate_v3 - 00007FFCFC3A4494
dlsym: cuMemGetInfo_v2 - 00007FFCFC3A4578
dlsym: cuCtxDestroy - 00007FFCFC3A44A0
calling cuInit
calling cuDriverGetVersion
raw version 0x2b20
CUDA driver version: 11.4
calling cuDeviceGetCount
device count 2
time=2025-05-01T17:11:22.887+02:00 level=DEBUG source=gpu.go:125 msg="detected GPUs" count=2 library=C:\Windows\system32\nvcuda.dll
[GPU-962a842b-b382-6457-65a1-3cffec62ba6f] CUDA totalMem 12282 mb
[GPU-962a842b-b382-6457-65a1-3cffec62ba6f] CUDA freeMem 11282 mb
[GPU-962a842b-b382-6457-65a1-3cffec62ba6f] Compute Capability 8.6
time=2025-05-01T17:11:23.091+02:00 level=INFO source=gpu.go:319 msg="detected OS VRAM overhead" id=GPU-962a842b-b382-6457-65a1-3cffec62ba6f library=cuda compute=8.6 driver=11.4 name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB"
[GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385] CUDA totalMem 8192 mb
[GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385] CUDA freeMem 7311 mb
[GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385] Compute Capability 5.2
time=2025-05-01T17:11:23.233+02:00 level=INFO source=gpu.go:319 msg="detected OS VRAM overhead" id=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 library=cuda compute=5.2 driver=11.4 name="Quadro M5000" overhead="693.7 MiB"
time=2025-05-01T17:11:23.251+02:00 level=DEBUG source=amd_windows.go:34 msg="unable to load amdhip64_6.dll, please make sure to upgrade to the latest amd driver: Le module spécifié est introuvable."
releasing cuda driver library
releasing nvml library
time=2025-05-01T17:11:23.252+02:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-962a842b-b382-6457-65a1-3cffec62ba6f library=cuda variant=v11 compute=8.6 driver=11.4 name="NVIDIA RTX A2000 12GB" total="12.0 GiB" available="11.0 GiB"
time=2025-05-01T17:11:23.252+02:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 library=cuda variant=v11 compute=5.2 driver=11.4 name="Quadro M5000" total="8.0 GiB" available="7.1 GiB"
[GIN] 2025/05/01 - 17:11:25 | 200 | 610.5µs | 127.0.0.1 | HEAD "/"
time=2025-05-01T17:11:25.579+02:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-01T17:11:25.616+02:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
[GIN] 2025/05/01 - 17:11:25 | 200 | 105.0104ms | 127.0.0.1 | POST "/api/show"
time=2025-05-01T17:11:25.836+02:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-01T17:11:25.847+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.8 GiB" before.free_swap="128.2 GiB" now.total="127.9 GiB" now.free="120.8 GiB" now.free_swap="128.1 GiB"
time=2025-05-01T17:11:25.866+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:25.884+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="186.9 MiB"
releasing nvml library
time=2025-05-01T17:11:25.889+02:00 level=DEBUG source=sched.go:183 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=6 gpu_count=2
time=2025-05-01T17:11:25.917+02:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-01T17:11:25.947+02:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-01T17:11:25.959+02:00 level=DEBUG source=sched.go:226 msg="loading first model" model=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
time=2025-05-01T17:11:25.961+02:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[11.0 GiB]"
time=2025-05-01T17:11:25.961+02:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3moe.vision.block_count default=0
time=2025-05-01T17:11:25.962+02:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[7.1 GiB]"
time=2025-05-01T17:11:25.962+02:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3moe.vision.block_count default=0
time=2025-05-01T17:11:25.964+02:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[11.0 GiB]"
time=2025-05-01T17:11:25.964+02:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3moe.vision.block_count default=0
time=2025-05-01T17:11:25.966+02:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[7.1 GiB]"
time=2025-05-01T17:11:25.966+02:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3moe.vision.block_count default=0
time=2025-05-01T17:11:25.967+02:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=2 available="[11.0 GiB 7.1 GiB]"
time=2025-05-01T17:11:25.967+02:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3moe.vision.block_count default=0
time=2025-05-01T17:11:25.968+02:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=2 available="[11.0 GiB 7.1 GiB]"
time=2025-05-01T17:11:25.968+02:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3moe.vision.block_count default=0
time=2025-05-01T17:11:25.969+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.8 GiB" before.free_swap="128.1 GiB" now.total="127.9 GiB" now.free="120.8 GiB" now.free_swap="128.1 GiB"
time=2025-05-01T17:11:25.991+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:26.007+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="186.9 MiB"
releasing nvml library
time=2025-05-01T17:11:26.016+02:00 level=INFO source=server.go:105 msg="system memory" total="127.9 GiB" free="120.8 GiB" free_swap="128.1 GiB"
time=2025-05-01T17:11:26.016+02:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=2 available="[11.0 GiB 7.1 GiB]"
time=2025-05-01T17:11:26.016+02:00 level=WARN source=ggml.go:152 msg="key not found" key=qwen3moe.vision.block_count default=0
time=2025-05-01T17:11:26.016+02:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=49 layers.offload=44 layers.split=27,17 memory.available="[11.0 GiB 7.1 GiB]" memory.gpu_overhead="0 B" memory.required.full="19.5 GiB" memory.required.partial="17.8 GiB" memory.required.kv="192.0 MiB" memory.required.allocations="[10.7 GiB 7.1 GiB]" memory.weights.total="17.2 GiB" memory.weights.repeating="16.9 GiB" memory.weights.nonrepeating="243.4 MiB" memory.graph.full="256.0 MiB" memory.graph.partial="256.0 MiB"
time=2025-05-01T17:11:26.017+02:00 level=DEBUG source=server.go:262 msg="compatible gpu libraries" compatible="[cuda_v11 cuda_v12]"
llama_model_loader: loaded meta data with 31 key-value pairs and 579 tensors from C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3 30B A3B
llama_model_loader: - kv 3: general.basename str = Qwen3
llama_model_loader: - kv 4: general.size_label str = 30B-A3B
llama_model_loader: - kv 5: general.license str = apache-2.0
llama_model_loader: - kv 6: qwen3moe.block_count u32 = 48
llama_model_loader: - kv 7: qwen3moe.context_length u32 = 40960
llama_model_loader: - kv 8: qwen3moe.embedding_length u32 = 2048
llama_model_loader: - kv 9: qwen3moe.feed_forward_length u32 = 6144
llama_model_loader: - kv 10: qwen3moe.attention.head_count u32 = 32
llama_model_loader: - kv 11: qwen3moe.attention.head_count_kv u32 = 4
llama_model_loader: - kv 12: qwen3moe.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 13: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: qwen3moe.expert_used_count u32 = 8
llama_model_loader: - kv 15: qwen3moe.attention.key_length u32 = 128
llama_model_loader: - kv 16: qwen3moe.attention.value_length u32 = 128
llama_model_loader: - kv 17: qwen3moe.expert_count u32 = 128
llama_model_loader: - kv 18: qwen3moe.expert_feed_forward_length u32 = 768
llama_model_loader: - kv 19: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 20: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 21: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 22: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 23: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 24: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 25: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 27: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 28: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 29: general.quantization_version u32 = 2
llama_model_loader: - kv 30: general.file_type u32 = 15
llama_model_loader: - type f32: 241 tensors
llama_model_loader: - type f16: 48 tensors
llama_model_loader: - type q4_K: 265 tensors
llama_model_loader: - type q6_K: 25 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 17.34 GiB (4.88 BPW)
init_tokenizer: initializing tokenizer for type 2
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3moe
print_info: vocab_only = 1
print_info: model type = ?B
print_info: model params = 30.53 B
print_info: general.name = Qwen3 30B A3B
print_info: n_ff_exp = 0
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-05-01T17:11:26.375+02:00 level=DEBUG source=server.go:335 msg="adding gpu library" path=C:\Users\Testeur1\AppData\Local\Programs\Ollama\lib\ollama\cuda_v11
time=2025-05-01T17:11:26.375+02:00 level=DEBUG source=server.go:343 msg="adding gpu dependency paths" paths=[C:\Users\Testeur1\AppData\Local\Programs\Ollama\lib\ollama\cuda_v11]
time=2025-05-01T17:11:26.375+02:00 level=INFO source=server.go:405 msg="starting llama server" cmd="C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Ollama_Models\\blobs\\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac --ctx-size 2048 --batch-size 512 --n-gpu-layers 44 --verbose --threads 14 --no-mmap --parallel 1 --tensor-split 27,17 --port 49680"
time=2025-05-01T17:11:26.376+02:00 level=DEBUG source=server.go:423 msg=subprocess environment="[CUDA_VISIBLE_DEVICES=GPU-962a842b-b382-6457-65a1-3cffec62ba6f,GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 PATH=C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v11;C:\\Program Files (x86)\\VMware\\VMware Workstation\\bin\\;C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\iCLS\\;C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\iCLS\\;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\Git\\cmd;C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\DAL;C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\DAL;C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\IPT;C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\IPT;C:\\Program Files\\dotnet\\;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Eclipse Adoptium\\jdk-21.0.7.6-hotspot\\bin;C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Python\\Launcher\\;C:\\Users\\Testeur1\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama;C:\\Users\\Testeur1\\AppData\\Local\\GitHubDesktop\\bin;C:\\Users\\Testeur1\\.dotnet\\tools;C:\\Python\\miniforge3;C:\\Python\\miniforge3\\Scripts;;C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v11;C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Ollama\\lib\\ollama]"
time=2025-05-01T17:11:26.388+02:00 level=INFO source=sched.go:451 msg="loaded runners" count=1
time=2025-05-01T17:11:26.388+02:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
time=2025-05-01T17:11:26.390+02:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
time=2025-05-01T17:11:26.440+02:00 level=INFO source=runner.go:853 msg="starting go runner"
time=2025-05-01T17:11:26.574+02:00 level=DEBUG source=ggml.go:99 msg="ggml backend load all from path" path=C:\Users\Testeur1\AppData\Local\Programs\Ollama\lib\ollama\cuda_v11
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX A2000 12GB, compute capability 8.6, VMM: yes
  Device 1: Quadro M5000, compute capability 5.2, VMM: yes
load_backend: loaded CUDA backend from C:\Users\Testeur1\AppData\Local\Programs\Ollama\lib\ollama\cuda_v11\ggml-cuda.dll
time=2025-05-01T17:11:27.467+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files (x86)\\VMware\\VMware Workstation\\bin"
time=2025-05-01T17:11:27.467+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\iCLS"
time=2025-05-01T17:11:27.467+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\iCLS"
time=2025-05-01T17:11:27.467+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Windows\system32
time=2025-05-01T17:11:27.467+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Windows
time=2025-05-01T17:11:27.467+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Windows\System32\Wbem
time=2025-05-01T17:11:27.467+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Windows\System32\WindowsPowerShell\v1.0
time=2025-05-01T17:11:27.467+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Windows\System32\OpenSSH
time=2025-05-01T17:11:27.467+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files\\Git\\cmd"
time=2025-05-01T17:11:27.467+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\DAL"
time=2025-05-01T17:11:27.467+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\DAL"
time=2025-05-01T17:11:27.467+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\IPT"
time=2025-05-01T17:11:27.467+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\IPT"
time=2025-05-01T17:11:27.467+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files\\dotnet"
time=2025-05-01T17:11:27.467+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common"
time=2025-05-01T17:11:27.467+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path="C:\\Users\\Testeur1\\AppData\\Local\\Programs\\Eclipse Adoptium\\jdk-21.0.7.6-hotspot\\bin"
time=2025-05-01T17:11:27.467+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Users\Testeur1\AppData\Local\Programs\Python\Launcher
time=2025-05-01T17:11:27.467+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Users\Testeur1\AppData\Local\Microsoft\WindowsApps
time=2025-05-01T17:11:27.467+02:00 level=DEBUG source=ggml.go:99 msg="ggml backend load all from path" path=C:\Users\Testeur1\AppData\Local\Programs\Ollama
time=2025-05-01T17:11:27.478+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Users\Testeur1\AppData\Local\GitHubDesktop\bin
time=2025-05-01T17:11:27.478+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Users\Testeur1\.dotnet\tools
time=2025-05-01T17:11:27.478+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Python\miniforge3
time=2025-05-01T17:11:27.478+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Python\miniforge3\Scripts
time=2025-05-01T17:11:27.478+02:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=C:\Users\Testeur1
time=2025-05-01T17:11:27.478+02:00 level=DEBUG source=ggml.go:99 msg="ggml backend load all from path" path=C:\Users\Testeur1\AppData\Local\Programs\Ollama\lib\ollama
load_backend: loaded CPU backend from C:\Users\Testeur1\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll
time=2025-05-01T17:11:27.657+02:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-05-01T17:11:27.660+02:00 level=INFO source=runner.go:913 msg="Server listening on 127.0.0.1:49680"
time=2025-05-01T17:11:27.898+02:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA RTX A2000 12GB) - 11204 MiB free
llama_model_load_from_file_impl: using device CUDA1 (Quadro M5000) - 7279 MiB free
llama_model_loader: loaded meta data with 31 key-value pairs and 579 tensors from C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3 30B A3B
llama_model_loader: - kv 3: general.basename str = Qwen3
llama_model_loader: - kv 4: general.size_label str = 30B-A3B
llama_model_loader: - kv 5: general.license str = apache-2.0
llama_model_loader: - kv 6: qwen3moe.block_count u32 = 48
llama_model_loader: - kv 7: qwen3moe.context_length u32 = 40960
llama_model_loader: - kv 8: qwen3moe.embedding_length u32 = 2048
llama_model_loader: - kv 9: qwen3moe.feed_forward_length u32 = 6144
llama_model_loader: - kv 10: qwen3moe.attention.head_count u32 = 32
llama_model_loader: - kv 11: qwen3moe.attention.head_count_kv u32 = 4
llama_model_loader: - kv 12: qwen3moe.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 13: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: qwen3moe.expert_used_count u32 = 8
llama_model_loader: - kv 15: qwen3moe.attention.key_length u32 = 128
llama_model_loader: - kv 16: qwen3moe.attention.value_length u32 = 128
llama_model_loader: - kv 17: qwen3moe.expert_count u32 = 128
llama_model_loader: - kv 18: qwen3moe.expert_feed_forward_length u32 = 768
llama_model_loader: - kv 19: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 20: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 21: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 22: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 23: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 24: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 25: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 27: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 28: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 29: general.quantization_version u32 = 2
llama_model_loader: - kv 30: general.file_type u32 = 15
llama_model_loader: - type f32: 241 tensors
llama_model_loader: - type f16: 48 tensors
llama_model_loader: - type q4_K: 265 tensors
llama_model_loader: - type q6_K: 25 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 17.34 GiB (4.88 BPW)
init_tokenizer: initializing tokenizer for type 2
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3moe
print_info: vocab_only = 0
print_info: n_ctx_train = 40960
print_info: n_embd = 2048
print_info: n_layer = 48
print_info: n_head = 32
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_swa_pattern = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 6144
print_info: n_expert = 128
print_info: n_expert_used = 8
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 40960
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = ?B
print_info: model params = 30.53 B
print_info: general.name = Qwen3 30B A3B
print_info: n_ff_exp = 768
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token
= 151663 '<|repo_name|>' print_info: EOG token = 151664 '<|file_sep|>' print_info: max token length = 256 load_tensors: loading model tensors, this can take a while... (mmap = false) load_tensors: layer 0 assigned to device CPU, is_swa = 0 load_tensors: layer 1 assigned to device CPU, is_swa = 0 load_tensors: layer 2 assigned to device CPU, is_swa = 0 load_tensors: layer 3 assigned to device CPU, is_swa = 0 load_tensors: layer 4 assigned to device CUDA0, is_swa = 0 load_tensors: layer 5 assigned to device CUDA0, is_swa = 0 load_tensors: layer 6 assigned to device CUDA0, is_swa = 0 load_tensors: layer 7 assigned to device CUDA0, is_swa = 0 load_tensors: layer 8 assigned to device CUDA0, is_swa = 0 load_tensors: layer 9 assigned to device CUDA0, is_swa = 0 load_tensors: layer 10 assigned to device CUDA0, is_swa = 0 load_tensors: layer 11 assigned to device CUDA0, is_swa = 0 load_tensors: layer 12 assigned to device CUDA0, is_swa = 0 load_tensors: layer 13 assigned to device CUDA0, is_swa = 0 load_tensors: layer 14 assigned to device CUDA0, is_swa = 0 load_tensors: layer 15 assigned to device CUDA0, is_swa = 0 load_tensors: layer 16 assigned to device CUDA0, is_swa = 0 load_tensors: layer 17 assigned to device CUDA0, is_swa = 0 load_tensors: layer 18 assigned to device CUDA0, is_swa = 0 load_tensors: layer 19 assigned to device CUDA0, is_swa = 0 load_tensors: layer 20 assigned to device CUDA0, is_swa = 0 load_tensors: layer 21 assigned to device CUDA0, is_swa = 0 load_tensors: layer 22 assigned to device CUDA0, is_swa = 0 load_tensors: layer 23 assigned to device CUDA0, is_swa = 0 load_tensors: layer 24 assigned to device CUDA0, is_swa = 0 load_tensors: layer 25 assigned to device CUDA0, is_swa = 0 load_tensors: layer 26 assigned to device CUDA0, is_swa = 0 load_tensors: layer 27 assigned to device CUDA0, is_swa = 0 load_tensors: layer 28 assigned to device CUDA0, is_swa = 0 load_tensors: layer 29 assigned to device CUDA0, is_swa = 0 load_tensors: layer 30 assigned to device CUDA0, is_swa = 0 load_tensors: layer 31 assigned to device CUDA1, is_swa = 0 load_tensors: layer 32 assigned to device CUDA1, is_swa = 0 load_tensors: layer 33 assigned to device CUDA1, is_swa = 0 load_tensors: layer 34 assigned to device CUDA1, is_swa = 0 load_tensors: layer 35 assigned to device CUDA1, is_swa = 0 load_tensors: layer 36 assigned to device CUDA1, is_swa = 0 load_tensors: layer 37 assigned to device CUDA1, is_swa = 0 load_tensors: layer 38 assigned to device CUDA1, is_swa = 0 load_tensors: layer 39 assigned to device CUDA1, is_swa = 0 load_tensors: layer 40 assigned to device CUDA1, is_swa = 0 load_tensors: layer 41 assigned to device CUDA1, is_swa = 0 load_tensors: layer 42 assigned to device CUDA1, is_swa = 0 load_tensors: layer 43 assigned to device CUDA1, is_swa = 0 load_tensors: layer 44 assigned to device CUDA1, is_swa = 0 load_tensors: layer 45 assigned to device CUDA1, is_swa = 0 load_tensors: layer 46 assigned to device CUDA1, is_swa = 0 load_tensors: layer 47 assigned to device CUDA1, is_swa = 0 load_tensors: layer 48 assigned to device CPU, is_swa = 0 load_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead load_tensors: offloading 44 repeating layers to GPU load_tensors: offloaded 44/49 layers to GPU load_tensors: CPU model buffer size = 166.92 MiB load_tensors: CUDA_Host model buffer size = 1787.75 MiB load_tensors: CUDA0 model buffer size = 9582.64 MiB load_tensors: CUDA1 model buffer size = 6216.85 MiB 
load_all_data: no device found for buffer type CPU for async uploads
load_all_data: buffer type CUDA_Host is not the default buffer type for device CUDA0 for async uploads
time=2025-05-01T17:11:30.902+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.02"
time=2025-05-01T17:11:31.153+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.04"
time=2025-05-01T17:11:31.403+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.06"
time=2025-05-01T17:11:31.653+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.07"
time=2025-05-01T17:11:31.904+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.10"
load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0
time=2025-05-01T17:11:32.154+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.12"
time=2025-05-01T17:11:32.404+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.14"
time=2025-05-01T17:11:32.654+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.15"
time=2025-05-01T17:11:32.905+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.18"
time=2025-05-01T17:11:33.155+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.20"
time=2025-05-01T17:11:33.406+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.22"
time=2025-05-01T17:11:33.656+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.24"
time=2025-05-01T17:11:33.907+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.26"
time=2025-05-01T17:11:34.157+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.28"
time=2025-05-01T17:11:34.407+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.30"
time=2025-05-01T17:11:34.657+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.32"
time=2025-05-01T17:11:34.908+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.34"
time=2025-05-01T17:11:35.158+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.36"
time=2025-05-01T17:11:35.409+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.38"
time=2025-05-01T17:11:35.659+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.41"
time=2025-05-01T17:11:35.910+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.42"
time=2025-05-01T17:11:36.161+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.45"
time=2025-05-01T17:11:36.411+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.46"
time=2025-05-01T17:11:36.662+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.48"
time=2025-05-01T17:11:36.912+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.51"
time=2025-05-01T17:11:37.162+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.52"
time=2025-05-01T17:11:37.413+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.54"
time=2025-05-01T17:11:37.663+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.57"
time=2025-05-01T17:11:37.913+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.58"
time=2025-05-01T17:11:38.164+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.61"
time=2025-05-01T17:11:38.415+02:00 level=DEBUG source=server.go:625 msg="model load progress 0.62"
load_all_data: using async uploads for device CUDA1, buffer type CUDA1, backend CUDA1
CUDA error: operation not supported
  current device: 1, in function ggml_backend_cuda_event_record at C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\ggml-cuda.cu:2839
  cudaEventRecord((cudaEvent_t)event->context, cuda_ctx->stream())
C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\ggml-cuda.cu:75: CUDA error
time=2025-05-01T17:11:38.866+02:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server not responding"
time=2025-05-01T17:11:39.483+02:00 level=ERROR source=server.go:449 msg="llama runner terminated" error="exit status 0xc0000409"
time=2025-05-01T17:11:39.567+02:00 level=ERROR source=sched.go:457 msg="error loading llama server" error="llama runner process has terminated: CUDA error"
time=2025-05-01T17:11:39.567+02:00 level=DEBUG source=sched.go:460 msg="triggering expiration for failed load" model=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
time=2025-05-01T17:11:39.567+02:00 level=DEBUG source=sched.go:362 msg="runner expired event received" modelPath=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
time=2025-05-01T17:11:39.568+02:00 level=DEBUG source=sched.go:377 msg="got lock to unload" modelPath=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
[GIN] 2025/05/01 - 17:11:39 | 500 | 13.7899546s | 127.0.0.1 | POST "/api/generate"
time=2025-05-01T17:11:39.568+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.8 GiB" before.free_swap="128.1 GiB" now.total="127.9 GiB" now.free="120.6 GiB" now.free_swap="127.9 GiB"
time=2025-05-01T17:11:39.589+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:39.609+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="219.4 MiB"
releasing nvml library
time=2025-05-01T17:11:39.636+02:00 level=DEBUG source=server.go:1001 msg="stopping llama server"
time=2025-05-01T17:11:39.636+02:00 level=DEBUG source=sched.go:382 msg="runner released" modelPath=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
time=2025-05-01T17:11:39.864+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.6 GiB" before.free_swap="127.9 GiB" now.total="127.9 GiB" now.free="120.6 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:39.880+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:39.895+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:40.113+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.6 GiB" before.free_swap="128.0 GiB" now.total="127.9 GiB" now.free="120.7 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:40.130+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:40.144+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:40.364+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.7 GiB" before.free_swap="128.0 GiB" now.total="127.9 GiB" now.free="120.7 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:40.382+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:40.398+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:40.613+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.7 GiB" before.free_swap="128.0 GiB" now.total="127.9 GiB" now.free="120.7 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:40.643+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:40.653+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:40.863+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.7 GiB" before.free_swap="128.0 GiB" now.total="127.9 GiB" now.free="120.7 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:40.889+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:40.908+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:41.113+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.7 GiB" before.free_swap="128.0 GiB" now.total="127.9 GiB" now.free="120.7 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:41.140+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:41.156+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:41.364+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.7 GiB" before.free_swap="128.0 GiB" now.total="127.9 GiB" now.free="120.7 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:41.380+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:41.395+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:41.613+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.7 GiB" before.free_swap="128.0 GiB" now.total="127.9 GiB" now.free="120.7 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:41.639+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:41.655+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:41.864+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.7 GiB" before.free_swap="128.0 GiB" now.total="127.9 GiB" now.free="120.7 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:41.888+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:41.908+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:42.113+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.7 GiB" before.free_swap="128.0 GiB" now.total="127.9 GiB" now.free="120.7 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:42.131+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:42.143+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:42.364+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.7 GiB" before.free_swap="128.0 GiB" now.total="127.9 GiB" now.free="120.7 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:42.381+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:42.397+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:42.613+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.7 GiB" before.free_swap="128.0 GiB" now.total="127.9 GiB" now.free="120.7 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:42.641+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:42.657+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:42.864+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.7 GiB" before.free_swap="128.0 GiB" now.total="127.9 GiB" now.free="120.7 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:42.889+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:42.905+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:43.113+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.7 GiB" before.free_swap="128.0 GiB" now.total="127.9 GiB" now.free="120.7 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:43.129+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:43.145+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:43.363+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.7 GiB" before.free_swap="128.0 GiB" now.total="127.9 GiB" now.free="120.7 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:43.389+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:43.405+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:43.613+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.7 GiB" before.free_swap="128.0 GiB" now.total="127.9 GiB" now.free="120.7 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:43.638+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:43.654+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:43.863+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.7 GiB" before.free_swap="128.0 GiB" now.total="127.9 GiB" now.free="120.7 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:43.891+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:43.907+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:44.113+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.7 GiB" before.free_swap="128.0 GiB" now.total="127.9 GiB" now.free="120.7 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:44.130+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:44.154+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:44.364+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.7 GiB" before.free_swap="128.0 GiB" now.total="127.9 GiB" now.free="120.7 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:44.384+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:44.403+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:44.613+02:00 level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.0455091 model=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
time=2025-05-01T17:11:44.613+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.7 GiB" before.free_swap="128.0 GiB" now.total="127.9 GiB" now.free="120.7 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:44.613+02:00 level=DEBUG source=sched.go:386 msg="sending an unloaded event" modelPath=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
time=2025-05-01T17:11:44.614+02:00 level=DEBUG source=sched.go:310 msg="ignoring unload event with no pending requests"
time=2025-05-01T17:11:44.638+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:44.654+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:44.864+02:00 level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.2955851 model=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
time=2025-05-01T17:11:44.864+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="127.9 GiB" before.free="120.7 GiB" before.free_swap="128.0 GiB" now.total="127.9 GiB" now.free="120.7 GiB" now.free_swap="128.0 GiB"
time=2025-05-01T17:11:44.881+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-962a842b-b382-6457-65a1-3cffec62ba6f name="NVIDIA RTX A2000 12GB" overhead="854.4 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="145.0 MiB"
time=2025-05-01T17:11:44.896+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8c1fc6fe-a2c4-dd9a-559c-4f58836fd385 name="Quadro M5000" overhead="693.7 MiB" before.total="8.0 GiB" before.free="7.1 GiB" now.total="8.0 GiB" now.free="7.1 GiB" now.used="215.1 MiB"
releasing nvml library
time=2025-05-01T17:11:45.113+02:00 level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.5452212 model=C:\Ollama_Models\blobs\sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
```
@neoxeo commented on GitHub (May 1, 2025):

Sorry, the driver was not the right one after swapping the 2 cards:
Before: 576.02
After: 472.47 (the default Windows NVIDIA driver)

I'll update it and try again.
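
Since the two cards ended up on different driver versions after the swap, one quick sanity check is to compare the CUDA version the installed driver supports against the runtime version. A small hypothetical probe (not ollama code; the file name is invented):

```cpp
// check_versions.cu -- hypothetical diagnostic, not ollama code.
// Prints the CUDA version the installed driver supports alongside the
// runtime version the binary was built against; a driver older than the
// runtime is a common source of "operation not supported"-style errors.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int driver = 0, runtime = 0;
    cudaDriverGetVersion(&driver);    // e.g. 12040 for CUDA 12.4
    cudaRuntimeGetVersion(&runtime);
    printf("driver supports CUDA %d.%d, runtime is CUDA %d.%d\n",
           driver / 1000, (driver % 1000) / 10,
           runtime / 1000, (runtime % 1000) / 10);
    return 0;
}
```

Comparing these two numbers after each driver update (build with `nvcc check_versions.cu -o check_versions`) is cheaper than re-running a full 18 GB model load.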

<!-- gh-comment-id:2845060407 --> @neoxeo commented on GitHub (May 1, 2025): Sorry, drivers is not the good one after swaped 2 cards : Before : 576.02 After : 472.47 (Windows Nvidia Default Drivers) I update them and try again

@neoxeo commented on GitHub (May 1, 2025):

After updating the drivers I get the same errors as before swapping the 2 cards (with CUDA_VISIBLE_DEVICES="0,1" or CUDA_VISIBLE_DEVICES="1,0").

I need to find a new graphics card...

Thank you very much @rick-github for your help, even if the problem isn't solved!
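
For what it's worth, swapping the order in CUDA_VISIBLE_DEVICES only renumbers the devices; if the failure comes from cross-device synchronization in the multi-GPU upload path, both orders will hit it. The sketch below is speculative: it assumes, without confirmation from the ollama source, that the failing pattern involves recording an event tied to one device on a stream owned by the other, and probes exactly that:

```cpp
// cross_event.cu -- speculative sketch; assumes the failure involves
// cross-device event recording. Not taken from the ollama source.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int a = 0; a < count; ++a) {
        for (int b = 0; b < count; ++b) {
            if (a == b) continue;
            int peer = 0;
            cudaDeviceCanAccessPeer(&peer, a, b);

            cudaSetDevice(a);                 // event lives on device a
            cudaEvent_t ev;
            cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);

            cudaSetDevice(b);                 // stream lives on device b
            cudaStream_t st;
            cudaStreamCreate(&st);

            // Recording across devices may legitimately fail; we only report it.
            cudaError_t err = cudaEventRecord(ev, st);
            printf("peer %d->%d: %s, record(event@%d, stream@%d): %s\n",
                   a, b, peer ? "yes" : "no", a, b, cudaGetErrorString(err));

            cudaStreamDestroy(st);
            cudaSetDevice(a);
            cudaEventDestroy(ev);
        }
    }
    return 0;
}
```

If every pair involving the M5000 reports an error while other pairs succeed, that would support replacing the card rather than chasing drivers further.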

<!-- gh-comment-id:2845098828 --> @neoxeo commented on GitHub (May 1, 2025): After update drivers I have same errors before I swap 2 cards (with CUDA_VISIBLE_DEVICES="0,1" or CUDA_VISIBLE_DEVICES="1,0") Need to find a new graphic card... Thank you very much @rick-github for your help even if problem not solve !

@neoxeo commented on GitHub (May 2, 2025):

I'll try updating to version 0.6.7, though without much conviction.

<!-- gh-comment-id:2846438592 --> @neoxeo commented on GitHub (May 2, 2025): I try to update to 0.6.7 version with no conviction.

@neoxeo commented on GitHub (May 2, 2025):

Same errors with 0.6.7.

Next step: find an old RTX 3060 12GB to replace my M5000.

<!-- gh-comment-id:2846600921 --> @neoxeo commented on GitHub (May 2, 2025): Same errors with 0.6.7 Next step, find an old 3060 12gb to replace my M5000.
Reference: github-starred/ollama#53432